DNA Has a Language. A 40-Billion-Parameter Model Has Now Learned to Speak It

All of life, every bacterium, every oak tree, every nervous system that has ever fired, encodes its instructions in the same four-letter alphabet. We have been reading that alphabet, haltingly, for decades. Now a team of researchers at Arc Institute, Stanford, UC Berkeley and UCSF has built something that might, in a limited but genuinely significant sense, understand it.

The model is called Evo 2. Published this week in Nature, it was trained on 9 trillion DNA base pairs drawn from across all domains of life: bacteria, archaea, fungi, plants, animals, humans, and the viruses that infect them (or most of them; more on that). At 40 billion parameters, it’s one of the largest fully open biological AI models ever released. What sets it apart from earlier genomic models is not just scale but scope: Evo 2 reads DNA the way a language model reads text, learning statistical patterns across the full diversity of life rather than within a single species or a single molecular register.

The practical upshot, and it’s worth sitting with, is that Evo 2 can make predictions about genetic variants it has never been explicitly trained to assess. Clinicians trying to interpret an unfamiliar BRCA1 mutation, for instance, currently rely on supervised tools that must be trained on labelled datasets for each gene. Evo 2 does something different. Presented with a variant in a noncoding region of BRCA1, a stretch of DNA that most older tools struggle with entirely, it can estimate whether the change is likely harmful, drawing on patterns it absorbed from genomes across the whole tree of life. On noncoding BRCA1 variants it outperformed every model tested, including supervised predictors specifically built for splicing. A classifier trained on Evo 2 embeddings achieved an AUROC of 0.95 on the BRCA1 test set. That’s not clinical deployment; that’s a laboratory benchmark. But it’s a compelling one.
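For readers unfamiliar with the metric: AUROC is the probability that a classifier scores a randomly chosen harmful variant above a randomly chosen benign one, so 0.5 is coin-flipping and 1.0 is perfect separation. The sketch below is a minimal illustration of that definition, not the paper's evaluation code; the function name `auroc` and the toy labels and scores are invented for the example.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # harmful variant correctly scored higher
            elif p == n:
                wins += 0.5   # a tie is worth half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, NOT from the paper: 1 = harmful, 0 = benign.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
```

On these six made-up variants the function returns 8/9, about 0.89: eight of the nine harmful-benign pairs are ranked correctly, and the one miss is the harmful variant scored 0.4 sitting below the benign variant scored 0.5.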

More striking than the clinical angle is what the model seems to have learned without being told.

The team applied a technique called sparse autoencoder analysis to probe what’s going on inside the model’s representations. What they found, roughly, was that Evo 2 had grown internal features corresponding to actual biological structures: exon-intron boundaries, transcription factor binding sites, protein secondary structures, the signature of prophage DNA inserted into bacterial genomes. One feature activated specifically on spacer sequences within CRISPR arrays, the genetic “memory” bacteria use to recognise previous viral invaders. The model had evidently worked out, from sequence statistics alone, that those spacers are related to phage DNA. Nobody told it. These features also transferred across species; trained partly on primate and mouse genomes, the same features correctly annotated a gene in a woolly mammoth.

In-context learning was another surprise. This property, the ability to use examples embedded in a prompt to improve predictions without any retraining, was long assumed to be a quirk of models trained on human language, something about the recursive, referential structure of text. Evo 2, trained on genomes and nothing else, also does it. It outperformed similarly scaled general-purpose language models on in-context learning benchmarks. Which is, if you think about it for a moment, a rather destabilising result for how we understand what language is.

The generative side of the work is where things get perhaps a little vertiginous. Prompted with a few thousand base pairs from the human mitochondrial genome, Evo 2 generated more than 250 synthetic 16-kilobase sequences. When annotated with standard genomic tools, these contained the correct complement of coding sequences, transfer RNA genes and ribosomal RNA genes expected in human mitochondria. The team also prompted the model with a segment from Mycoplasma genitalium, a bacterium with one of the smallest known genomes, and generated ten synthetic 580-kilobase sequences. Nearly 70 percent of annotated genes in those synthetic genomes matched real protein families in databases. The previous Evo 1 model managed 18 percent.

None of this means the synthesised genomes would replicate if you put them in a cell. Probably they wouldn’t. Evaluating artificial genomes requires laborious, expensive laboratory work, and nobody has yet tested an Evo 2-generated chromosome for actual viability. What has been experimentally validated is something more targeted and, in its way, oddly poetic. Using Evo 2 in combination with chromatin accessibility prediction models, the team designed custom DNA sequences and inserted them into mouse embryonic stem cells, then measured where the DNA opened up for transcription. The designed sequences spelt out Morse code messages in the cell’s own chromatin. “ARC”, “EVO2”, “LO.” The predicted and experimentally measured patterns matched with AUROCs of 0.92 to 0.95. It’s a proof of concept, not a therapy. But writing arbitrary patterns into the epigenome of living cells, at kilobase scale, with this level of accuracy, is not nothing.

There is also the bacteriophage work. Antibiotic resistance is a slowly worsening crisis: bacteria evolve resistance to drugs fast, and the pipeline of new antibiotics is thin. Phage therapy, which uses viruses that target specific bacteria, is a potential alternative, but designing custom phages is currently a slow, expensive, largely artisanal process. The team fine-tuned Evo 2 on thousands of phage genomes and generated 285 synthetic designs. Sixteen of them successfully replicated and suppressed the target bacteria while leaving unrelated bacterial strains unharmed. About 5.6 percent. Not, on its face, an impressive conversion rate, but the designs had never existed before, and the model that produced them had been trained for a matter of months.

The obvious concern with a model that fluently speaks the language of genomes is biosecurity. Specifically: could someone use it to design dangerous viruses? The team addressed this directly, and their approach is worth understanding. Genomic sequences from viruses that infect eukaryotes were excluded from Evo 2’s training data, which the researchers then verified had the intended effect: the model performs poorly on eukaryotic viral sequences, no better at certain tasks than random chance. Red-teaming attempts to elicit pathogenic viral proteins produced sequences that were, in the team’s own characterisation, “effectively random.” This is a meaningful mitigation but not a permanent guarantee; as the researchers acknowledge themselves, model alignment for biosafety in biological foundation models will require ongoing, active effort as these systems grow more capable.

All of this, the model weights, the training code, the OpenGenome2 dataset of 8.8 trillion nucleotides, has been released publicly. More than 88,000 downloads on GitHub in the year since the preprint appeared. Over 8 million API requests. The decision to open-source a 40-billion-parameter biological foundation model is genuinely unusual, and slightly vertiginous in its own right. The standard argument is that open access accelerates science; the counterargument is that it accelerates everything. In this case the team has wagered that the benefits of open science outweigh the risks, having taken meaningful steps to constrain the most obvious misuse vectors. We’ll see.

What’s harder to argue with is that something has changed. For sixty years, genomics has been primarily a science of reading: sequencing, annotating, cataloguing variation. Evo 2, even in this early, imperfect form, points toward something more like authorship.

DOI: https://doi.org/10.1038/s41586-026-10176-5
