An AI Trained on Fake Genomes Can Read the Genealogy Written in Real Ones

Run an AI model across a human chromosome and what you get back looks like a seismograph readout: peaks and troughs, quiet stretches interrupted by jagged eruptions of ancient time. The quiet zones are where evolution swept through recently, pushing all the variants in a population toward a single common ancestor. The peaks are where something else happened, something older and stranger, where competing versions of a gene have been kept alive for tens of millions of years, locked in evolutionary stalemate. The model reads all of this in minutes.

That model is called cxt, and it represents something genuinely new in population genetics: the first language model designed not to predict text, but to predict time. Specifically, the moment at which any two stretches of DNA last shared a common ancestor.

The architecture will be familiar to anyone who has followed the rise of large language models. Andrew Kern and his colleagues at the University of Oregon started with GPT-2, an older transformer model from the same family as the systems behind ChatGPT, and stripped out everything that makes it useful for predicting words. In its place, they fed it something else entirely: millions of simulated genomes, generated under explicit evolutionary models spanning bacteria, rodents, mosquitoes, and primates. The model learned, across all those fabricated histories, to recognize the mutational fingerprints that accumulate along DNA when lineages diverge, recombine, and drift apart over time. Genomes really do work like language in one key respect. The four-letter alphabet of DNA, A, T, C, and G, accumulates typos across generations, and those typos leave a trail that encodes ancestry.

Learning the Grammar of Ancestry

The task the team set for the model is called next-coalescence prediction. As the model scans a chromosome from left to right, it asks at each window: given the pattern of mutations visible here, how long ago did this pair of sequences last share a common ancestor? It answers by producing a probability distribution, sampling from that distribution, then feeding the result back into its own context before moving to the next window. The whole process is autoregressive, meaning each prediction shapes the next, the same basic trick that keeps large language models coherent across a whole paragraph rather than just from one word to the next.
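For readers who think in code, the predict, sample, and feed-back loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not the cxt implementation: `toy_predict` is a hypothetical stand-in for the trained transformer (it simply scores each time bin by a Poisson likelihood and leans toward the previous window's answer), and the time bins, mutation rate, and window size are invented for illustration.

```python
import math
import random

# Hypothetical log-spaced time bins, in generations (10 up to 1,000,000).
TIME_BINS = [10 ** (1 + 0.5 * i) for i in range(11)]


def toy_predict(mutations, context, mu=1e-8, window_bp=2000):
    """Stand-in for the trained model: returns one probability per time bin."""
    scores = []
    for i, t in enumerate(TIME_BINS):
        lam = 2 * mu * window_bp * t  # expected pairwise differences at time t
        # Poisson log-likelihood of the observed mutation count.
        score = mutations * math.log(lam) - lam - math.lgamma(mutations + 1)
        if context:
            # Toy use of context: favour bins near the previous window's sample.
            score -= 0.5 * abs(i - context[-1])
        scores.append(score)
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def scan(mutation_counts, seed=0):
    """Left-to-right scan: predict a distribution, sample, feed the sample back."""
    rng = random.Random(seed)
    context = []  # bin indices sampled so far
    for k in mutation_counts:
        probs = toy_predict(k, context)
        idx = rng.choices(range(len(TIME_BINS)), weights=probs)[0]
        context.append(idx)  # the sample becomes context for the next window
    return [TIME_BINS[i] for i in context]


if __name__ == "__main__":
    # Mutation counts per window along an imaginary chromosome.
    print(scan([0, 0, 1, 5, 40, 38, 2, 0]))
```

The point of the sketch is the shape of the loop: each sampled time is appended to the context, so the guess for one window conditions the guess for the next.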

In test after test, cxt matched the accuracy of methods that population geneticists have spent decades refining, approaches built on careful statistical reasoning about genealogical relationships in recombining genomes. “You never really know what’s going to work when you’re essentially borrowing techniques from a totally different world and applying them to a new problem,” said Kern, an Evergreen professor of biology. “But this was a case where things worked really well.”

Speed is where the comparison gets uncomfortable for classical methods. A single mosquito chromosome that might take hours or even days with likelihood-based inference takes cxt minutes on a standard GPU. The reason is structural. “Compared to classical inferential approaches, the AI tool doesn’t have to reason about every mutation individually,” explained Kevin Korfmann, lead author and former postdoctoral researcher in Kern’s lab. “It just reads the patterns because all of the expensive statistical work was done up front, during training, which sidesteps the bottleneck.”

This matters for a reason that isn’t immediately obvious. Classical statistical methods are, in a sense, genuinely intelligent: they reason about each new dataset from first principles, asking what combination of evolutionary forces could plausibly produce what they’re seeing. cxt does something more like recognition. It has processed so many simulated evolutionary scenarios during training that it matches observed patterns to learned templates without working through the logic anew. That sounds like cheating, until you examine what the model was trained on. Korfmann and colleagues used a community resource called stdpopsim, a standardised catalog of genome simulations spanning dozens of species and demographic histories, covering population bottlenecks, expansions, ancient divergences, and complex demographic shifts. The model’s implicit prior is not a guess; it’s a compression of much of what evolutionary biology knows about how genomes change. “We can’t repeat evolution,” Korfmann noted, “so one of the key workflows we have is developing simulations. The simulations mimic evolutionary processes, and then we use the outcomes as training data for our deep learning models.”
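The simulate-then-train workflow Korfmann describes can be illustrated with a deliberately tiny example. The sketch below does not use stdpopsim or its API; it relies only on the textbook result that, in a constant-size diploid population, the coalescence time of two sequences is exponentially distributed with a mean of 2N generations, and that mutations then accumulate as a Poisson process along both lineages. The population sizes, mutation rate, window length, and function names are all assumptions chosen for illustration.

```python
import random


def poisson(rng, lam):
    """Draw a Poisson count by counting rate-1 arrivals in the interval [0, lam]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_pair(rng, n_e=10_000, mu=1e-8, window_bp=100_000):
    """Return (pairwise differences, true coalescence time in generations)."""
    t = rng.expovariate(1.0 / (2 * n_e))  # mean 2 * n_e generations
    lam = 2 * mu * window_bp * t          # mutations fall on both lineages
    return poisson(rng, lam), t


def make_training_set(n, seed=0):
    """Labelled examples drawn across several hypothetical population sizes."""
    rng = random.Random(seed)
    scenarios = [1_000, 10_000, 100_000]  # illustrative effective sizes
    return [simulate_pair(rng, n_e=rng.choice(scenarios)) for _ in range(n)]


if __name__ == "__main__":
    for diffs, t in make_training_set(5):
        print(f"{diffs:5d} differences  <-  coalescence {t:12.0f} generations ago")
```

Each simulated pair comes with its true answer attached, which is what real genomes never provide. A model trained on enough such pairs, across enough scenarios, learns the mapping from mutation pattern to time without ever being told the underlying equations.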

Mosquitoes, Missing Data, and the Geography of Resistance

There are limits, naturally. When cxt encounters species or parameter regimes outside its training distribution, accuracy degrades, sometimes substantially. The leading MCMC-based alternative, a method called Singer, still outperforms cxt in most out-of-sample tests and does so without needing fine-tuning. For researchers who aren’t constrained by computing time, it remains the more accurate choice.

Fine-tuning, though, is what makes the approach tractable at scale. When Kern’s team applied cxt to mosquito genetic data riddled with missing sequence, a chronic problem with field-collected specimens, they trained a lightweight adapter on simulations that explicitly mimicked the missingness patterns in the real data. The classical methods, meanwhile, generated spurious coalescence signals in exactly those missing regions, artefacts of miscalibrated mutation-rate expectations. cxt, having been trained to expect gaps, wasn’t fooled.
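A bit of arithmetic shows one way missing sequence can masquerade as recent ancestry. The sketch below is an illustration, not a description of how any particular classical method works: it uses the simple moment estimate T = k / (2 × mu × L), where k is the number of differences between two sequences, mu is a per-base mutation rate, and L is the window length. The function names and numbers are hypothetical.

```python
def naive_tmrca(diffs, mu, window_bp):
    """Estimate coalescence time assuming every base in the window was observed."""
    return diffs / (2 * mu * window_bp)


def masked_aware_tmrca(diffs, mu, window_bp, missing_fraction):
    """Same estimate, but scaled to the bases that were actually callable."""
    callable_bp = window_bp * (1 - missing_fraction)
    if callable_bp <= 0:
        return None  # nothing observed, so no estimate is possible
    return diffs / (2 * mu * callable_bp)


if __name__ == "__main__":
    mu, window_bp = 1e-8, 100_000
    # A pair that truly coalesced 20,000 generations ago is expected to differ
    # at 40 sites in this window; with half the window missing, about 20 remain.
    observed = 20
    print(naive_tmrca(observed, mu, window_bp))
    print(masked_aware_tmrca(observed, mu, window_bp, missing_fraction=0.5))
```

With half the window missing, the naive estimate comes out at about 10,000 generations, half the true value, while the estimate that accounts for the gap recovers roughly 20,000. Training on simulations with realistic gaps is a way of teaching a model that correction implicitly.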

The mosquito results are where the practical stakes become sharpest. “Insecticide resistance is being observed in all of these mosquito populations today,” Kern said. “A major challenge in preventing the spread of malaria has been understanding the evolution of insecticide resistance. Now, we can go in with our AI model, ask how long ago these resistance genes arose in the population, and learn about the evolutionary history of this critical carrier of malaria.” In the Anopheles data, cxt found precisely that signal at the Rdl locus, which confers resistance to dieldrin and related insecticides. Coalescence times at Rdl were strikingly recent in West African populations where resistance is already common, but indistinguishable from background in Uganda, where resistant alleles remain rare. The geography of an evolving epidemic, visible in the timing of a shared ancestor. Some of the youngest estimates at Rdl predate the mass use of insecticides by centuries, suggesting the resistance mutation was already circulating at low frequency before selection gave it lift.

The human genome data told a complementary story. At the LCT locus, where a mutation enabling adults to digest milk spread rapidly across European populations around 5,000 to 10,000 years ago, cxt recovered the expected trough in coalescence times with good precision. At the HLA region, which encodes immune genes under long-term balancing selection, it found the opposite: coalescence times for some gene pairs exceeding tens of millions of years, predating the split between humans and chimpanzees. The model swept across the full dynamic range of human evolutionary time in a single chromosome scan. “There’s so much going on in the machine learning field that we haven’t applied yet in our field,” Korfmann said. The stdpopsim catalog keeps expanding; every new species, every new demographic model, is more training data, a richer library of evolutionary pattern for the model to learn. The seismograph readout gets sharper with each addition.

DOI: 10.1073/pnas.2518956123

Frequently Asked Questions

Why can’t scientists just read DNA directly to figure out evolutionary history?

DNA sequences show you the current state of a genome, but not the historical branching structure that produced it. Reconstructing ancestry requires inferring an invisible family tree, called the ancestral recombination graph, from indirect evidence: patterns of shared mutations scattered across millions of base pairs. The challenge is that mutations accumulate noisily and recombination constantly shuffles genetic material, making the true history hard to recover without powerful statistical or machine learning methods.

How is this different from other biological AI tools like AlphaFold or Evo?

Most AI models in biology are trained to predict structure or function, such as the shape of a protein or the activity of a gene. cxt does something different: it ignores what the DNA does and focuses entirely on when it diverged. Rather than asking “what does this sequence encode?”, it asks “how long ago did these two sequences last share a common ancestor?” That makes it a tool for evolutionary history rather than functional genomics, and it’s the first language model designed specifically for that purpose.

Could this help track how antibiotic resistance spreads in bacteria?

In principle, yes. The method was already trained on bacterial simulations alongside mammals and mosquitoes, so the groundwork is there. Tracing when a resistance gene emerged in a bacterial population, and whether it spread once or arose multiple times independently, is exactly the kind of question cxt is designed to address. Whether the model’s accuracy holds under the specific genomic features of rapidly evolving bacteria would need to be tested, but it’s among the more tractable extensions on the horizon.

What’s stopping this from replacing traditional methods entirely?

Accuracy, in some settings. When applied to species or evolutionary scenarios that fall outside its training data, cxt’s performance declines more than that of classical likelihood-based methods, which can reason about novel scenarios from first principles. For well-resourced labs with time to run compute-intensive analyses, the classical approaches still have an edge. cxt’s advantage is speed and flexibility, particularly for large datasets, missing data, and genomic regions where traditional methods generate artefacts.

Why does it matter when insecticide resistance genes first appeared?

Insecticides have been used to control malaria-carrying mosquitoes for decades, but resistance has emerged repeatedly across Africa, often faster than surveillance can track. Knowing whether a resistance allele appeared once and spread, or arose independently in multiple populations, changes the strategy for containing it. If resistance evolved locally, slowing its spread within a region may be viable; if it’s being carried across populations by migration, the response needs to be coordinated at a larger scale. Coalescence timing gives public health researchers a way to reconstruct that history from existing genomic data.
