
Scientists Crack the Rules of Gene Regulation

Your DNA contains roughly 20,000 genes, each one a recipe for building proteins. But here’s the thing we didn’t understand until recently: how does a cell know which recipes to use, when to use them, and at what intensity? The instructions exist somewhere in the vast stretches of DNA between genes—regulatory sequences that act as the genome’s operating system. Only, we couldn’t read them.

That changed last week. Within days of each other, two research teams published AI models capable of predicting how genetic mutations affect gene regulation. One, from Google DeepMind, is a computational colossus that ingests a full megabase of DNA at a time. The other, from Dutch researchers, requires only a petri dish of cells and a day’s computing time. Together, they mark the moment when we finally learned to read the genome’s control code.

The problem they’ve cracked is deceptively simple. We’ve known the genetic code—how DNA spells out proteins—since the 1960s. But most cancer-causing mutations aren’t in genes themselves. They’re in the regulatory regions, the switches and dials that control gene activity. “The classical genetic code explains how genes in our DNA encode proteins,” says Bas van Steensel at the Netherlands Cancer Institute. “But for most genes, we honestly didn’t understand how they are regulated.”

Between your genes lie promoters, enhancers, silencers—regulatory elements that together decide whether a gene activates, in which cell type, and how strongly. A mutation might strengthen an enhancer, flooding a cell with an oncogene. Or it might disrupt a promoter, silencing a tumor suppressor. Until now, predicting these effects meant painstaking lab work, testing mutations one at a time.

AlphaGenome, published in Nature last Wednesday, takes the maximalist approach. Feed it a million base pairs of DNA sequence—about 0.03 percent of your genome—and it predicts nearly 6,000 different measurements across human cells: gene expression levels, splice sites where RNA transcripts are cut and rejoined, regions where chromosomes fold together, spots where proteins bind to DNA. The model outperformed existing tools on 25 of 26 variant prediction tasks, from splicing defects to changes in gene expression.
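To make the variant-scoring idea concrete, here is a minimal Python sketch of how sequence-to-function models are typically queried: predict regulatory tracks for the reference sequence and for the mutated sequence, then compare the two. The `predict_tracks` stand-in, the track count, and the bin size below are illustrative placeholders, not AlphaGenome’s actual interface.

```python
import numpy as np

# Conceptual sketch only: 'predict_tracks' stands in for a trained
# sequence-to-function model such as AlphaGenome, which maps a DNA window
# to thousands of predicted regulatory signals (expression, accessibility,
# protein binding, ...). Here it returns random values so the sketch runs.
N_TRACKS = 16          # a real model predicts thousands of tracks
BIN_SIZE = 128         # predictions are binned along the sequence

def predict_tracks(sequence: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(sequence)) % 2**32)
    return rng.random((N_TRACKS, len(sequence) // BIN_SIZE))

def score_variant(ref_seq: str, pos: int, alt_base: str) -> np.ndarray:
    """Score a single-nucleotide variant as the difference between model
    predictions for the mutated and reference sequences: the usual
    'alt minus ref' scheme for sequence-to-function models."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict_tracks(alt_seq) - predict_tracks(ref_seq)

window = "ACGT" * 1_000                     # toy 4 kb window around a gene
effects = score_variant(window, pos=2_000, alt_base="G")
print(effects.shape)                        # one effect profile per track
```

The appeal of this scheme is that a single trained model scores any mutation in silico: no new experiment is needed per variant, only two forward passes and a subtraction.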

It’s the AI equivalent of satellite imagery: comprehensive, extraordinarily detailed, and requiring Google-scale computing resources to run. The architecture spans eight interconnected processors working in parallel, combining convolutional layers for local sequence patterns with transformer blocks for long-range interactions between distant regulatory elements.
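For readers who think in code, here is a minimal PyTorch sketch of that hybrid pattern: convolutions pick up local sequence motifs, then transformer blocks let distant positions, such as an enhancer and its promoter, attend to each other. The layer sizes, pooling factor, and output shapes are assumptions chosen for illustration, not AlphaGenome’s actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative hybrid architecture: convolution for local motifs,
# self-attention for long-range regulatory interactions.
class ConvTransformerSketch(nn.Module):
    def __init__(self, channels: int = 128, n_tracks: int = 16):
        super().__init__()
        self.conv = nn.Sequential(                 # local motif detectors
            nn.Conv1d(4, channels, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.MaxPool1d(8),                       # coarsen the sequence axis
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=4, batch_first=True
        )
        self.attn = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(channels, n_tracks)  # per-position track outputs

    def forward(self, one_hot_dna: torch.Tensor) -> torch.Tensor:
        # one_hot_dna: (batch, 4, length), one-hot encoded A/C/G/T
        x = self.conv(one_hot_dna).transpose(1, 2)  # (batch, length/8, channels)
        x = self.attn(x)                            # long-range interactions
        return self.head(x)                         # (batch, length/8, n_tracks)

model = ConvTransformerSketch()
dna = torch.zeros(1, 4, 4096)   # toy 4 kb input window
print(model(dna).shape)         # torch.Size([1, 512, 16])
```

Pooling after the convolutions is what makes attention over a long window affordable: the transformer operates on a coarsened representation rather than on every individual base.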

Perhaps most impressively, AlphaGenome can score a variant’s effects across multiple mechanisms at once—showing, for instance, how a single mutation near the TAL1 oncogene simultaneously disrupts transcription factor binding, alters chromatin accessibility, and changes histone modification patterns. These multimodal predictions successfully recapitulated the mechanisms of cancer-causing mutations characterized in acute lymphoblastic leukemia patients.

Enter PARM, from van Steensel’s lab and collaborators across the Netherlands. Also published this week, it takes the exact opposite approach: lightweight, targeted, experimentally driven. “Most AI models learn from whatever data happens to exist,” explains Jeroen de Ridder at UMC Utrecht. “Here, the measurements and the AI were designed together.”

The difference is computational efficiency. While AlphaGenome requires enormous processing power, PARM was explicitly designed for academic labs without supercomputers. The team developed a technology that generates millions of carefully controlled measurements of how DNA sequences influence gene activity, then trained AI models specifically on that data. The result is a tool that runs on ordinary computers, processes single cell types, and delivers results quickly enough for clinical applications.
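As a rough illustration of that design-the-data-and-model-together idea, the sketch below trains a tiny cell-type-specific model directly on pairs of DNA fragments and measured activities, the kind of paired data a massively parallel reporter experiment produces. Everything here, from the model size to the toy measurements, is a hypothetical stand-in rather than PARM’s actual code.

```python
import torch
import torch.nn as nn

def one_hot(seq: str) -> torch.Tensor:
    """Encode a DNA string as a (4, length) one-hot tensor."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[idx[base], i] = 1.0
    return x

class SmallActivityModel(nn.Module):
    """Tiny regressor: DNA fragment -> one scalar of predicted activity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, 1),
        )
    def forward(self, x):  # x: (batch, 4, length)
        return self.net(x).squeeze(-1)

# Toy training loop on made-up measurements; a real run would use
# millions of measured fragments from one cell type of interest.
fragments = ["ACGTACGTACGTACGT" * 10, "TTTTACGCGCGCATAT" * 10]
activities = torch.tensor([2.3, 0.4])   # e.g. log reporter expression
xs = torch.stack([one_hot(s) for s in fragments])

model = SmallActivityModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xs), activities)
    loss.backward()
    opt.step()
```

Because each model only has to learn the regulatory logic of a single cell type from purpose-built measurements, it can stay small enough to train and run on an ordinary workstation.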

Van Steensel is diplomatic about the comparison. “This is a great model,” he says of AlphaGenome. “However, PARM is more flexible and it is experimentally and computationally lightweight.” The numbers are striking: PARM requires roughly a thousand times less computing power than AlphaGenome. “With this model you only need one petri dish of cells and one day of computing to see in detail how a particular cell type, such as a tumor cell, uses its DNA code to respond to a signal such as a hormone, nutrient or drug.”

There’s a deeper contrast here, beyond computational resources. AlphaGenome aims for comprehensive coverage—predicting everything about any DNA sequence you feed it. PARM is customizable—train it on your specific cell type, your particular research question, your tumor cells. One model for all contexts versus many models, each optimized for specific applications.

Both approaches were subjected to rigorous testing, but in rather different ways. AlphaGenome’s creators at DeepMind evaluated their model across 26 distinct variant prediction tasks, comparing against the strongest available method for each. On quantitative trait loci—genetic variants associated with measurable molecular changes—AlphaGenome improved accuracy by 25.5 percent for expression effects and 8 percent for accessibility changes.

For splicing predictions, where mutations alter how genes are edited into mature RNA, AlphaGenome achieved the highest performance on six of seven benchmarks, including both supervised and unsupervised prediction of rare variants that disrupt splicing. The model accurately predicted effects on alternative polyadenylation, the process that determines where RNA molecules end—a mechanism that influences RNA stability and can contribute to disease when disrupted.

The Dutch team’s testing strategy differed. PARM’s predictions are checked directly against experimental measurements, ensuring the model genuinely captures biological reality rather than statistical artifacts. This iterative approach—predict, test, refine, repeat—built confidence that PARM’s regulatory grammar reflects actual mechanisms.

Neither model is perfect. AlphaGenome, for all its power, struggles with very distant regulatory elements beyond 100 kilobases from target genes. Accurately predicting tissue-specific effects remains challenging. The model was trained primarily on protein-coding genes, leaving coverage gaps for microRNAs and other non-coding RNA.

PARM’s limitations are different. Its efficiency comes from specialization: you must train separate models for different cell types. The tool also can’t yet handle personal genome prediction across all the variants an individual carries, a known weakness of sequence-to-function models.

What neither limitation undermines is the fundamental achievement: reading the regulatory code. “We can now actually read the language of the gene control system,” says van Steensel. “Our PARM model allows us to uncover these rules at scale, so we can now understand, and even predict, how regulatory DNA controls gene activity.”

The implications ripple outward. Most genetic variants identified in disease studies—thousands of them—sit in regulatory regions. We’ve known they matter; we couldn’t predict how. Now we can score their effects on chromatin accessibility, transcription factor binding, gene expression. Each prediction becomes a hypothesis for experimental follow-up.

Cancer diagnostics gain a new tool for patient stratification, grouping tumors not just by mutated genes but by their regulatory disruptions. Rare disease diagnosis expands beyond protein-coding mutations to regulatory variants of uncertain significance. Therapeutic development, from antisense oligonucleotides to enhancer-targeted treatments, can be designed in silico before expensive lab validation.

There’s something almost ironic about the timing. Google published AlphaGenome mere days before the Dutch team’s PARM paper appeared, creating an accidental scientific photo finish. One represents corporate AI at its most ambitious—massive data, enormous models, comprehensive predictions. The other embodies academic innovation: clever experimental design, efficiency through specialization, tools built for widespread use.

We don’t have to choose. Rare disease geneticists might start with AlphaGenome’s broad predictions, then turn to PARM for detailed analysis of candidate variants in patient-derived cells. Cancer researchers could use AlphaGenome to survey regulatory landscapes, then deploy PARM for mechanism-specific follow-up. The tools are complementary, different windows into the same regulatory code.

About 98 percent of human genetic variation is non-coding, sitting in those mysterious regulatory regions we couldn’t interpret. We’ve sequenced genomes for two decades while remaining largely illiterate in the genome’s control language. Now, suddenly, we’re reading it—not perfectly, not completely, but well enough to start making sense of variants that have puzzled us for years.

The genome contains two codes, layered atop each other. The genetic code, cracked in the sixties, tells us what proteins cells can build. The regulatory code, finally yielding to AI, tells us when and where those proteins actually get made. It took sixty years between the two breakthroughs, which says something about the second code’s complexity. But perhaps also about needing the right tools for the job: not just sequencers and biochemistry, but machine learning trained on millions of regulatory measurements.

Van Steensel’s team collaborated across seven research groups within the Oncode Institute in the Netherlands—a reminder that even in the age of Google’s AI dominance, academic consortia remain formidable. They’ve made PARM accessible to researchers worldwide, betting that distributed innovation will outpace centralized development. Google, meanwhile, provides AlphaGenome through an online API, democratizing access to predictions that would otherwise require supercomputers.
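Programmatic access to a hosted model typically looks like the pattern below. To be clear, the endpoint URL, payload fields, and response format here are hypothetical placeholders for illustration; the real AlphaGenome API has its own documented client library and terms.

```python
import requests

# Illustrative only: a generic pattern for querying a hosted variant-effect
# prediction service. Everything below is a placeholder, not the real
# AlphaGenome API; a real call needs the genuine endpoint and an API key.
API_URL = "https://example.com/v1/predict_variant"   # placeholder endpoint

payload = {
    "interval": {"chromosome": "chr1", "start": 1_000_000, "end": 2_000_000},
    "variant": {"position": 1_500_000, "ref": "A", "alt": "G"},
    "outputs": ["expression", "accessibility", "splicing"],
}
response = requests.post(API_URL, json=payload, timeout=60)
response.raise_for_status()
scores = response.json()   # e.g. per-track effect scores for the variant
```

The practical upshot is that the heavy model never leaves the data center: a researcher sends a variant and a genomic window, and gets back effect scores without owning any specialized hardware.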

The next frontier is obvious: personal genome interpretation. We can sequence an individual’s complete DNA for roughly $500 now. What we can’t do reliably is interpret all those regulatory variants each person carries—hundreds of thousands of them, most with unknown effects. Models like AlphaGenome and PARM bring that goal closer, though significant challenges remain around accuracy, calibration, and understanding gene-to-disease connections beyond molecular effects.

There’s also the question of what happens when regulatory predictions get integrated with protein structure prediction, drug design AI, and other biological machine learning tools. We’re assembling a computational microscope for looking at biological systems, one lens at a time. The regulatory code was a crucial missing piece.

For now, though, there’s something satisfying about closing a loop opened in 1953, when Watson and Crick described DNA’s structure. They recognized immediately that base pairing suggested a copying mechanism—how genetic information gets transmitted. But the control mechanism, how cells regulate which genes to use, remained opaque for another seven decades. With these new tools, we’re finally learning to read both codes our genome writes in.

Study link: https://www.nature.com/articles/s41586-025-10014-0



