Anyone who has tried to organize years of digital photos knows the sinking feeling: storage full, and no clear way to sort through the mess. Geneticists face a version of this problem on a scale most of us cannot fathom. The cost of reading a genome has collapsed so fast that researchers now drown in data they can barely store, let alone study. The tools built to compare genetic sequences were designed for dozens of genomes, maybe hundreds. They choke when asked to handle millions.
The field of pangenomics depends on keeping all that information intact. Studying entire collections of genomes from a single species captures every twist of evolution, but the storage demands are punishing. Lose detail to save space, and you might miss the mutation that explains why one viral strain spreads faster than another. Keep everything, and your hard drives surrender.
Engineers at the University of California, San Diego, think they have found a way out. In research published this month in Nature Genetics, a team led by Yatish Turakhia describes a data structure called a Pangenome Mutation-Annotated Network, or PanMAN, that compresses genomic information by exploiting shared ancestry. Instead of storing each genome as a separate file, the system records a single ancestral sequence and notes mutations only once, at the exact point on the family tree where they first appeared.
Storing edits, not duplicates
The trick is treating genomes less like independent documents and more like drafts of the same manuscript. Closely related sequences share most of their history, so storing them separately means duplicating enormous amounts of identical information. PanMAN avoids this by encoding what changed and when, preserving the evolutionary narrative rather than flattening it into static files.
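For readers who think in code, the idea can be sketched in a few lines of Python. This is a toy illustration, not the PanMAN format itself: the `Node` class, the `reconstruct` function, and the eight-letter sequence are all invented for the example, and it handles only single-letter substitutions. Each node stores just the changes that first appeared on its branch, and any genome is rebuilt by replaying the edits from the root down.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One point on the family tree (hypothetical structure for illustration)."""
    name: str
    parent: Optional["Node"] = None
    # Substitutions that first appeared on the branch leading to this node,
    # stored as (0-based position, new base).
    mutations: List[Tuple[int, str]] = field(default_factory=list)


def reconstruct(node: Node, root_sequence: str) -> str:
    """Rebuild a node's genome by replaying edits from the root down to it."""
    # Collect the path from this node back up to the root.
    path = []
    current = node
    while current is not None:
        path.append(current)
        current = current.parent

    # Apply mutations oldest first, so later edits override earlier ones.
    seq = list(root_sequence)
    for ancestor in reversed(path):
        for position, base in ancestor.mutations:
            seq[position] = base
    return "".join(seq)


# A made-up ancestral sequence and a three-strain family tree.
ROOT_SEQ = "ACGTACGT"
root = Node("root")
a = Node("A", parent=root, mutations=[(2, "T")])
b = Node("B", parent=a, mutations=[(5, "G")])
c = Node("C", parent=a, mutations=[(0, "C")])

print(reconstruct(b, ROOT_SEQ))  # ACTTAGGT
print(reconstruct(c, ROOT_SEQ))  # CCTTACGT
```

Strains B and C both inherit the change at position 2 from A, yet that change is written down only once. Multiply that saving across millions of closely related genomes and the appeal of the approach becomes clear.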
To demonstrate the approach, the researchers assembled the largest pangenome ever built for SARS-CoV-2, pulling together more than eight million publicly available viral sequences. A traditional alignment of that dataset would demand a staggering amount of storage: taking the reported ratio at face value, upward of a terabyte. The PanMAN version fit into 366 megabytes, a reduction of more than 3,000-fold.
“Our compressive technique with PanMANs allows doing more with less, greatly improving the scale and scope of current pangenomic analysis,” Turakhia said.
The format handles messy biology, too. Genes do not always pass neatly from parent to offspring; bacteria swap DNA sideways, and viruses recombine. PanMAN uses network edges to capture these events, representing complex mutations that simpler tools ignore or discard. In tests across six microbial species, it outperformed existing formats by factors sometimes exceeding 1,300. Some older software simply crashed when fed the SARS-CoV-2 dataset.
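A rough sense of what a network edge buys you can be had from another toy sketch. Again, this is an assumption-laden illustration rather than PanMAN's actual encoding: the `RecombinationEdge` record, the `resolve` function, and the strain names are hypothetical, and real recombination is rarely a single clean cut. The point is only that a recombinant can be stored as a pointer to two donors and a breakpoint instead of as a full copy.

```python
from typing import Dict, NamedTuple


class RecombinationEdge(NamedTuple):
    """Hypothetical record: a child built from two donor genomes."""
    child: str
    left_donor: str   # supplies everything before the cut
    right_donor: str  # supplies everything from the cut onward
    cut: int          # 0-based breakpoint position


def resolve(edge: RecombinationEdge, genomes: Dict[str, str]) -> str:
    """Rebuild the recombinant from its two donors and the breakpoint."""
    left = genomes[edge.left_donor]
    right = genomes[edge.right_donor]
    return left[:edge.cut] + right[edge.cut:]


# Two made-up donor strains and one recombinant descended from both.
genomes = {"strain_1": "AAAAAAAA", "strain_2": "CCCCCCCC"}
edge = RecombinationEdge("recombinant", "strain_1", "strain_2", cut=3)
genomes[edge.child] = resolve(edge, genomes)

print(genomes["recombinant"])  # AAACCCCC
```

A strict tree allows each genome only one parent, so it has no natural place to record an event like this. An edge that names a second donor keeps the event in the record.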
The human genome is next
Viruses and bacteria were the proving ground. The team is already aiming at a far larger target: human genetic data, which dwarf anything a coronavirus can offer. By embedding metadata like collection dates and geographic locations directly in the network, researchers could trace how a mutation travels through a population in something close to real time.
For now, the file sits there: 366 megabytes, eight million viral genomes, waiting on a server for whoever needs it.
Nature Genetics: 10.1038/s41588-025-02478-7
