New Pangenomic Approach Enables Unprecedented Scaling for Genetic Data

Researchers at the University of California San Diego have created a new data structure and compression method that lets pangenomics handle genetic information at previously unattainable scales. The group described its compressive pangenomics methodology in Nature Genetics.

A new data structure and compression technique developed by engineers at UC San Diego enables the field of pangenomics to handle unprecedented scales of genetic information. Image Credit: Illustration by Alice Grishchenko

Pangenomics is a branch of bioinformatics that examines many genomes belonging to a single species. Compared with relying on a single reference genome, this approach gives a broader view of the natural diversity and genetic changes within a species. It supports a range of practical applications across biological research, such as investigating how mutations make pathogens more transmissible or more drug-resistant.

Despite improvements in genome sequencing technologies that have lowered costs and accelerated sequencing speed, the data structures and analytical tools required to examine and visually depict relationships among millions of sequenced genomes continue to pose difficulties.

Graph-based data formats for pangenomes are now widely used, but they capture only the genetic variation across a collection of genomes, not the genomes' shared evolutionary and mutational histories. These formats also demand substantial storage and do not scale efficiently.

“The data structures used for pangenomics research are critical because they determine not only how efficiently genetic data is represented, but also what the data can represent.”

Sumit Walia, Study Co-First Author and PhD Candidate, Electrical Engineering, Jacobs School of Engineering

The research group introduced a novel data structure and file format known as the Pangenome Mutation-Annotated Network (PanMAN). PanMAN delivers exceptional compression for pangenomes while also greatly enhancing representational capability by encoding added biologically meaningful information, including phylogenetic relationships, mutations, and complete genome alignments.

This compressive pangenomics approach enables analysis to be carried out directly on compressed pangenomic data, allowing researchers to work with genetic datasets at far larger scales than are currently achievable.

“Our compressive technique with PanMANs allows doing more with less, greatly improving the scale and scope of current pangenomic analysis.”

Yatish Turakhia, Study Corresponding Author, University of California San Diego

PanMANs consist of mutation-annotated trees, known as PanMATs, which retain a single ancestral genome sequence at the root and record mutations, such as substitutions, insertions, and deletions, along individual branches. Multiple PanMATs are linked together as a network through edges to form a PanMAN.
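The tree structure described above can be illustrated with a minimal Python sketch. This is a toy model, not the PanMAN file format or the authors' implementation: the `Node` class, the `(kind, position, payload)` mutation tuples, and the simple string coordinates are all assumptions made for illustration. It stores one root sequence, annotates each branch with its mutations, and rebuilds every genome by walking the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy mutation-annotated tree (hypothetical layout)."""
    name: str
    # Mutations on the branch leading to this node, applied in order.
    # Each entry is (kind, position, payload):
    #   ("sub", pos, base), ("ins", pos, bases), ("del", pos, length)
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def apply_mutations(seq, mutations):
    """Apply a branch's mutations to the parent sequence."""
    for kind, pos, payload in mutations:
        if kind == "sub":
            seq = seq[:pos] + payload + seq[pos + 1:]
        elif kind == "ins":
            seq = seq[:pos] + payload + seq[pos:]
        elif kind == "del":
            seq = seq[:pos] + seq[pos + payload:]
        else:
            raise ValueError(f"unknown mutation kind: {kind}")
    return seq


def reconstruct(root_seq, root):
    """Rebuild the sequence at every node from the single stored root."""
    sequences = {}
    stack = [(root, root_seq)]
    while stack:
        node, parent_seq = stack.pop()
        seq = apply_mutations(parent_seq, node.mutations)
        sequences[node.name] = seq
        stack.extend((child, seq) for child in node.children)
    return sequences


# Toy example: only the root sequence is stored in full.
tree = Node("root", children=[
    Node("A", [("sub", 2, "T")], children=[
        Node("A1", [("ins", 4, "GG")]),
        Node("A2", [("del", 0, 2)]),
    ]),
    Node("B", [("sub", 7, "A")]),
])
genomes = reconstruct("ACGTACGT", tree)
```

In this sketch the sequences at internal nodes such as `A` come out of the same traversal as the leaf genomes, which is one way to picture how ancestral sequences could be derived rather than stored.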

These edges capture complex mutation events, including recombination and horizontal gene transfer, which generate sequences derived from multiple parent genomes and break the assumption of strictly vertical inheritance found in single trees.
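As a rough illustration of why such an event does not fit in a single tree, the sketch below models a recombinant genome that takes its left part from one parent and its right part from another. The `RecombinationEdge` class, its field names, the single breakpoint, and the assumption that both parents are already aligned to the same length are all hypothetical simplifications, not the representation used in PanMAN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecombinationEdge:
    """Toy network edge linking two parent genomes to one child (hypothetical)."""
    left_parent: str   # genome supplying positions before the breakpoint
    right_parent: str  # genome supplying positions from the breakpoint on
    breakpoint: int
    child: str


def resolve(edge, sequences):
    """Build the child sequence from its two parents.

    Assumes both parent sequences are aligned to the same coordinates.
    """
    left = sequences[edge.left_parent]
    right = sequences[edge.right_parent]
    return left[:edge.breakpoint] + right[edge.breakpoint:]


# The two parents live in different trees; the edge connects them.
sequences = {"tree1/leafA": "AAAAAAAA", "tree2/leafB": "CCCCCCCC"}
edge = RecombinationEdge("tree1/leafA", "tree2/leafB", 3, "recombinant")
recombinant = resolve(edge, sequences)
```

The child has two parents, so no single tree node can serve as its only ancestor, which is the situation the network edges are described as handling.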

This model is highly space-efficient because it leverages shared ancestry across genomes, recording each mutation only once at the branch where it originated rather than repeating it across separate sequences.
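The saving can be made concrete by counting stored records on a small made-up tree. The sketch below is only an illustration under assumed numbers: the tree shape, the per-branch mutation counts, and the `count_records` helper are invented for the example. It compares storing each mutation once on its branch with listing, for every genome separately, all the mutations on its path from the root.

```python
def count_records(children, branch_mutations, root):
    """Return (records stored per branch, records stored per genome)."""
    # Tree-style storage: each mutation is written once, on its branch.
    tree_records = sum(branch_mutations.values())

    # Per-genome storage: every leaf repeats all mutations on its root path.
    per_genome_records = 0
    stack = [(root, branch_mutations.get(root, 0))]
    while stack:
        node, path_total = stack.pop()
        kids = children.get(node, [])
        if not kids:
            per_genome_records += path_total
        for kid in kids:
            stack.append((kid, path_total + branch_mutations.get(kid, 0)))
    return tree_records, per_genome_records


# Hypothetical tree: two clades with three and two sampled genomes.
children = {"root": ["X", "Y"], "X": ["x1", "x2", "x3"], "Y": ["y1", "y2"]}
# Hypothetical number of mutations on the branch leading to each node.
branch_mutations = {"X": 5, "Y": 2, "x1": 1, "x2": 0, "x3": 1, "y1": 0, "y2": 1}

tree_records, per_genome_records = count_records(children, branch_mutations, "root")
```

Here the five mutations on the branch to `X` are stored once but inherited by three genomes, so the gap between the two counts grows as more genomes share deep branches.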

In addition, PanMAN was designed to capture a broad range of biologically meaningful information that existing pangenome formats do not include. Some of this information is stored directly in a PanMAN, including mutations, phylogeny, annotations, and the root sequence. Other information, such as ancestral sequences, multiple whole-genome alignments, and genetic variation, can be inferred from what is stored.

To date, the researchers have applied PanMAN to the analysis of microbial genomes. Their findings show that this approach offers the highest level of compression among pangenomic formats that preserve variation, achieving compression improvements of hundreds to even thousands of times.

As an example, the team constructed the largest SARS-CoV-2 pangenome to date, incorporating more than 8 million individual viral genomes. Stored as a PanMAN, this enormous volume of genetic data required only 366 MB, approximately 3,000 times less space than the whole-genome alignment that the PanMAN encodes.
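A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below uses only the numbers quoted in this article and assumes decimal units (1 MB = 10^6 bytes) and exactly 8 million genomes, which is a lower bound on the reported count, so the results are rough estimates rather than figures from the study.

```python
# Figures quoted in the article.
panman_megabytes = 366
compression_ratio = 3_000       # "approximately 3,000 times less space"
genome_count = 8_000_000        # lower bound: "more than 8 million"

# Assumption: decimal units, 1 MB = 10**6 bytes and 1 TB = 10**12 bytes.
panman_bytes = panman_megabytes * 10**6

# Implied size of the equivalent whole-genome alignment.
alignment_terabytes = panman_bytes * compression_ratio / 10**12

# Average storage per genome; an upper bound because the true count is larger.
bytes_per_genome = panman_bytes / genome_count

print(f"Implied alignment size: about {alignment_terabytes:.1f} TB")
print(f"Average PanMAN storage: at most about {bytes_per_genome:.0f} bytes per genome")
```

Under these assumptions the alignment would occupy on the order of a terabyte, while the PanMAN averages a few dozen bytes per genome.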

Building a whole-genome alignment for SARS-CoV-2 at this scale was itself a major technical challenge, which was solved using another computational tool developed in Turakhia’s lab, known as TWILIGHT. The researchers are now extending the application of TWILIGHT and PanMANs from microbial to human genomes.

“Extending compressive pangenomics to human genomes can fundamentally transform how we store, analyze, and share large-scale human genetic data. Besides enabling studies of human genetic diversity, disease, and evolution at unprecedented scale and speed, it can depict the detailed evolutionary and mutational histories that shape diverse human populations, something that current representations do not capture.”

Yatish Turakhia, Study Corresponding Author, University of California San Diego

Journal reference:

Walia, S., et al. (2026) Compressive pangenomics using mutation-annotated networks. Nature Genetics. DOI: 10.1038/s41588-025-02478-7. https://www.nature.com/articles/s41588-025-02478-7.
