In May 2023, the Human Pangenome Reference Consortium (HPRC) released the first draft of the human pangenome reference, which integrates genomic data from 47 genetically diverse individuals and adds 119 million base pairs of euchromatic polymorphic sequence. One month later, in June 2023, the Chinese Pangenome Consortium (CPC) published 116 high-quality haplotype-resolved sequences based on samples from 36 ethnic minority groups in China, contributing a further 189 million base pairs of novel sequence. These breakthroughs expose the limitations of a traditional single reference genome in covering human genetic diversity. By integrating genetic information from multiple populations, the pangenome provides a new framework for unraveling disease susceptibility and ethnic differences.
In a commentary article, Yingyan Yu of Ruijin Hospital, affiliated with Shanghai Jiao Tong University School of Medicine, and Hongzhuan Chen of Shanghai University of Traditional Chinese Medicine systematically review the development of the human pangenome and its far-reaching implications for precision medicine.
The year 2023 marks the 70th anniversary of the discovery of the DNA double-helix structure. In April 1953, Watson and Crick published a paper in Nature proposing the double-helix model of DNA; the elucidation of DNA's replication mechanism laid the foundation for decoding genetic information. The discovery rested on collaboration across groups: "Photo 51", the X-ray crystallographic image of DNA taken in 1952 by Rosalind Franklin's team at King's College London, provided crucial evidence for deciphering the double-helix structure.
The first-generation sequencing technology invented by Sanger in 1977 paved the way for the launch of the Human Genome Project (HGP) in the 1990s. Chinese scientists contributed 1% of the sequencing work, and the project ultimately completed the sequencing of over 90% of the human genome in 2003. However, limited by technological capabilities at the time, the initial reference genome contained hundreds of sequence gaps and failed to capture population-specific variations. The advent of next-generation sequencing technology in 2005 reduced sequencing costs, spurring in-depth exploration of genetic diversity and giving rise to pangenome research.
The term "pangenome" derives from the ancient Greek word for "whole" and refers to the sum of genetic material across all individuals within a species. It comprises three components: core genes (shared by all individuals), distributed genes (present in some individuals but absent in others), and population-specific genes (unique to a particular ethnic group). The concept was first proposed in 2005 in a study on Streptococcus agalactiae and was later extended to research on plants and humans. In 2010, Chinese scientists integrated the genomes of an Asian individual and an African individual, identifying approximately 5 Mb of novel sequences not present in the existing reference genome and laying a foundation for the construction of the human pangenome.
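The three-way split described above can be illustrated with a minimal Python sketch. It is not the method used by HUPAN or by any consortium. It assumes a toy gene presence/absence table, and the function name classify_genes, the gene names, the individuals, and the population labels are all hypothetical.

```python
from typing import Dict, Set


def classify_genes(
    presence: Dict[str, Set[str]],
    population_of: Dict[str, str],
) -> Dict[str, str]:
    """Label each gene as core, distributed, or population-specific.

    presence maps a gene name to the set of individuals carrying it;
    population_of maps each individual to a population label.
    Both structures are illustrative assumptions, not a real file format.
    """
    all_individuals = set(population_of)
    labels: Dict[str, str] = {}
    for gene, carriers in presence.items():
        # Ignore carriers that are not part of the studied cohort.
        carriers = carriers & all_individuals
        if not carriers:
            labels[gene] = "absent"
        elif carriers == all_individuals:
            # Carried by every individual in the cohort.
            labels[gene] = "core"
        elif len({population_of[ind] for ind in carriers}) == 1:
            # Not universal, and every carrier comes from one population.
            labels[gene] = "population-specific"
        else:
            # Present in some individuals across several populations.
            labels[gene] = "distributed"
    return labels


# Hypothetical cohort: four individuals from two populations.
population_of = {"ind1": "popA", "ind2": "popA", "ind3": "popB", "ind4": "popB"}
presence = {
    "geneX": {"ind1", "ind2", "ind3", "ind4"},  # carried by everyone
    "geneY": {"ind1", "ind3"},                  # carried in both populations
    "geneZ": {"ind3", "ind4"},                  # carried only in popB
}

for gene, label in classify_genes(presence, population_of).items():
    print(gene, label)
```

In this toy cohort geneX comes out as core, geneY as distributed, and geneZ as population-specific. With real data the boundaries depend on cohort size and sampling, so a gene that looks population-specific in a small cohort may turn out to be distributed once more individuals are added.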
Compared with microbial genomes, the human genome generates a far larger volume of data, so human pangenome research depends on methodological advances, including the development of high-performance computing clusters and automated analysis tools.
A core challenge in pangenome research lies in processing massive datasets. A joint team from Shanghai Jiao Tong University and Shanghai University of Traditional Chinese Medicine developed the HUman Pan-genome ANalysis (HUPAN) automated pipeline, which uses high-performance computing to analyze large-scale whole-genome sequencing (WGS) data. The tool was named one of the "Top 10 Algorithms and Tools for Bioinformatics in China" in 2019. On the sequencing side, third-generation sequencing produces longer reads, which helps overcome the gene annotation inaccuracies associated with second-generation sequencing.
In cancer research, the team used HUPAN to analyze 185 pairs of gastric cancer tissue samples from Han Chinese individuals, constructing the first human gastric cancer pangenome. This pangenome includes 80.88 Mb of previously unmapped novel sequences and revealed that the deletion frequency of distributed genes (e.g., GSTM1, SIGLEC14) is significantly higher in the Han population than in Western populations, which offers a genetic explanation for the ethnic susceptibility to gastric cancer. Furthermore, 14 novel genes were predicted from the novel sequences; among them, the gene GC0643 was mapped to the 9q34.2 locus. In vitro experiments confirmed that GC0643 inhibits cancer cell growth and promotes apoptosis, and the gene has been registered in the NCBI database (GenBank: MW194843.1).
The maturation of pangenome technology is driving the era of individualized genomics. Data from the HPRC indicate that genomic differences between humans account for approximately 0.4% of the entire genome, and structural variations within this fraction may be closely associated with disease susceptibility. For instance, the deletion of SIGLEC14 in the Han population may impair innate immunity, while the deletion of ACOT1 might influence cancer development through fatty acid metabolism. Notably, some of these variations overlap with the Neanderthal genome, offering clues for human evolution research.
In the future, "tailored" treatments based on the pangenome will optimize disease diagnosis and drug selection, explain ethnic differences in disease incidence, and clarify the heterogeneity of drug responses, ultimately propelling precision medicine into a new era.
Journal reference:
Yu, Y. & Chen, H. (2023). Human pangenome: far-reaching implications in precision medicine. Frontiers of Medicine. doi: 10.1007/s11684-023-1039-1. https://journal.hep.com.cn/fmd/EN/10.1007/s11684-023-1039-1