The first large-scale WGS project to produce a catalog of human genetic variation was the 1000 Genomes Project (1kGP), which was released as an open-access resource seven years ago. It was based primarily on low-coverage WGS data of 2,504 individuals from 26 populations representing five different continents.
Scientists from the Human Genome Structural Variation Consortium (HGSVC), Yale University, and the New York Genome Center (NYGC) have now expanded the 1kGP resource to include nearly all parent-child trios in the collection and sequenced them, together with the original samples, at high coverage using Illumina NovaSeq machines.
The study, published in Cell, provides an in-depth analysis of the high-coverage WGS data on the extended 1kGP cohort, which now has 3,202 samples, including 602 trios.
“The 1000 Genomes Project cohort is such a valuable resource, we felt it would be useful to the community to bring the sequencing up to date with the latest version of short-read technology while adding in the richness of the previously omitted family samples.”
Michael Zody, Study Senior Author and Scientific Director, Computational Biology, New York Genome Center
Researchers at the NYGC sequenced DNA from lymphoblastoid cell lines (LCLs; i.e., immortalized human B cells from peripheral blood) from the extended cohort to a target depth of 30X genome coverage using cutting-edge techniques and algorithms.
The team next carried out single nucleotide variant (SNV) and short insertion and deletion (INDEL) calling, which entails identifying variant sites from the sequencing data relative to the reference human genome and genotyping of discovered variant sites across all samples in the cohort.
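The two steps described above, identifying a variant type from its alleles and then genotyping each sample at that site, can be illustrated with a deliberately simplified Python sketch. It is not the study's pipeline: the function names, the allele-fraction thresholds, and the minimum depth are all assumptions chosen for illustration, and production callers use probabilistic models rather than fixed cut-offs.

```python
from typing import Dict, Tuple


def classify_variant(ref: str, alt: str) -> str:
    """Classify a REF/ALT allele pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"    # single-base substitution
    if len(ref) != len(alt):
        return "INDEL"  # bases inserted or deleted relative to the reference
    return "MNV"        # same-length, multi-base substitution


def naive_genotype(ref_reads: int, alt_reads: int, min_depth: int = 8) -> str:
    """Call a diploid genotype from read counts (hypothetical thresholds)."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "./."    # too few reads to make a call
    alt_fraction = alt_reads / depth
    if alt_fraction < 0.15:
        return "0/0"    # homozygous reference
    if alt_fraction > 0.85:
        return "1/1"    # homozygous alternate
    return "0/1"        # heterozygous


def genotype_cohort(read_counts: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Genotype one variant site across every sample in a cohort."""
    return {
        sample: naive_genotype(ref_reads, alt_reads)
        for sample, (ref_reads, alt_reads) in read_counts.items()
    }
```

At roughly 30X depth, a heterozygous site would be expected to show about half of its reads supporting the alternate allele, which is why even this crude fraction-based rule separates the three genotypes far more cleanly than it could at low coverage.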
In addition, Dr Michael Talkowski’s group at the Harvard Medical School, Broad Institute, and Massachusetts General Hospital, working with Dr Ira Hall’s group at Yale University, the Washington University School of Medicine, and the HGSVC, identified and genotyped an extensive set of structural variants (SVs) in all 3,202 1kGP samples.
Overall, the study shows a substantial improvement in the discovery power and precision of variant calls, particularly for rare SNVs, as well as for INDELs and SVs across the frequency spectrum, which were previously out of reach with low-coverage sequencing.
The original 1kGP resource’s effectiveness as a reference panel for variant imputation—statistically inferring unobserved genotypes in sparse, array-based samples based on groupings of variants that are usually inherited together in the population—facilitated a number of genome-wide association studies (GWAS).
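The idea behind imputation can be shown with a toy Python sketch. It assumes haplotypes are encoded as lists of 0 (reference allele) and 1 (alternate allele), and it simply copies missing alleles from the single best-matching reference haplotype. The function name and the encoding are hypothetical, and real imputation tools use statistical models over the whole panel rather than this nearest-match rule.

```python
from typing import List, Optional, Sequence


def impute_haplotype(
    observed: Sequence[Optional[int]],
    panel: Sequence[Sequence[int]],
) -> List[int]:
    """Fill unobserved sites (None) from the closest reference haplotype."""
    if not panel:
        raise ValueError("reference panel must contain at least one haplotype")

    def matches(ref_hap: Sequence[int]) -> int:
        # Count agreement only at sites that were actually observed.
        return sum(
            1 for obs, ref in zip(observed, ref_hap)
            if obs is not None and obs == ref
        )

    # Ties are resolved in favor of the first haplotype in the panel.
    best = max(panel, key=matches)
    # Keep observed alleles; borrow the rest from the best match.
    return [ref if obs is None else obs for obs, ref in zip(observed, best)]


reference_panel = [
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
array_sample = [1, None, 0, None, 1]  # sites 2 and 4 were not on the array
imputed = impute_haplotype(array_sample, reference_panel)
print(imputed)  # [1, 1, 0, 0, 1]
```

A panel with more haplotypes and more variant sites gives this kind of matching more to work with, which is the intuition behind extending the reference panel.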
Because the original cohort was extended, the team revised the reference imputation panel to incorporate additional variants found through high-coverage WGS and the trio families.
“The new imputation panel includes more sites, especially many more common INDELs and SVs, thus expanding the number of variants accessible for GWAS, which, given the large effect size of non-SNV variation, is likely to enable discovery of new genetic associations that help pinpoint the causative variant.”
Marta Byrska-Bishop, Study Co-First Author and Senior Bioinformatics Scientist, New York Genome Center
All raw sequence data and variant call sets were made publicly available immediately upon sequencing completion through several genomic data repositories, including the International Genome Sample Resource (IGSR), which is run by co-authors from the European Bioinformatics Institute at the European Molecular Biology Laboratory (EMBL-EBI).
“Our goal is to have this public resource serve as the benchmark for future population genetic studies and methods development.”
Xuefang Zhao, Study Co-First Author and Postdoctoral Fellow, Center for Genomic Medicine, Massachusetts General Hospital
The community of evolutionary biologists and genome researchers has already expressed interest in the data. This interest is expected to continue for years to come because the 1kGP samples are completely open: unlike those in most recent WGS projects, they are consented for public release of genetic data without access or usage limitations.
Grants from the National Human Genome Research Institute (NHGRI) helped to fund the sequencing process. Grants from the NHGRI, National Institute of Child Health and Human Development (NICHD), National Institute of Mental Health (NIMH), European Molecular Biology Laboratory (EMBL), and Wellcome Trust helped fund a part of this analysis.
Byrska-Bishop, M., et al. (2022) High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. doi:10.1016/j.cell.2022.08.004