Abstract
Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducibility, and sequencing data require extensive quality filtering to delineate true variants from technical artifacts. Although processing standards are harmonized across pipelines to call variants from sequencing reads, there currently exists no standardized pipeline for conducting quality filtering on variant-level datasets for the purpose of population-scale association analysis. In this Tutorial, we discuss key quality control parameters, provide guidelines for conducting quality filtering of samples and variants, and compare commonly used software programs for quality control of samples, variants and genotypes from sequencing data. As sequencing data continue to gain popularity in genetic research, establishing standardized quality control practices is crucial to ensure consistent, reliable and reproducible results across studies.
Fig. 1: Overview of data processing steps and quality filtering for samples, genotypes and variants for sequencing data.
Fig. 2: The effects of filtering heterozygosity ratio with criteria from different samples stratified by ancestry.
Fig. 3: Distributions of sample QC metrics stratified by ancestry for WES (left) and WGS (right) from the 1KGP+HGDP dataset.
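The caption of Fig. 2 refers to filtering the heterozygosity ratio with criteria derived from samples stratified by ancestry. As a rough illustration of that idea, and not the authors' Hail code, the plain-Python sketch below flags samples whose metric falls outside bounds computed within their own ancestry group and compares this with bounds computed on the pooled cohort. The sample values, the group labels, the helper names (`mad_bounds`, `flag_outliers`) and the cutoff of 4 median absolute deviations are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median


def mad_bounds(values, n_mads=4.0):
    """Return (lower, upper) bounds at median +/- n_mads * MAD.

    The 4-MAD default is an illustrative assumption, not a recommended cutoff.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad


def flag_outliers(samples, stratify=True, n_mads=4.0):
    """Return IDs of samples whose het_ratio lies outside the bounds.

    With stratify=True the bounds are computed separately per ancestry label;
    with stratify=False a single set of bounds is computed on all samples.
    """
    groups = defaultdict(list)
    for s in samples:
        key = s["ancestry"] if stratify else "all"
        groups[key].append(s["het_ratio"])
    bounds = {key: mad_bounds(vals, n_mads) for key, vals in groups.items()}

    flagged = []
    for s in samples:
        key = s["ancestry"] if stratify else "all"
        lower, upper = bounds[key]
        if not lower <= s["het_ratio"] <= upper:
            flagged.append(s["id"])
    return flagged


# Hypothetical samples: two ancestry groups with different typical ratios,
# plus one aberrant sample (A6) in group A.
RAW = [
    ("A1", "A", 1.50), ("A2", "A", 1.52), ("A3", "A", 1.48),
    ("A4", "A", 1.51), ("A5", "A", 1.49), ("A6", "A", 2.40),
    ("B1", "B", 2.00), ("B2", "B", 2.02), ("B3", "B", 1.98),
    ("B4", "B", 2.01), ("B5", "B", 1.99),
]
SAMPLES = [{"id": i, "ancestry": a, "het_ratio": r} for i, a, r in RAW]

stratified = flag_outliers(SAMPLES, stratify=True)
pooled = flag_outliers(SAMPLES, stratify=False)
print("stratified:", stratified)
print("pooled:", pooled)
```

In this toy data, the pooled bounds are widened by the difference between the two groups, so the aberrant sample is caught only when the bounds are computed within its own group. Real cohorts would need thresholds chosen by inspecting the metric distributions.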
Data availability
Figures 2 and 3 and Table 3 were created using the publicly available 1000 Genomes Project phase 3 and Human Genome Diversity Project data. These datasets can be directly loaded into Hail as a matrix table using the dataset repository (https://hail.is/docs/0.2/datasets.html).
Code availability
Python code for conducting sample and variant filtering using Hail is available at https://github.com/jsealock1/sequencing_qc.
Acknowledgements
This work is supported by the Novo Nordisk Foundation (NNF21SA0072102) with the following funding sources: R37MH107649, U01MH125047, R01MH101244.
Author information
Authors and Affiliations
Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
Julia M. Sealock, Franjo Ivankovic, Calwing Liao, Siwei Chen, Claire Churchhouse, Konrad J. Karczewski, Daniel P. Howrigan & Benjamin M. Neale
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Julia M. Sealock, Franjo Ivankovic, Calwing Liao, Siwei Chen, Claire Churchhouse, Konrad J. Karczewski, Daniel P. Howrigan & Benjamin M. Neale
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Konrad J. Karczewski & Benjamin M. Neale
Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Konrad J. Karczewski & Benjamin M. Neale
Contributions
This tutorial was designed, developed and written by J.M.S.; F.I., C.L., S.C., C.C., K.J.K., D.P.H. and B.M.N. provided critical feedback and manuscript edits; and K.J.K., D.P.H. and B.M.N. supervised the work. All authors approved the final manuscript.
Corresponding author
Correspondence to Julia M. Sealock.
Ethics declarations
Competing interests
B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences, AlloDx and Vor Biosciences, and a member of the scientific advisory board of Nurture Genomics.
Peer review
Peer review information
Nature Protocols thanks Valerio Napolioni, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Supplementary information
Supplementary Note on Sequencing Generation, Supplementary Fig. 1 describing the structure of a Hail matrix table and references for the Supplementary Note.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sealock, J.M., Ivankovic, F., Liao, C. et al. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc (2025). https://doi.org/10.1038/s41596-025-01169-1
Received: 28 June 2024
Accepted: 04 March 2025
Published: 28 March 2025
DOI: https://doi.org/10.1038/s41596-025-01169-1