Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale…

Abstract

Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducibility, and sequencing data require extensive quality filtering to delineate true variants from technical artifacts. Although processing standards are harmonized across pipelines to call variants from sequencing reads, there currently exists no standardized pipeline for conducting quality filtering on variant-level datasets for the purpose of population-scale association analysis. In this Tutorial, we discuss key quality control parameters, provide guidelines for conducting quality filtering of samples and variants, and compare commonly used software programs for quality control of samples, variants and genotypes from sequencing data. As sequencing data continue to gain popularity in genetic research, establishing standardized quality control practices is crucial to ensure consistent, reliable and reproducible results across studies.

Fig. 1: Overview of data processing steps and quality filtering for samples, genotypes and variants for sequencing data.

Fig. 2: The effects of filtering heterozygosity ratio with criteria from different samples stratified by ancestry.

Fig. 3: Distributions of sample QC metrics stratified by ancestry for WES (left) and WGS (right) from the 1KGP+HGDP dataset.
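The figure captions above refer to sample QC metrics, such as the heterozygosity ratio, being filtered with criteria stratified by ancestry. As a minimal sketch of that idea, and not the authors' implementation, the following plain-Python example flags outlier samples using bounds computed within each ancestry group. The field names (`id`, `ancestry`, `het_ratio`), the toy values and the median ± 4 MAD rule are all assumptions chosen for illustration.

```python
from statistics import median


def mad_bounds(values, n_mads=4.0):
    """Return (lower, upper) bounds at median +/- n_mads * MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


def flag_outliers_by_group(samples, metric="het_ratio", group="ancestry", n_mads=4.0):
    """Flag samples whose metric lies outside bounds computed within their own group."""
    # Collect the metric values separately for each group label.
    by_group = {}
    for s in samples:
        by_group.setdefault(s[group], []).append(s[metric])
    # Derive one pair of bounds per group rather than one pooled pair.
    bounds = {g: mad_bounds(vals, n_mads) for g, vals in by_group.items()}
    flagged = []
    for s in samples:
        lower, upper = bounds[s[group]]
        if not lower <= s[metric] <= upper:
            flagged.append(s["id"])
    return flagged


# Hypothetical toy data: two ancestry groups with different typical ratios.
samples = [
    {"id": "s1", "ancestry": "A", "het_ratio": 2.00},
    {"id": "s2", "ancestry": "A", "het_ratio": 2.10},
    {"id": "s3", "ancestry": "A", "het_ratio": 1.90},
    {"id": "s4", "ancestry": "A", "het_ratio": 2.05},
    {"id": "s5", "ancestry": "A", "het_ratio": 1.95},
    {"id": "s6", "ancestry": "A", "het_ratio": 3.50},
    {"id": "s7", "ancestry": "B", "het_ratio": 1.50},
    {"id": "s8", "ancestry": "B", "het_ratio": 1.55},
    {"id": "s9", "ancestry": "B", "het_ratio": 1.45},
    {"id": "s10", "ancestry": "B", "het_ratio": 1.50},
    {"id": "s11", "ancestry": "B", "het_ratio": 1.52},
]

print(flag_outliers_by_group(samples))
```

Computing the bounds per group means that a sample is compared only with samples of similar ancestry, so a group whose typical ratio differs from the rest of the cohort is not removed wholesale by a pooled threshold.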

Data availability

Figures 2 and 3 and Table 3 were created using the publicly available 1000 Genomes Project phase 3 and Human Genome Diversity Project data. These datasets can be directly loaded into Hail as a matrix table using the dataset repository (https://hail.is/docs/0.2/datasets.html).

Code availability

Python code for conducting sample and variant filtering using Hail is available at https://github.com/jsealock1/sequencing_qc.

Acknowledgements

This work is supported by the Novo Nordisk Foundation (NNF21SA0072102) with the following funding sources: R37MH107649, U01MH125047, R01MH101244.

Author information

Authors and Affiliations

Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA

Julia M. Sealock, Franjo Ivankovic, Calwing Liao, Siwei Chen, Claire Churchhouse, Konrad J. Karczewski, Daniel P. Howrigan & Benjamin M. Neale

Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA

Julia M. Sealock, Franjo Ivankovic, Calwing Liao, Siwei Chen, Claire Churchhouse, Konrad J. Karczewski, Daniel P. Howrigan & Benjamin M. Neale

Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

Konrad J. Karczewski & Benjamin M. Neale

Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA

Konrad J. Karczewski & Benjamin M. Neale

Contributions

This tutorial was designed, developed and written by J.M.S.; F.I., C.L., S.C., C.C., K.J.K., D.P.H. and B.M.N. provided critical feedback and manuscript edits; and K.J.K., D.P.H. and B.M.N. supervised the work. All authors approved the final manuscript.

Corresponding author

Correspondence to Julia M. Sealock.

Ethics declarations

Competing interests

B.M.N. is a member of the scientific advisory board at Deep Genomics and Neumora. K.J.K. is a consultant for Tome Biosciences, AlloDx and Vor Biosciences, and a member of the scientific advisory board of Nurture Genomics.

Peer review

Peer review information

Nature Protocols thanks Valerio Napolioni, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note on Sequencing Generation, Supplementary Fig. 1 describing the structure of a Hail matrix table and references for the Supplementary Note.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sealock, J.M., Ivankovic, F., Liao, C. et al. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc (2025). https://doi.org/10.1038/s41596-025-01169-1

Received: 28 June 2024

Accepted: 04 March 2025

Published: 28 March 2025

DOI: https://doi.org/10.1038/s41596-025-01169-1
