A detailed guide to assessing genome assembly based on long-read sequencing data using Inspector

Abstract

Long-read sequencing technologies yield extended DNA sequences capable of spanning intricate, repetitive genome regions, thereby facilitating the generation of more precise and comprehensive genome assemblies. However, assembly errors are inevitable owing to inherent genomic complexity and limitations of sequencing technology and assembly algorithms, making assembly evaluation crucial. The genome assembly evaluation tool Inspector presents several advantages over existing long-read de novo assembly evaluation tools, including (1) both reference-free and reference-guided assembly evaluation; (2) the ability to detect both small- and large-scale structural errors; (3) the option of assembly error correction, which can improve the quality value of the original assembly; and (4) the ability to perform haplotype-resolved assembly evaluation. Inspector can provide not only basic contig and alignment statistics, but also the precise locations and types of the different structural errors. These advantages provide a robust framework for long-read assembly evaluation. In this Protocol, we showcase four procedures to demonstrate the different applications of Inspector for long-read assembly evaluation. Inspector software and additional guides can be found at https://github.com/ChongLab/Inspector_protocol.
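To make the "basic contig statistics" mentioned in the abstract concrete, the following is a minimal sketch of one widely used contiguity metric, N50. It is not taken from Inspector's source code: the function name `n50` and the toy contig lengths are assumptions chosen for illustration only.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    (Hypothetical helper for illustration; not Inspector's code.)
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

    # Unreachable for non-empty input, kept for type completeness.
    return lengths[-1]


if __name__ == "__main__":
    # Toy assembly with four contigs (made-up lengths).
    print(n50([10, 20, 30, 40]))
```

With the toy lengths above, the two longest contigs (40 and 30) already cover more than half of the 100 total bases, so the reported N50 is the length of the second contig.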

Key points

Long-read sequencing has been instrumental in improving de novo assembly of genomes. However, genome complexity and limitations of sequencing technology and assembly algorithms necessitate comprehensive evaluation of the accuracy of the assembled genomes. Inspector is a flexible tool for reference-free or reference-guided assembly evaluation, showcased in four use-case scenarios described in this protocol.

Inspector identifies the types and precise locations of small- and large-scale structural errors and provides an error-correction module that can improve the quality of the original assembly.
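The link between error correction and assembly quality can be illustrated with the general Phred-scaled quality value, QV = -10 · log10(errors / total bases). The sketch below uses that generic definition only; how Inspector counts errors for its own QV is described in the tool's documentation and is not reproduced here. The function name `phred_qv` and the error counts are assumptions for illustration.

```python
import math


def phred_qv(error_bases: int, total_bases: int) -> float:
    """Generic Phred-scaled quality value for an assembly.

    QV = -10 * log10(error_bases / total_bases).
    Hypothetical helper; Inspector's own error accounting may differ.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if error_bases < 0 or error_bases > total_bases:
        raise ValueError("error_bases must be between 0 and total_bases")
    if error_bases == 0:
        # No observed errors: the QV is unbounded.
        return math.inf
    return -10.0 * math.log10(error_bases / total_bases)


if __name__ == "__main__":
    total = 10_000_000  # made-up assembly size
    before = phred_qv(1_000, total)  # made-up error count before correction
    after = phred_qv(100, total)     # made-up error count after correction
    print(f"QV before: {before:.1f}, QV after: {after:.1f}")
```

Because the scale is logarithmic, every tenfold reduction in erroneous bases raises the QV by 10 units, which is why removing a comparatively small number of errors can shift the reported quality noticeably.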

Fig. 1: The impact of long-read sequencing technologies on the growth of de novo assembly studies.

Fig. 2: The workflow of Inspector.

Fig. 3: IGV visualization of structural assembly errors.

Data availability

The dataset used in this protocol can be downloaded with the provided links.

Code availability

All code and commands used in this protocol are available as Supplementary Data. Additionally, the code and commands used for this protocol can be found at https://github.com/ChongLab/Inspector_protocol. The original code for Inspector is hosted at https://github.com/ChongLab/Inspector.

Acknowledgements

This study was supported by the Department of Biomedical Informatics and Data Science, Marnix E. Heersink School of Medicine, University of Alabama at Birmingham, and the Biostatistics and Bioinformatics Shared Resource, Sylvester Comprehensive Cancer Center, University of Miami. Y.G. was supported by P30CA240139, R01ES030993 and R01ES035421 from the NIH, USA. Z.C. was supported by the MIRA award (1R35GM138212) from NIH/NIGMS.

Author information

Author notes

These authors contributed equally: Yan Guo, Yuwei Song.

Authors and Affiliations

Department of Public Health and Sciences, University of Miami, Miami, FL, USA

Yan Guo, Limin Jiang & Michele Ceccarelli

Department of Biomedical Informatics and Data Science, Heersink School of Medicine, University of Alabama, Birmingham, AL, USA

Yuwei Song, Yu Chen, Min Gao & Zechen Chong

HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA

Zechen Chong

Contributions

Y.G. and Z.C. conceived and managed the project. Y.S. collected all the datasets and performed all the analyses. L.J., Y.C., M.C. and M.G. were involved in testing and evaluating the tool. Y.G., Y.S., L.J. and Z.C. prepared the figures and tables, wrote the manuscript draft and revised the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Yan Guo or Zechen Chong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Protocols thanks Guangyi Fan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Key reference

Chen, Y. et al. Genome Biol. 22, 312 (2021): https://doi.org/10.1186/s13059-021-02527-4

Supplementary information

Supplementary Tables 1–3

Supplementary Tables 1–3.

Supplementary Code 1

Source codes for the entire protocol.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, Y., Song, Y., Jiang, L. et al. A detailed guide to assessing genome assembly based on long-read sequencing data using Inspector. Nat Protoc (2025). https://doi.org/10.1038/s41596-025-01149-5

Received: 26 March 2024

Accepted: 14 January 2025

Published: 26 March 2025

DOI: https://doi.org/10.1038/s41596-025-01149-5
