Abstract
The repetitive nature and complexity of some medically relevant genes poses a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
Data availability
The PacBio HiFi reads used to generate the hifiasm assembly for the benchmark are in the NCBI Sequence Read Archive with accession numbers SRR10382245, SRR10382244, SRR10382249, SRR10382248, SRR10382247 and SRR10382246. The v1.00 benchmark VCF and BED files, as well as Liftoff gene annotations, assembly–assembly alignments and variant calls, are available at https://label.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/, and as a DOI at https://doi.org/10.18434/mds2-2475. This is released as a separate benchmark from v4.2.1 because it includes only a small fraction of the genome, it has different characteristics from the mapping-based v4.2.1, and v4.2.1 includes only small variants. Using v4.2.1 and the CMRG benchmarks as two separate benchmarks enables users to obtain broader performance metrics for most of the genome and for a small set of particularly challenging genes, respectively. The masked GRCh38 reference, recently updated to v2 with additional false duplications from the Telomere-to-Telomere Consortium, is under https://label.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references. We recommend using the v3.0 GA4GH/GIAB stratification bed files intended for use with hap.py when benchmarking, which are available at https://label.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/. These stratifications include bed files corresponding to false duplications and collapsed duplications in GRCh38. All data have no restrictions, as the HG002 sample has an open consent from the Personal Genome Project.
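To illustrate how a benchmark BED file restricts which variants count toward recall, the following is a minimal Python sketch rather than the procedure used for this benchmark. It assumes variants can be compared by exact match on (chromosome, position, REF, ALT), which is a simplification: hap.py performs haplotype-aware comparison that this sketch does not reproduce. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted, merged [(start, end), ...]}.

    BED intervals are 0-based and half-open. Overlapping intervals are
    merged so that a binary search per position is sufficient.
    """
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_regions(regions, chrom, pos_1based):
    """Return True if a 1-based VCF position lies inside any BED interval."""
    intervals = regions.get(chrom, [])
    pos0 = pos_1based - 1  # convert to 0-based to match BED coordinates
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def recall(truth, calls, regions):
    """Fraction of in-region truth variants that also appear in the calls.

    Variants are (chrom, pos, ref, alt) tuples compared by exact match,
    which is an assumption of this sketch.
    """
    truth_in = {v for v in truth if in_regions(regions, v[0], v[1])}
    if not truth_in:
        return float("nan")
    return len(truth_in & set(calls)) / len(truth_in)


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    bed = ["chr21\t100\t200", "chr21\t150\t300"]
    truth = [("chr21", 101, "A", "G"), ("chr21", 250, "C", "T"),
             ("chr21", 500, "G", "A")]
    calls = [("chr21", 101, "A", "G")]
    print(recall(truth, calls, load_bed(bed)))
```

In this toy case the truth variant at position 500 lies outside the benchmark regions and is ignored, so one of the two in-region truth variants is recovered. Real comparisons should use a dedicated benchmarking tool, because differently represented but equivalent variants would be counted as misses here.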
Code availability
Scripts used to develop the CMRG benchmark and generate figures and tables for the manuscript are available at https://github.com/usnistgov/cmrg-benchmarkset-manuscript. The previously developed assembly, which was used as the basis of this benchmark, was from hifiasm v0.11.
A variety of open source software was used for variant calling for the evaluations of the benchmark, including NextDenovo2.2-beta.0, DRAGEN 3.6.3, NeuSomatic’s submission for the PrecisionFDA truth challenge v2 (ref. 12) (BWA-MEM (ref. 50) version 0.7.17-r1188 (https://github.com/lh3/bwa) and GATK version gatk-4.1.4.1 (https://gatk.broadinstitute.org/hc/en-us)), Parabricks_DeepVariant (Parabricks Pipelines DeepVariant v3.0.0_2 (https://developer.nvidia.com/clara-parabricks)), Sentieon (DNAscope) version sentieon_release_201911 (https://www.sentieon.com/products/#dnaseq), BWA-MEM and Strelka2 (BWA-MEM version 0.7.17-r1188 (https://github.com/lh3/bwa) and Strelka2 version 2.9.10 (https://github.com/Illumina/strelka)), BWA-MEM (ref. 50) (v0.7.8), Picard tools (https://broadinstitute.github.io/picard/) (ver. 1.83), GATK (ref. 52) (v3.4-0), GATK (v3.5), BWA-MEM v0.7.15-r1140, SAMtools (ref. 53) v1.3, Picard v2.10.10, GATK v3.8, DELLY (ref. 54) v0.8.5, GRIDSS (ref. 55) v2.9.4, LUMPY (ref. 56) v0.3.1, Manta (ref. 57) v1.6.0, Wham (ref. 58) v1.7.0, NanoPlot (ref. 60) v1.27.0, Filtlong v0.2.0, minimap2 (refs. 40,60) v2.17-r941, cuteSV v1.0.8, Sniffles (ref. 61) v1.0.12, SURVIVOR (ref. 59) v1.0.7, BWA v0.7.15, GATK v3.6, Java v1.8.0_74 (OpenJDK), Picard Tools v2.6.0, Sambamba (ref. 63) v0.6.7, Samblaster (ref. 64) v0.1.24, Samtools v1.9, DeepVariant v1.0 and Liftoff (ref. 32) v1.4.0.
References
- 1.
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
- 2.
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
- 3.
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
- 4.
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of 11 human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
- 5.
Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
- 6.
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
- 7.
Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).
- 8.
Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 1–23 (2019).
- 9.
Lincoln, S. E. et al. One in seven pathogenic variants can be challenging to detect by NGS: an analysis of 450,000 patients with implications for clinical sensitivity and genetic test implementation. Genet. Med. 23, 1673–1680 (2021).
- 10.
Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
- 11.
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020); erratum 38, 1357 (2020).
- 12.
Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2020).
- 13.
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Preprint at bioRxiv https://doi.org/10.1101/2020.07.24.212712 (2020).
- 14.
Chin, C.-S. et al. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat. Commun. 11, 4794 (2020).
- 15.
Goldfeder, R. L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
- 16.
Ball, M. P. et al. A public resource facilitating clinical use of genomes. Proc. Natl Acad. Sci. USA 109, 11920–11927 (2012).
- 17.
Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947 (2019).
- 18.
Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).
- 19.
Prior, T. W., Leach, M. E. & Finanger, E. Spinal muscular atrophy. In GeneReviews [Internet] (University of Washington, 2020).
- 20.
Biros, I. & Forrest, S. Spinal muscular atrophy: untangling the knot? J. Med. Genet. 36, 1–8 (1999).
- 21.
Leiding, J. W. & Holland, S. M. Chronic granulomatous disease. In GeneReviews [Internet] (University of Washington, 2016).
- 22.
Innan, H. A two-locus gene conversion model with selection and its application to the human RHCE and RHD genes. Proc. Natl Acad. Sci. USA 100, 8793–8798 (2003).
- 23.
Hayakawa, T. et al. Coevolution of Siglec-11 and Siglec-16 via gene conversion in primates. BMC Evol. Biol. 17, 228 (2017).
- 24.
Garg, P. et al. Pervasive cis effects of variation in copy number of large tandem repeats on local DNA methylation and gene expression. Am. J. Hum. Genet. https://doi.org/10.1016/j.ajhg.2021.03.016 (2021).
- 25.
Lennerz, J. K. et al. Addition of H19 ‘loss of methylation testing’ for Beckwith-Wiedemann syndrome (BWS) increases the diagnostic yield. J. Mol. Diagn. 12, 576–588 (2010).
- 26.
Nurk, S. et al. The complete sequence of a human genome. Preprint at bioRxiv https://doi.org/10.1101/2021.05.26.445798 (2021).
- 27.
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Preprint at bioRxiv https://doi.org/10.1101/2021.07.12.452063 (2021).
- 28.
Boisson, B. et al. Rescue of recurrent deep intronic mutation underlying cell type–dependent quantitative NEMO deficiency. J. Clin. Invest. 129, 583–597 (2018).
- 29.
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
- 30.
Schmidt, K., Noureen, A., Kronenberg, F. & Utermann, G. Structure, function, and genetics of lipoprotein (a). J. Lipid Res. 57, 1339–1359 (2016).
- 31.
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
- 32.
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinform. 37, 1639–1643 (2020).
- 33.
Theunissen, F. et al. Structural variants may be a source of missing heritability in sALS. Front. Neurosci. 14, 47 (2020).
- 34.
Guo, Y. et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 83–90 (2017).
- 35.
Pan, B. et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinform. 20, 101 (2019).
- 36.
Miller, C. A. et al. Failure to detect mutations in U2AF1 due to changes in the GRCh38 reference sequence. Preprint at bioRxiv https://doi.org/10.1101/2021.05.07.442430 (2021).
- 37.
Li, H. et al. Exome variant discrepancies due to reference-genome differences. Am. J. Hum. Genet. 108, 1239–1250 (2021).
- 38.
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 590, E55 (2021).
- 39.
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinform. 26, 841–842 (2010).
- 40.
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinform. 34, 3094–3100 (2018).
- 41.
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
- 42.
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
- 43.
Farek, J. et al. xAtlas: scalable small variant calling across heterogeneous next-generation sequencing experiments. Preprint at bioRxiv https://doi.org/10.1101/295071 (2018).
- 44.
Edge, P. & Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat. Commun. 10, 4660 (2019).
- 45.
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
- 46.
Sahraeian, S. M. E. et al. Deep convolutional neural networks for accurate somatic mutation detection. Nat. Commun. 10, 1041 (2019).
- 47.
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014).
- 48.
Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 6, 498–509 (2015).
- 49.
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
- 50.
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
- 51.
Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 4038 (2018).
- 52.
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv https://doi.org/10.1101/201178 (2018).
- 53.
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinform. 25, 2078–2079 (2009).
- 54.
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinform. 28, 333–339 (2012).
- 55.
Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 27, 2050–2060 (2017).
- 56.
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
- 57.
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinform. 32, 1220–1222 (2016).
- 58.
Kronenberg, Z. N. et al. Wham: identifying structural variants of biological consequence. PLoS Comput. Biol. 11, e1004572 (2015).
- 59.
Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).
- 60.
De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M. & Van Broeckhoven, C. NanoPack: visualizing and processing long-read sequencing data. Bioinform. 34, 2666–2669 (2018).
- 61.
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
- 62.
Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
- 63.
Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: fast processing of NGS alignment formats. Bioinform. 31, 2032–2034 (2015).
- 64.
Faust, G. G. & Hall, I. M. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinform. 30, 2503–2505 (2014).
- 65.
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Acknowledgements
We thank the Genome Reference Consortium for their curation efforts of GRCh37 and GRCh38 (https://www.genomereference.org), especially V.A. Schneider and P.A. Kitts from the National Institutes of Health (NIH)/NCBI for developing the falsely duplicated regions that should be masked in GRCh38. We thank S. Miller at NIST for helping make available benchmark sets and READMEs. Certain commercial equipment, instruments or materials are identified to adequately specify experimental conditions or reported results. Such identification does not imply recommendation or endorsement by NIST, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. C.F. was funded by Instituto de Salud Carlos III (PI20/00876) and Ministerio de Ciencia e Innovación (RTC-2017-6471-1; AEI/FEDER, UE), cofinanced by the European Regional Development Fund ‘A Way of Making Europe’ from the European Union, and Cabildo Insular de Tenerife (CGIEU0000219140). J.M.L.-S. was funded by Consejería de Educación-Gobierno de Canarias and Cabildo Insular de Tenerife (BOC 163, 24/08/2017). F.J.S. and M.M. were supported by the NIH (UM1 HG008898). C.X. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.H.M. was supported by the NIH/National Human Genome Research Institute (R01 1R01HG011274-01 and U01 1U01HG010971). H.L. was supported by the NIH (R01 HG010040 and U01 HG010961). C.E.M. thanks funding from the WorldQuant Foundation, NASA (NNX14AH50G), the National Institutes of Health (R01MH117406, R01CA249054, R01AI151059, P01CA214274) and the Leukemia and Lymphoma Society (LLS) (MCL7001-18, LLS 9238-16, LLS-MCL7001-18).
Ethics declarations
Competing interests
A.M.W. and W.J.R. are employees and shareholders of Pacific Biosciences. A.F., Y.-C.H., R.G. and C.-S.C. are employees and shareholders of DNAnexus. S.M.E.S. is an employee of Roche. J.L. is a former employee and shareholder of Bionano Genomics. S.E.L. was an employee of Invitae. F.J.S. has sponsored travel from Pacific Biosciences and Oxford Nanopore Technologies. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Adam Ameur, Christian Marshall and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figures 1–17, Notes 1–5 and Table 1.
Supplementary Data 1
Additional characteristics of high-priority clinical genes.
Supplementary Data 2
Overlaps of the 5,038 genes on the GRCh38 primary assembly between both HG002 GRCh38 v4.2.1 and HG002 hifiasm v0.11.
Supplementary Data 3
Benchmarking of the hifiasm v0.11 assembly-based variants called with dipcall against the GIAB v4.2.1 benchmark for HG002.
Supplementary Data 4
Benchmarking statistics against the CMRG benchmark and evaluation callsets.
Supplementary Data 5
Manual curation results for evaluation and common errors in the v0.02.03 small variant benchmark.
Supplementary Data 6
Primer designs and reaction conditions for Long-Range PCR and Sanger confirmation.
Supplementary Data 7
Genes excluded from the CMRG benchmarks, with likely reasons for exclusion annotated for GRCh38 in the last column.
Supplementary Data 8
Commands for BWA-GATK variant calling on the original GRCh38 reference.
Supplementary Data 9
Commands for BWA-GATK variant calling on the v1 masked GRCh38 reference.
About this article
Cite this article
Wagner, J., Olson, N.D., Harris, L. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01158-1
DOI: https://doi.org/10.1038/s41587-021-01158-1