Sequencing Today and Tomorrow: Big Data Arrives in Genomics
Updated: Aug 22
In this post we’re going to talk about sequencing and genomics, and I hope it’s as educational for you as it was for me.
Let’s start by putting sequencing in perspective. There are a variety of methods we use in molecular pathology, each of which tells us different information about the genome of our patient. I find it helpful to think of molecular methods as divisible into two bins: methods that give us structural information and methods that give us sequence information. Remember that the genome is not just a string of base pairs; all that DNA is elegantly wrapped around histones and packed into chromatin, which is further organized into supercoiled DNA, which ultimately becomes a chromosome. The genome itself occurs on a scale from base pair to chromosome, and similarly we need methods that can give us information across this scale.
In the structural bin, we have karyotypes and fluorescence in situ hybridization (FISH). Karyotypes involve collecting condensed chromosomes at metaphase, staining them (several methods exist for this), and then arranging them into chromosome pairs to look for differences. FISH involves the hybridization of fluorescently labeled probes, but for the probe to be detected reliably it has to be fairly large, between 100 kilo base pairs and 1 mega base pair. Both karyotype and FISH are good for structural information like translocations, as well as medium-to-large insertions and deletions.
In the sequence bin we have sequencing, which we will talk about more in a minute, and chromosomal microarray (CMA). Like FISH, CMA involves the hybridization of probes, but their scale is much smaller – only 5-10 kilo base pairs. Because of this, CMA can give limited information on sequence. However, it is really best for copy number changes, such as unbalanced translocations, and smaller deletions and insertions. It’s important to remember that CMA cannot detect a balanced translocation, and for this you will need to use either karyotype or FISH.
Method | Karyotype | FISH | CMA | Sequencing |
Scale | Chromosome | Chromatin | Chromatin/Base pair | Base pair |
Resolution (in base pairs, bp) | 5 mega bp | 100 kilo bp - 1 mega bp | 5-10 kilo bp | 1 bp |
Good for | Translocations, Large deletions and insertions | Translocations, Medium deletions and insertions | Unbalanced translocations, Smaller deletions and insertions, Limited base pair data | Smallest deletions and insertions, Single base pair changes |
Bad for | Sequence data | Sequence data | Balanced translocations, Structural information | Structural information |
Table credit: Caitlin Raymond
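To make the table concrete, here is a minimal Python sketch that filters methods by resolution. It only asks whether a method's resolution is fine enough for a change of a given size, which is a simplification: the "good for" and "bad for" rows matter just as much in practice. The names `RESOLUTION_BP`, `CANNOT_SEE_BALANCED`, and `candidate_methods` are made up for illustration, and FISH is represented by the lower end of its range.

```python
# Approximate resolution of each method in base pairs, taken from the table.
# FISH is listed as a range (100 kb to 1 Mb); the lower bound is used here.
RESOLUTION_BP = {
    "Karyotype": 5_000_000,
    "FISH": 100_000,
    "CMA": 5_000,
    "Sequencing": 1,
}

# Methods the table lists as poor for balanced translocations or structure.
CANNOT_SEE_BALANCED = {"CMA", "Sequencing"}


def candidate_methods(change_size_bp, balanced_translocation=False):
    """Return the methods whose resolution is fine enough for this change."""
    methods = []
    for name, resolution in RESOLUTION_BP.items():
        if balanced_translocation and name in CANNOT_SEE_BALANCED:
            continue  # no copy number change, so these methods miss it
        if change_size_bp >= resolution:
            methods.append(name)
    return methods


print(candidate_methods(50_000))
print(candidate_methods(10_000_000, balanced_translocation=True))
```

A 50 kilo bp deletion falls below the resolution of karyotype and FISH, so only CMA and sequencing come back; a large balanced translocation leaves only karyotype and FISH.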
Finally, there is sequencing, which can only give information about the base pairs in a string of DNA, and cannot give information about structure. It can detect the smallest deletions and insertions, as well as single base pair changes. Sequencing began with the Sanger method, which is labor-intensive and limited to about 1 kilo base pair of output. However, new sequencing platforms have come on the market, and we’ll talk about two of particular interest here: next generation sequencing (NGS) and nanopore sequencing.
There are multiple platforms for NGS, which vary in their technical details. What they have in common is that each platform uses massively parallel sequencing of many small DNA fragments attached to a solid surface. Sequencing may occur by adding only one nucleotide at a time and assessing for its incorporation (pyrosequencing, ion semiconductor sequencing), or by adding a mixture of reversible terminator nucleotides, each conjugated to a unique fluorescent label (sequencing by synthesis). Regardless of the technical details, NGS systems output a massive amount of sequencing data, far more than the Sanger sequencing method that preceded them. All that data can be collated to sequence large segments of DNA rapidly and efficiently, including whole genomes.
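As a rough illustration of what "collating" many short reads can mean, here is a toy greedy overlap merge in Python. This is a sketch under strong assumptions (error-free reads, all from the same strand, no repeats), and it is not how any particular NGS platform or pipeline works. The function names and the example reads are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest match is shorter than min_len.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACA", "ACAGGT"]))
```

The three short reads overlap end to end, so the sketch stitches them into one longer sequence. Real data involves sequencing errors, both strands, and repeats, which is why purpose-built aligners and assemblers are used instead.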
Nanopore sequencing, in contrast to most forms of NGS, does not utilize a DNA polymerase. Instead, a helicase sits atop a nanopore that crosses a membrane barrier. The helicase feeds a single strand of the DNA through the nanopore, which has an ionic current running through it. Each nucleotide makes a predictable change in the current as it passes through the narrowest aperture of the nanopore, and the sequence is determined from those changes. Nanopore sequencing can produce reads hundreds of kilo bases long, including across stretches of DNA that are not easily sequenced by other methods, such as telomeres and centromeres.
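The idea of reading bases from current changes can be sketched with a toy Python example. Everything here is an assumption for illustration: the current levels in `LEVELS_PA` are invented numbers, and the sketch pretends there is exactly one clean reading per nucleotide. Real basecalling is considerably more involved.

```python
# Hypothetical mean current level (in picoamps) for each base.
# These values are invented for illustration, not measured.
LEVELS_PA = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}


def call_bases(current_trace):
    """Assign each current reading to the base with the nearest level."""
    calls = []
    for reading in current_trace:
        base = min(LEVELS_PA, key=lambda b: abs(LEVELS_PA[b] - reading))
        calls.append(base)
    return "".join(calls)


print(call_bases([81.2, 124.0, 109.5, 96.3]))
```

Each noisy reading is snapped to the closest reference level, which is the simplest possible version of turning a current trace into a sequence.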
To understand the role of these newer sequencing technologies in genomics, it helps to know a little of the history of the human genome. The original Human Genome Project launched in 1990, and in 2003 the project announced that it had sequenced 92% of the genome across 10 samples. The remaining 8% included difficult-to-sequence regions like telomeres, and only recently, in 2022, did the Telomere to Telomere project release the first-ever fully sequenced human genome. You’ll also note that only 10 samples were sequenced in the Human Genome Project, hardly a representative sample. In 2008, the 1000 Genomes Project was launched, aiming to fully sequence 1,000 samples of the human genome from around the globe.
Image credit: Caitlin Raymond
To make sense of all this data, the Encyclopedia of DNA Elements (ENCODE) Project was launched, aiming to produce a comprehensive database of all functional elements in the human genome. The National Institutes of Health also started two programs to investigate the role of genetics in human disease. The Centers for Common Disease Genomics aim to understand the role of genetics in common diseases such as diabetes and high blood pressure. In contrast, the Centers for Mendelian Genomics aim to further study the role of specific genes in the development of inherited diseases, such as cystic fibrosis.
When it comes to classifying the results of this data, two terms are commonly used but frequently misunderstood. A polymorphism, or single nucleotide polymorphism (SNP), is a sequence change at a single base pair that is present in ≥ 1% of the genomes in the reference population. A variant is a sequence change present in < 1% of those genomes. Obviously, as more genomes and sequences are added to the reference database, we’ll have a better understanding of which sequence changes are SNPs and which are variants.
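The 1% cutoff is easy to express in code. Here is a minimal Python sketch; the function name `classify_by_frequency` and the counts in the example are made up for illustration, and the frequency is simply the number of genomes carrying the change divided by the number of genomes in the reference database.

```python
POLYMORPHISM_THRESHOLD = 0.01  # the 1% cutoff described above


def classify_by_frequency(carrier_count, total_count):
    """Label a sequence change by how often it appears in the reference set."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    frequency = carrier_count / total_count
    if frequency >= POLYMORPHISM_THRESHOLD:
        return "polymorphism"
    return "variant"


# The same change can get a different label as the database grows.
print(classify_by_frequency(1, 50))    # seen in 1 of 50 genomes (2%)
print(classify_by_frequency(1, 500))   # seen in 1 of 500 genomes (0.2%)
```

The two calls show why the label depends on the size of the database: a change seen once in 50 genomes looks common, but the same single observation among 500 genomes falls below the cutoff.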
Variants are often investigated for links to disease states, and are currently classified on a five-point scale: benign, likely benign, unknown significance, likely pathogenic, and pathogenic. Variants and their assigned classification can be found in the ClinVar database, which is freely available online. With so much yet unknown about the human genome and how it influences disease, it can come as little surprise that assigning a category to a variant is challenging, and sometimes variants are reclassified based on updated studies in the scientific literature.
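One way to picture the five-point scale and a reclassification is as an ordered enumeration. The Python sketch below is illustrative only: the numeric values and the names `Classification` and `reclassification_direction` are assumptions for this example, not ClinVar's data model or API.

```python
from enum import IntEnum


class Classification(IntEnum):
    """The five-point scale, ordered from benign to pathogenic."""
    BENIGN = 1
    LIKELY_BENIGN = 2
    UNKNOWN_SIGNIFICANCE = 3
    LIKELY_PATHOGENIC = 4
    PATHOGENIC = 5


def reclassification_direction(old, new):
    """Describe which way a variant moved on the scale."""
    if new > old:
        return "toward pathogenic"
    if new < old:
        return "toward benign"
    return "unchanged"


print(reclassification_direction(
    Classification.UNKNOWN_SIGNIFICANCE, Classification.LIKELY_BENIGN))
```

Treating the scale as ordered makes it simple to ask whether an updated classification moved toward the benign or the pathogenic end.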
Most often, a variant with unknown significance (VUS) is reassigned to either the benign or pathogenic categories. In a recent study, Veenstra et al. found that most VUS are being reclassified as benign as we learn more about the diversity of the human genome [1]; however, some still are being reclassified as pathogenic. SoRelle et al. published similar findings, and moreover found that the rate at which VUS are being reclassified is steadily increasing [2]. In 2018, Mersch et al. found that the average time to reclassification of a VUS dropped from a mean of ~2.5 years to less than 1 year between 2006 and 2016, with no sign of slowing down [3].
This raises an important question: if a patient was notified of a VUS in their clinical sequencing results, and the status of that VUS changes, do we have a duty to inform the patient? A minority of clinical genetics centers are already doing so. In a 2018 survey of 105 genetics centers, 26 (or ~25%) responded that they were routinely recontacting patients if a VUS in their results was reclassified [4]. However, the issue of consent for recontacting has not been fully addressed. When do patients consent to recontact for updated information about their clinical genetics results? What if they do not consent? In a survey about their preferences regarding recontact, 50.4% of patients declined to receive updates about their results, commonly citing concerns about insurability [5]. Another as yet unanswered question is who will be responsible for updating the patient. In a 2019 statement, the American College of Medical Genetics and Genomics (ACMG) suggested that the ordering provider bear chief responsibility, but that patients, consulting geneticists, clinical labs, and even research laboratories all share some responsibility in making this possible [6].
Image credit: Caitlin Raymond
To summarize, big data has arrived in genomics with new sequencing technologies enabling the production of huge datasets. With all this data our understanding of polymorphisms and particularly variants is rapidly changing, and there is ongoing debate about how to convey this to patients.
I’d like to close with some thoughts on the societal impact of big data in genomics. Technology in molecular pathology is racing ahead, with societal customs and our legal system struggling to keep up. The next 10 years will be critical for laying fair groundwork for who gets access to this data and how this data is used.
"One of my concerns has been the limits on applications of our understanding of the genome. Should there be limits? I think there should. I think the public has expressed their concern about ways this information might be misused." - Francis Collins
1. Veenstra, D. L., Rowe, J., Pagán, J. A., Brown, H. S., Schneider, J., Gupta, A., ... & Appelbaum, P. S. (2021). Reimbursement for genetic variant reinterpretation: 5 questions payers should ask. The American journal of managed care, 27(10), e336.
2. SoRelle JA, Thodeson DM, Arnold S, Gotway G, Park JY. Clinical Utility of Reinterpreting Previously Reported Genomic Epilepsy Test Results for Pediatric Patients. JAMA Pediatr. 2019;173(1):e182302. doi:10.1001/jamapediatrics.2018.2302
3. Mersch J, Brown N, Pirzadeh-Miller S, Mundt E, Cox HC, Brown K, Aston M, Esterling L, Manley S, Ross T. Prevalence of Variant Reclassification Following Hereditary Cancer Genetic Testing. JAMA. 2018 Sep 25;320(12):1266-1274. doi: 10.1001/jama.2018.13152. PMID: 30264118; PMCID: PMC6233618.
4. Sirchia F, Carrieri D, Dheensa S, et al. Recontacting or not recontacting? A survey of current practices in clinical genetics centres in Europe. Eur J Hum Genet. 2018 Jul;26(7):946-954. doi: 10.1038/s41431-018-0131-5. Epub 2018 Apr 23. PMID: 29681620; PMCID: PMC6018700.
5. Henrikson NB, Scrol A, Leppig KA, Ralston JD, Larson EB, Jarvik GP. Preferences of biobank participants for receiving actionable genomic test results: results of a recontacting study. Genet Med. 2021 Jun;23(6):1163-1166. doi: 10.1038/s41436-021-01111-2. Epub 2021 Feb 18. PMID: 33603197; PMCID: PMC8194390.
6. David, K.L., Best, R.G., Brenman, L.M. et al. Patient re-contact after revision of genomic test results: points to consider—a statement of the American College of Medical Genetics and Genomics (ACMG). Genet Med 21, 769–771 (2019). https://doi.org/10.1038/s41436-018-0391-z