Whole-Genome Sequencing

Reviewed on June 28, 2024

Introduction

The purpose of the whole-genome sequencing module is to provide an overview of whole-genome sequencing and next-generation sequencing, including historical, technical and utilization perspectives.

What is Whole-Genome Sequencing?

The NCI defines whole-genome sequencing in humans as “a laboratory process that is used to determine nearly all of the approximately 3 billion nucleotides of an individual’s complete DNA sequence, including non-coding sequence.” The focus of this module is on whole-genome sequencing in humans.

Whole-genome sequencing was originally performed for the human genome using Sanger sequencing, an effort that took more than a decade and cost more than $1 billion. Today, we use newer technology referred to as “next-generation sequencing,” “massively parallel sequencing” or “high-throughput sequencing.” These techniques can sequence both DNA and RNA faster and more cheaply than traditional Sanger sequencing; a run typically takes a few days and costs around $1,000.

Sanger Sequencing

Frederick Sanger was born in 1918 in England and won not one but two Nobel Prizes in chemistry. His first Nobel Prize was awarded in 1958 for his work developing a method to read the amino acid sequence of insulin, and his second was awarded in 1980 for his work in developing the DNA sequencing method now referred to as Sanger sequencing, or “first-generation sequencing.” Sanger sequenced the first full DNA genome, that of a bacteriophage (a virus that infects bacteria) called phiX174.

Sanger sequencing was used to sequence the first human genome in the Human Genome Project, one of the largest international collaborative projects, published in Nature in 2001 after more than a decade’s work by scientists around the globe. Today, Sanger sequencing is used for projects focused on single genes or regions and to validate next-generation sequencing data when the mutation is present at a high enough percentage. Sanger sequencing is less sensitive than next-generation sequencing methods, so it is used for confirmation only when the frequency of the mutation in the sample is at least 25%. For lower frequencies, other methodology is required (e.g., digital polymerase chain reaction).

Whole-Genome Sequencing Methods

Sequencing technologies are unable to sequence the entire human genome at once. Thus, the genome must be broken into smaller chunks of DNA, sequenced and then put back together in the correct order using bioinformatics approaches. There are several methods of DNA sequencing, including clone-by-clone and whole-genome shotgun methods.

Clone-by-Clone

In this method, large sections of the genome are copied and inserted into bacteria. The bacteria are then grown to produce identical copies, or “clones,” each containing approximately 150,000 base pairs of the genome to be sequenced. The inserted DNA in each clone is then broken into smaller, overlapping chunks of about 500 base pairs, and these smaller inserts are sequenced. After sequencing, the overlapping portions are used to reassemble the clone. This approach was used with Sanger sequencing to sequence the first human genome. It is time-consuming and costly, but it is reliable.

Whole-Genome Shotgun

As the name implies, “shotgun” sequencing is a method that breaks DNA into small random pieces for sequencing and reassembly. The pieces of DNA are also cloned into bacteria for growth, isolation and subsequent sequencing. Because the pieces are random, they contain overlapping sequences that aid in reassembly into the original DNA order. This approach was originally used in Sanger sequencing but is now also used in next-generation sequencing methods, where it provides rapid genome sequencing at lower cost. It is suited only to shorter “reads” (i.e., sequences of shorter DNA fragments that must be put back together again). Because the sequence is reassembled from overlapping regions of short reads, the method works best when a reference genome is available, and it requires sophisticated computational approaches to reassemble the sequence. It also can be challenging for genomes with many repetitive regions.

Assembly of Sequencing Reads

Because genomes are sequenced in varying lengths of DNA fragments, the resulting sequences must be put back together. This is referred to as “assembly,” or “reassembly.” Two common approaches are de novo assembly and assembly by reference mapping.

De novo assembly is performed by identifying overlapping regions in the DNA sequences, aligning the sequences and putting them back together to form the genome. This is done without any existing sequence for comparison. Mapping to a reference genome aligns the new sequencing data against an existing genome, which serves as a comparator.
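The overlap-and-merge idea behind de novo assembly can be sketched in a few lines of Python. This is a toy illustration only, assuming error-free reads from a short sequence with no repeats; the function names, the example reads and the minimum overlap of 3 bases are all hypothetical, and production assemblers use far more sophisticated graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three hypothetical overlapping reads, given out of order.
reads = ["GCGTACCT", "ACCTTAGA", "ATGGCGTA"]
print(greedy_assemble(reads))
```

No reference sequence is consulted at any point; the reads are ordered purely by their overlaps, which is also why repetitive regions (where many reads overlap equally well) make this approach difficult.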

Although de novo assembly can be challenging, this approach is the only one available for sequencing new organisms. Additionally, de novo assembly produces results with less bias than mapping to a reference genome. Mapping to a reference genome is easier and requires fewer contiguous reads, but new or unexpected sequences can be lost. The sequence results obtained by this method are only as good as the reference genome chosen; however, it can provide better identification of single nucleotide polymorphisms (SNPs). Multiple institutions and genomic sequencing companies have invested considerable time and effort into creating improved reference genomes. Single nucleotide polymorphisms are known to vary by race and ethnicity; thus, multiple reference genomes have been created for various races/ethnicities.
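To make the trade-off concrete, here is a minimal Python sketch of mapping a read to a reference. It assumes ungapped alignment and simply tries every position, which real aligners do not do (they use indexed data structures); the function name, the mismatch limit and the sequences are hypothetical.

```python
def map_read(read: str, reference: str, max_mismatches: int = 2):
    """Return (position, mismatches) for the best ungapped placement, or None.

    Each mismatch is reported as (reference_position, reference_base, read_base).
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = [
            (pos + k, ref_base, base)
            for k, (ref_base, base) in enumerate(zip(window, read))
            if ref_base != base
        ]
        if len(mismatches) > max_mismatches:
            continue  # too different from the reference at this position
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best


reference = "ATGGCGTACCTTAGA"

# A read that differs from the reference at one base: a candidate SNP.
print(map_read("GCGTTCCT", reference))

# A read with no counterpart in the reference is left unmapped.
print(map_read("TTTTTTTT", reference))
```

The first read is placed and its single mismatch is reported as a candidate variant, whereas the second read returns None. That second case mirrors how new or unexpected sequences can be lost when the reference does not contain them.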

Examples of Next-Generation Sequencing Platforms

Several companies focus on development and marketing of next-generation sequencing machines (often referred to as “platforms”) for use in whole-genome (and other) sequencing. Illumina is considered by many to be the leader because of the number of users of its systems, and it offers multiple platforms depending on the need. The Illumina HiSeq is one of the more common sequencers found in laboratories, including major research institutions, companies providing next-generation sequencing services for clinics and labs, and pathology laboratories. It has a high throughput and is capable of sequencing many genomes rapidly at reasonable cost. This instrument also can be used to look at copy number variation, mutations and other alterations, and RNA expression levels for transcriptomics. Targeted sequencing panels, which are much smaller, have become popular in the clinic, where faster turnaround times are needed for treatment of patients; in response, Illumina created the MiSeq, which can provide same-day sequencing results for very small panels. Illumina also has produced multiple variations of its sequencers, optimizing output, turnaround time and cost for specific disease areas and use cases.

Thermo Fisher Scientific’s Ion Torrent or Ion Proton uses a completely different technology based on detection of pH differences and was once expected to provide better utility for clinical applications because it was easier to use, cost less and provided faster turnaround time. However, Illumina countered with new machines to fit these needs. Consequently, both are found in research and clinical laboratories.

Other technologies developed recently use different novel approaches. A few examples are provided below.

Oxford Nanopore Technologies introduced the MinION, which enables anyone to sequence on a desktop computer using a USB device. The DNA is passed through a protein nanopore embedded in a membrane, and each nucleotide is detected by the characteristic change it produces in the ionic current through the pore.

Pacific Biosciences introduced its single molecule, real-time technology with the longest reads to date, with average read lengths of more than 10,000 base pairs compared with more than 150 base pairs for short-read platforms. Single molecule, real-time technology uses a chip with single DNA molecules attached. Zero-mode waveguide technology confines detection to a volume small enough to observe a single fluorescently labeled nucleotide as the DNA polymerase incorporates it, allowing each base to be detected. The error rate of this instrument is still higher than that of some prior technologies, but a lot of interest has been generated, and there is hope that speed and costs can be further optimized with the new approach.

Coverage Breadth and Depth

Coverage depth refers to the number of reads that show a specific nucleotide in the reconstructed DNA sequence. Coverage breadth refers to the fraction of the genome that is covered by reads. A read is a string of A, T, C, G bases that corresponds to a stretch of the reference DNA. There are millions of reads in a sequencing run. Increased coverage depth results in increased confidence in variant identification.

For the human genome, a 10- to 30-times coverage depth is acceptable for detecting mutations, SNPs and rearrangements. A next-generation sequencing approach that provides a coverage depth of 30 times is considered to have high coverage. However, for a fixed amount of sequencing, as coverage depth increases, coverage breadth decreases (Figure 1-22).

Figure 1-22: Relationship between coverage breadth vs. coverage depth. Source: Elaine Mardis, PhD
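Depth and breadth can both be computed directly from where reads land on the genome. The Python sketch below is illustrative only: it uses the standard estimate that average depth equals number of reads times read length divided by genome size, and the function names, the 20-base toy genome and the read placements are hypothetical.

```python
def expected_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def per_base_depth(alignments, genome_size: int) -> list[int]:
    """Count how many reads cover each position.

    `alignments` is a list of (start, length) pairs, 0-based.
    """
    depth = [0] * genome_size
    for start, length in alignments:
        for pos in range(start, min(start + length, genome_size)):
            depth[pos] += 1
    return depth


def breadth(depth: list[int], min_depth: int = 1) -> float:
    """Fraction of positions covered by at least `min_depth` reads."""
    return sum(d >= min_depth for d in depth) / len(depth)


# Toy example: three 8-base reads placed on a 20-base genome.
depth = per_base_depth([(0, 8), (4, 8), (6, 8)], genome_size=20)
print(depth)
print(sum(depth) / len(depth))   # mean depth
print(breadth(depth))            # fraction covered at least once
print(breadth(depth, 2))         # fraction covered at least twice

# Human-scale estimate: 600 million reads of 150 bases on a 3 billion base genome.
print(expected_depth(600_000_000, 150, 3_000_000_000))
```

In the toy example the reads pile up in the middle of the genome, so some positions are covered three times while the last six positions are not covered at all. The human-scale figures are round numbers chosen to land on the 30-times depth mentioned above, not a description of any particular instrument run.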

Whole-Genome vs. Whole-Exome Sequencing vs. Targeted Sequencing Panels

Whole-genome sequencing determines the order of the nucleotides (A, C, G, T) in the entire genome that makes up an organism. The goal of whole-genome sequencing is, typically, to look for genetic aberrations (e.g., single nucleotide variants, deletions, insertions and copy number variants). Because the entire genome is being sequenced, changes in the noncoding sections of DNA within genes, called introns, can also be determined. Under normal conditions, introns are removed by RNA splicing after transcription, and alterations in these regions can affect whether the DNA is transcribed into RNA or can result in a truncated, non-functional protein.

An alternative approach, called whole-exome sequencing, is to sequence only the exome. The exome is the part of the genome formed by exons, or coding regions, which when transcribed and translated are expressed as proteins. The exome composes only about 2% of the whole genome. Because the exome is so much smaller than the genome, it can be sequenced at a much greater depth (number of times a given nucleotide is sequenced) for lower cost. This greater depth provides more confidence in low-frequency alterations. Sequencing depth can become even greater for lower cost by using a targeted or “hot-spot” sequencing panel, which covers a select number of specific genes, or coding regions within genes, that are known to harbor mutations that contribute to pathogenesis of disease, and may include clinically actionable genes of interest (e.g., diagnostic, theranostic). These panels are often used in clinical care to provide greater confidence, keep the cost down and provide better opportunity for insurance reimbursement. However, whole-exome sequencing and targeted panels see only part of the story because they focus on reduced areas of the genome. Consequently, for some research projects or genetic testing, whole-genome sequencing may be advantageous.
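The effect of target size on depth is simple arithmetic, sketched below in Python. The 90 billion bases of sequencing output and the 5% allele fraction are hypothetical round numbers, the exome is taken as 2% of a 3 billion base genome as stated above, and the calculation assumes every sequenced base lands on the target, which real capture methods do not achieve.

```python
GENOME_SIZE = 3_000_000_000            # approximate human genome, in bases
EXOME_SIZE = GENOME_SIZE * 2 // 100    # about 2% of the genome


def depth_for_target(total_bases: int, target_size: int) -> float:
    """Average depth when all sequenced bases fall on the target region."""
    return total_bases / target_size


def expected_variant_reads(depth: float, allele_fraction: float) -> float:
    """Expected number of reads carrying a variant present at a given fraction."""
    return depth * allele_fraction


total_bases = 90_000_000_000  # hypothetical output of one sequencing run

for name, size in [("whole genome", GENOME_SIZE), ("whole exome", EXOME_SIZE)]:
    depth = depth_for_target(total_bases, size)
    supporting = expected_variant_reads(depth, 0.05)
    print(f"{name}: {depth:.0f}x depth, about {supporting:.1f} reads for a 5% variant")
```

Under these assumptions the same output gives 30-times depth across the genome but 1,500-times depth across the exome, so a variant present in 5% of the sample is expected in only one or two reads in the first case and in about 75 reads in the second.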

Strengths and Limitations of Next-Generation Sequencing

Strengths

The major strength of next-generation sequencing is that it can detect abnormalities across the entire genome (whole-genome sequencing only), including substitutions, deletions, insertions, duplications, copy number changes (gene and exon) and chromosome inversions/translocations. A second strength is that it can detect all of those abnormalities using less DNA than is required for traditional DNA sequencing approaches. Next-generation sequencing is also less costly and has a faster turnaround time.

Limitations

There are several limitations to using next-generation sequencing. Next-generation sequencing provides information on a large number of molecular aberrations, and for many of the identified abnormalities, the clinical significance is currently unknown. Next-generation sequencing also requires sophisticated bioinformatics systems, fast data processing and large data storage capabilities, which can be costly. Although many institutions may have the ability to purchase next-generation sequencing equipment, many lack the computational resources and staffing to analyze and clinically interpret the data.

Time and Costs

The time to perform most next-generation sequencing methods and receive results has been greatly reduced. Starting from the day the laboratory receives the tumor specimen, it takes approximately 10 days for a physician to receive a whole-genome sequencing report.

Costs of sequencing the whole human genome have decreased significantly. In 2006, the cost was approximately $20 million to $25 million. By 2016, the cost to sequence a human genome was generally less than $1,000.

Next-Generation Sequencing in the Research and Clinic Settings

Although whole-genome sequencing can identify an individual’s entire DNA sequence, only information from approximately 3% of the genome can be used in clinical practice; thus, whole-genome sequencing is more fruitful in the research setting than in the clinic. However, many actionable mutations can be identified using next-generation sequencing methods, including hot spot assessment, and a sampling of these mutations and corresponding FDA-approved targeted therapies is provided in the table below.

For patients, the costs of some next-generation sequencing methods available for clinical use to guide therapy decisions are now manageable, and many next-generation sequencing panels are covered by insurance for approved indications.

Importance of Bioinformatics

The field of computer science called bioinformatics is used to analyze whole-genome sequencing data. This involves algorithm, pipeline and software development, and analysis, transfer and storage/database development of genomics data.

A typical whole-genome sequencing workflow contains the following steps:

  1. quality control and data grooming;
  2. genome assembly and/or variant calling; and
  3. post-assembly analysis.
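As an illustration of the first step only, the Python sketch below filters reads by average base quality. It assumes reads arrive in four-line FASTQ records with Phred quality scores encoded as ASCII characters offset by 33; the function names, the two example reads and the cutoff of 20 are hypothetical, and real pipelines rely on dedicated quality control tools with many more checks.

```python
def parse_fastq(text: str):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read, assuming the given ASCII offset."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def quality_filter(records, min_mean_quality: float = 20.0):
    """Keep only reads whose mean quality meets the cutoff."""
    return [
        (name, sequence)
        for name, sequence, quality in records
        if mean_phred(quality) >= min_mean_quality
    ]


# Two made-up reads: one with high quality scores, one with very low scores.
fastq_text = "\n".join([
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTTCGT", "+", "########",
])

print(quality_filter(parse_fastq(fastq_text)))
```

Only the first read passes the filter here. Reads that pass this kind of check would then move on to assembly or variant calling in the later steps of the workflow.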

The volume of data produced by next-generation sequencing platforms is massive. The data collected pertain not only to the DNA sequencing results but also to the sequencing performance, to assist with detection of errors or repetitive sequencing. This presents data management and storage issues. Additionally, special software and fast computing systems are required to process the immense volume of data. Specialized, trained bioinformaticists are essential to the analysis of data generated by next-generation sequencing, as well as to the continued success and growth of precision medicine.

 

References

  • Brown TA. Sequencing genomes. In: Genomes. 2nd ed. Oxford: Wiley-Liss; 2002. https://www.ncbi.nlm.nih.gov/books/NBK21117/. Accessed January 3, 2017.
  • Diergaarde D. Next-generation sequencing 101: Reading and interpreting the data. Head and Neck Specialized Program of Research Excellence Monthly Meeting. April 9, 2012. https://upci.upmc.edu/spore/headneck/eventArchive.cfm. Accessed March 15, 2017.
  • Ekblom R, Wolf JB. A field guide to whole-genome sequencing, assembly and annotation. Evol Appl. 2014;7:1026-1042.
  • Frederick Sanger: method man, problem solver. Scitable website. http://www.nature.com/scitable/topicpage/frederick-sanger-method-man-problem-solver-6537485. Accessed December 20, 2016.
  • Genome Sequencing. Genome News Network website. http://www.genomenewsnetwork.org/resources/whats_a_genome/Chp2_1.shtml. Updated January 15, 2003. Accessed December 20, 2016.
  • Gonzalez-Garay ML. The road from next-generation sequencing to personalized medicine. Per Med. 2014;11(5):523-544. doi:10.2217/pme.14.34.
  • Koboldt DC, Ding L, Mardis ER, Wilson RK. Challenges of sequencing human genomes. Brief Bioinform. 2010;11(5):484-498. doi:10.1093/bib/bbq016.
  • Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921. doi:10.1038/35057062.
  • Mardis ER. Whole-genome sequencing: new technologies, approaches, and applications. In: Ginsburg GS, Willard HF, eds. Genomic and Personalized Medicine. 2nd ed. Academic Press; 2013:87-93.
  • Mardis ER. Whole genome sequencing by next-generation methods: Genome-forward medicine. American Society for Clinical Pathology Annual Meeting. October 19-22, 2011; Las Vegas, NV. http://dn3g20un7godm.cloudfront.net/2011/AM11SA/93.pdf. Accessed March 15, 2017.
  • Rizzo JM, Buck MJ. Key Principles and Clinical Applications of “Next-Generation” DNA Sequencing. Cancer Prev Res (Phila). 2012;5(7):887-900.
  • Sequencing coverage. Illumina Inc. https://www.illumina.com/science/education/sequencing-coverage.html. Accessed March 1, 2017.
  • The cost of sequencing a human genome. National Human Genome Research Institute website. https://www.genome.gov/sequencingcosts/. Updated July 6, 2016. Accessed December 20, 2016.
  • Timeline: history of genomics. Yourgenome website. http://www.yourgenome.org/facts/timeline-history-of-genomics. Updated February 5, 2016. Accessed January 2, 2017.
  • Tsai E, Shakbatyan R, Evans J, et al. Bioinformatics workflow for clinical whole genome sequencing at Partners HealthCare Personalized Medicine. J Pers Med. 2016;6(1):12. doi:10.3390/jpm6010012.
  • Vnencak-Jones CL, Berger MF, Pao W. Types of molecular tumor testing. My Cancer Genome website. https://www.mycancergenome.org/content/molecular-medicine/types-of-molecular-tumor-testing/. Updated February 8, 2016. Accessed February 9, 2017.
  • Wetterstrand KA. DNA sequencing costs: Data from the NHGRI Genome Sequencing Program (GSP). National Human Genome Research Institute website. https://www.genome.gov/sequencingcostsdata. Accessed December 28, 2016.
  • Whole-genome sequencing. NCI Dictionary of Genetics Terms. National Cancer Institute website. https://www.cancer.gov/publications/dictionaries/genetics-dictionary?cdrid=740456. Accessed December 20, 2016.
  • Whole genome sequencing (WGS) vs. whole exome sequencing (WES). Genohub Blog website. https://blog.genohub.com/2015/02/21/whole-genome-sequencing-wgs-vs-whole-exome-sequencing-wes/. Posted February 21, 2015. Accessed December 20, 2016.