New DNA sequencing technologies: applications, promises and challenges for pharmacogenomics


Posted: 30 July 2009 | Daniel G. MacArthur, Visiting Fellow, Wellcome Trust Sanger Institute, supported by an Overseas Biomedical Fellowship from the Australian National Health and Medical Research Council

We are currently on the cusp of a technology-driven revolution in the field of genomics. The rapid evolution of DNA sequencing technology is already providing researchers with the ability to generate data about genetic variation and patterns of gene expression on an unprecedented scale; within just a few years it is likely that these technologies will allow accurate sequencing of complete human genomes to become a routine tool for researchers and clinicians. This review covers the emerging field of new DNA sequencing technologies, and outlines the potential benefits – and the challenges – of these technologies for pharmacogenomics.

The importance of DNA sequencing

DNA sequencing provides a useful read-out for a variety of sources of biomedically significant information. In its most straightforward application DNA sequencing can be used directly to uncover patterns of genetic variation for exploring associations with clinical traits. However, other types of biological information can also be converted into DNA for analysis with DNA sequencing technologies: for instance, RNA molecules can be reverse-transcribed into DNA, and patterns of cytosine methylation (an important epigenetic marker) can be converted into a form accessible to DNA sequencing through the process of bisulfite conversion. Improvements in DNA sequencing technology will thus lead to advances in our understanding of many different biological processes.
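The bisulfite conversion mentioned above can be illustrated with a toy in-silico model: bisulfite treatment converts unmethylated cytosines to uracil, which is read as thymine after amplification, while methylated cytosines are protected and still read as cytosine. The sketch below models a single strand only; the function names and the example sequence are illustrative assumptions, not part of any real analysis pipeline.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence as it would be read after bisulfite treatment.

    Unmethylated C is read as T; a methylated C (0-based positions listed
    in methylated_positions) is protected and is still read as C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """Infer methylated cytosines: reference C that is still read as C."""
    return [
        i
        for i, (ref, obs) in enumerate(zip(reference, converted))
        if ref == "C" and obs == "C"
    ]


reference = "ACGTCCGA"  # hypothetical example sequence
converted = bisulfite_convert(reference, methylated_positions=[1])
print(converted)                               # ACGTTTGA
print(call_methylation(reference, converted))  # [1]
```

Comparing the converted read with the untreated reference is what turns methylation, which is not itself a sequence difference, into one that a sequencer can detect.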

Second-generation sequencing platforms

Since the late 1970s DNA sequencing has been dominated by a single chemical approach, albeit in increasingly sophisticated formats: dideoxy terminator or “Sanger” sequencing [1]. In a modern and highly automated format, dideoxy sequencing was the workhorse behind both competing efforts to sequence the human genome [2,3] and remains the most widely used sequencing technology today. However, several new so-called “second-generation” sequencing technologies offering tremendously higher throughput are now rapidly displacing the dideoxy approach.

Three second-generation technologies are currently commercially available, differing in their underlying chemistry and many other parameters both from one another and from traditional dideoxy sequencing: the 454 platform offered by Roche (www.454.com), Illumina’s Genome Analyzer technology (http://illumina.com/pages.ilmn?ID=204), and the SOLiD platform from Applied Biosystems (http://solid.appliedbiosystems.com/). A fourth second-generation short-read platform, developed by Complete Genomics (http://www.completegenomics.com/), is currently on the verge of commercial launch; notably, however, the company does not plan to sell the instruments themselves, but rather intends to use the technology solely to generate complete human genome sequences as a service from within its own custom-built sequencing facilities.

All four second-generation platforms share a conceptually similar approach to sequencing: (1) random fragmentation of input DNA and ligation of adaptors to fragment ends; (2) immobilisation of the DNA fragments on a solid matrix (or bead), followed by amplification to produce discrete clusters of identical molecules; and (3) sequencing of one or both ends of the resulting fragments through alternating cycles of substrate addition and imaging.
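The shared workflow described above can be sketched as a toy simulation. This is a minimal illustration under simplifying assumptions: fragments have a fixed length, step (2) (immobilisation and amplification) is not modelled because it does not change the sequence, and reads are error-free. All function names and parameter values are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def random_fragments(genome, n_fragments, fragment_len, rng):
    """Step 1: sample fixed-length fragments at random start positions."""
    last_start = len(genome) - fragment_len
    starts = [rng.randrange(last_start + 1) for _ in range(n_fragments)]
    return [genome[s:s + fragment_len] for s in starts]


def paired_end_reads(fragment, read_len):
    """Step 3: read read_len bases inward from each end of a fragment."""
    forward = fragment[:read_len]
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse


rng = random.Random(0)  # fixed seed so the toy run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(1000))
fragments = random_fragments(genome, n_fragments=5, fragment_len=200, rng=rng)
for fragment in fragments:
    print(paired_end_reads(fragment, read_len=35))
```

The second read of each pair comes from the opposite strand, which is why it is reverse-complemented relative to the fragment.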

The most notable advantage of the second-generation technologies over Sanger sequencing is throughput: all are capable of generating millions to billions of bases of sequence data in a single run, orders of magnitude more than could be generated by dideoxy sequencing. This massive increase in throughput permits DNA sequencing experiments to be performed on a scale far beyond those feasible with Sanger sequencing, at an ever-reducing cost. As an example, the sequencing of near-complete individual human genomes has now been performed with all three of the major second-generation technologies [4,5,6].

The declining cost of sequencing is best illustrated by the fact that while the first draft human genome sequence (completed in 2001) cost several billion US dollars, Illumina recently launched a retail whole-genome sequencing service for under US$50,000 and Complete Genomics is promising a commercial price beginning at US$5,000. At the current rate of decrease, many industry observers expect to see retail human genome sequences offered for under US$1,000 within the next two years.
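To put the projected price drop in perspective, the short calculation below works out the halving time implied by a fall from US$50,000 to US$1,000 over two years. It assumes a constant exponential rate of decline, which is a simplification introduced here; the function name is illustrative.

```python
import math


def halving_time(start_cost, end_cost, elapsed):
    """Time for the cost to halve, assuming a constant exponential decline.

    The result is in the same units as elapsed.
    """
    if not (start_cost > end_cost > 0):
        raise ValueError("expected start_cost > end_cost > 0")
    fold_change = start_cost / end_cost
    return elapsed * math.log(2) / math.log(fold_change)


# US$50,000 falling to US$1,000 over 24 months is a 50-fold drop.
months = halving_time(50_000, 1_000, elapsed=24)
print(f"Implied halving time: {months:.1f} months")  # about 4.3 months
```

Under that assumption the price would need to halve roughly every four months.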

The massive throughput of new sequencing technologies does come at a cost, however: second-generation sequence data is generated as a series of fragmentary reads that are shorter and typically contain substantially more errors than those generated by Sanger sequencing. Both the read length and error rates of the new technologies are rapidly improving (for instance, read lengths for the Illumina platform have now increased from 35 bases to over 100 bases), but both still result in considerable informatics challenges during downstream analysis. While improvements to the existing second-generation platforms will certainly yield increasingly accurate data, it is likely that the largest leaps in quality will come from the development of entirely novel technological approaches.
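A rough sense of what short reads mean in practice comes from the standard coverage relation: average coverage equals the number of reads times the read length, divided by the genome size. The sketch below assumes a genome of about 3 billion bases and a target of 30-fold coverage; both figures are illustrative assumptions rather than values from this article, and the function name is hypothetical.

```python
import math


def reads_needed(genome_size, read_len, coverage):
    """Number of reads for a target average coverage.

    Uses coverage = n_reads * read_len / genome_size, rounded up.
    """
    return math.ceil(coverage * genome_size / read_len)


GENOME_SIZE = 3_000_000_000  # assumed approximate human genome size in bases
for read_len in (35, 100):
    n = reads_needed(GENOME_SIZE, read_len, coverage=30)
    print(f"{read_len}-base reads: {n:,} reads for 30x coverage")
```

At the same coverage, moving from 35-base to 100-base reads cuts the number of fragments that must be mapped by nearly two-thirds, which is one reason read length matters for downstream informatics.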

Third-generation sequencing platforms

Several sequencing platforms currently in development promise even greater advances in throughput and resolution. These are based on more diverse chemistries than second-generation platforms, but can be broadly characterised as offering two major advantages over currently commercially available platforms: substantially longer read lengths, and direct analysis of single DNA molecules.

A thorough comparison of third-generation sequencing technologies is difficult due to the rate of progress in the field and the limited information on operational performance currently available in the public domain. However, several emerging third-generation platforms are worth mentioning to highlight the diverse approaches being taken to the generation of DNA sequence: Pacific Biosciences (http://www.pacificbiosciences.com/), whose technology is based on real-time visualisation of the incorporation of fluorescently labelled bases into a single, immobilised DNA molecule; Oxford Nanopore Technologies (http://www.nanoporetech.com/), who rely on detecting the sequential passage of cleaved nucleotides from a DNA strand through a protein nanopore acting as an electrical sensor; and ZS Genetics (http://www.zsgenetics.com/), who plan to visualise DNA strands directly using electron microscopy.

It is currently unclear which of these approaches, if any, will ultimately become the default technology for large-scale DNA sequencing. However, it is clear that the development of any technology capable of generating very long independent reads from single molecules will substantially improve our ability to sequence human genomes: current short-read technologies are incapable of producing reliable sequence for the 10-15% of the human genome contained within highly repetitive regions, and also provide limited information about which of the two homologous chromosomes in an individual carries a particular variant (so-called haplotypic phase). Complete reconstruction of individual genomes thus awaits the development of long-read single-molecule approaches.
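The link between read length and haplotypic phase can be made concrete with a small example: a single read can place two heterozygous sites on the same chromosome copy only if it spans both. The sketch below deliberately ignores paired-end information and sequencing errors; the positions and the function name are hypothetical.

```python
def phasable_pairs(het_positions, read_len):
    """Return adjacent heterozygous sites that one read could span.

    Two sites at positions p < q fit within a single read of length
    read_len only if q - p < read_len.
    """
    positions = sorted(het_positions)
    return [
        (p, q)
        for p, q in zip(positions, positions[1:])
        if q - p < read_len
    ]


het_sites = [100, 150, 2000, 2050, 9000]  # hypothetical variant positions

print(phasable_pairs(het_sites, read_len=100))
# [(100, 150), (2000, 2050)]
print(phasable_pairs(het_sites, read_len=10_000))
# [(100, 150), (150, 2000), (2000, 2050), (2050, 9000)]
```

With 100-base reads the five sites fall into disconnected blocks, whereas reads of 10,000 bases link every adjacent pair and so allow one continuous haplotype to be assembled.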

Applications of new sequencing technology in pharmacogenomics

Massively parallel sequencing technologies are already altering almost every field of genetics, and pharmacogenomics will be no exception.

The most obvious application of new sequencing technologies in pharmacogenomics is the discovery of novel genetic variants that may influence drug response. Recent genome-wide association studies of drug responses (e.g. statin-induced myopathy [7] and stable warfarin dose [8]) have revealed genetic variants explaining a surprisingly large proportion of the population variance in drug response. However, much residual variance remains to be explained in these and other pharmacologically relevant traits, and genome-wide association studies performed to date have only been well-powered to detect associations with common variants present at a population frequency of greater than 5% [8].

It is likely that some non-trivial fraction of the population variance in drug response is due to genetic variants at a frequency below 5%, which may individually have large effects on drug efficacy and toxicity. Discovering all such variants in the population will ultimately require deep resequencing studies in which large numbers of individuals with varying drug responses are analysed using DNA sequencing technologies – initially characterising targeted regions of the genome with a high prior probability of playing a role in drug response (e.g. cytochrome P450 genes), and eventually expanding to analysis of complete genome sequences as the cost of sequencing drops.
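The 5% threshold discussed above refers to the frequency of the less common allele in the population. A minimal sketch of how that frequency is computed from genotype counts at a biallelic site, and how a variant would be classed as common or rare, is shown below; the genotype counts and function names are invented for illustration.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency at a biallelic site from genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    alt_freq = (n_het + 2 * n_hom_alt) / total_alleles
    return min(alt_freq, 1 - alt_freq)


def is_common(maf, threshold=0.05):
    """Class a variant as common if its frequency exceeds the threshold."""
    return maf > threshold


# Hypothetical cohorts of 1,000 individuals each.
maf_a = minor_allele_frequency(n_hom_ref=900, n_het=95, n_hom_alt=5)
maf_b = minor_allele_frequency(n_hom_ref=980, n_het=20, n_hom_alt=0)
print(maf_a, is_common(maf_a))  # 0.0525 True
print(maf_b, is_common(maf_b))  # 0.01 False
```

The second variant is carried by only 20 of 1,000 individuals, which illustrates why large resequencing cohorts are needed before rare variants can be discovered and tested at all.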

Whole-genome sequencing also offers the potential of identifying a variety of other forms of genetic variation currently poorly captured by the chip-based platforms used for current genome-wide association studies: for instance, small insertions and deletions (“indels”), and larger rearrangements of DNA (so-called “structural variants”) involving the removal, duplication or inversion of thousands of bases of DNA. However, it should be noted that identifying both small indels and structural variants remains a non-trivial challenge with current short-read sequencing technologies, particularly in the highly repetitive regions of DNA where these variants are most common.

Moving beyond studies of genetic variation, advances in DNA sequencing technology will also permit fine-grained dissection of the dynamic processes involved in drug responses, such as gene expression and epigenetic modifications of DNA. Analysis of gene expression has already been transformed by the advent of whole-transcriptome sequencing, which allows relatively unbiased interrogation of the full range of RNA transcripts produced by a cell, as opposed to the subset of transcripts represented on microarray chips. It also provides direct information on alternative RNA splicing events that are difficult or impossible to capture using array-based methods [9]. Applying such high-resolution surveys to cells exposed to pharmaceutical agents or disease-causing agents raises the possibility of identifying novel, specific drug targets.

Whole-genome analysis of epigenetic modifications (including cytosine methylation and the placement of specific DNA-binding proteins such as histones) can also be performed using high-throughput sequencing technologies [10], raising the possibility of identifying all of the important epigenetic changes resulting from drug exposure or disease state. As our ability to specifically modify epigenetic states improves, such high-resolution maps will provide a framework for targeted interventions to reduce the effects of disease or counteract side-effects of existing medications.

Challenges of new sequencing technologies

The power of new sequencing technologies described above also brings with it considerable obstacles for new adopters. Currently, the establishment of sequencing facilities employing second-generation sequencing technologies requires heavy investment in purchasing sequencing equipment and associated infrastructure, recruitment of staff, and training. Once the equipment has been purchased the costs of maintaining and running a high-throughput sequencing facility are also extremely high; indeed, one of the problems with the new technologies is that the cost of each individual experiment can be very large, making trouble-shooting and methods development an expensive process.

However, perhaps the major challenges faced by any organisation seeking to employ these new sequencing technologies are informatic: the sheer scale of the data produced by current second-generation sequencing technologies is far greater than most research organisations are equipped to deal with, and developing the required infrastructure for data storage, processing and analysis represents a substantial fraction of the costs of these technologies. Even with substantial investment in informatics infrastructure, the volume of raw image data produced by the new technologies is often too large to archive and must instead be processed on-the-fly into digested formats. The routine discarding of raw data is one of the uncomfortable but unavoidable consequences of migrating into the new world of high-throughput sequencing.

Even with the appropriate hardware systems in place for coping with large-scale sequencing data, new users are faced with a bewildering array of both free and commercial packages for downstream analysis. While many of the packages that are currently most widely used for routine procedures such as mapping reads to the human genome [11] and for analysis of genetic variation or gene expression [12] are free, they can sometimes come with minimal documentation and assume substantial background knowledge. In addition, the rapidly evolving technology in the field means that analysis pipelines need to be constantly modified to deal with changing data formats and new algorithmic approaches, while ensuring that the ceaseless stream of new data rolling off sequencing platforms is not compromised. These challenges place a strain on even well-resourced informatics groups.

Finally, new users of high-throughput sequencing technologies need to be aware that these platforms bring with them brand new sources of bias and error, which must be carefully considered before drawing conclusions from the resulting data. Taking full advantage of quality control metrics and new tools for visualising data output will be crucial for any researcher seeking to use these technologies to gain insight into biology.

Conclusions

Advances in DNA sequencing technology promise nothing less than a transformation of many diverse areas of biology, allowing analyses of genetic variation, gene expression, DNA modification and other biological processes at unprecedented scale and resolution. This transformative potential also applies to the area of pharmacogenomics; however, researchers seeking to take full advantage of these rapidly evolving technologies will need to be mindful of the challenges ahead, particularly in terms of the infrastructure and expertise required for effective management of the massive volume of data generated by the new sequencing platforms.

References

1. Sanger, F. et al. 1977. Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265:687-695.
2. Lander, E.S. et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
3. Venter, J.C. et al. 2001. The sequence of the human genome. Science 291:1304-1351.
4. Wheeler, D.A. et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452:872-876.
5. Bentley, D.R. et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59.
6. McKernan, K.J. et al. 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two base encoding. Genome Res advance online publication.
7. SEARCH Collaborative Group et al. 2008. SLCO1B1 variants and statin-induced myopathy – a genomewide study. N Engl J Med 359:789-799.
8. Takeuchi, F. et al. 2009. A genome-wide association study confirms VKORC1, CYP2C9, and CYP4F2 as principal genetic determinants of warfarin dose. PLoS Genet 5:e1000433.
9. Wilhelm, B.T. et al. 2008. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453:1239-1243.
10. Lister, R. et al. 2008. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133:523-536.
11. Li, H., Ruan, J., and Durbin, R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18:1851-1858.
12. Fejes, A.P. et al. 2008. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 24:1729-1730.

Issue

Issue 4 2009
