Next generation sequencing: Application of next generation sequencing to preclinical cancer model profiling
Posted: 15 December 2013 | James R. Bradford
Preclinical cancer models allow us to gain insight into therapeutic potential and mechanism of anti-cancer agents early in the drug discovery process. Whilst traditional array-based approaches have made a significant contribution to the characterisation of these models, the advent of next generation sequencing has revolutionised genomic research and is anticipated to make a huge impact on our understanding of preclinical models, leading to more targeted therapies for cancer patients. This article provides an overview of next generation sequencing in the context of cancer model profiling and evaluates the choice of technologies available and their application to both in vitro and in vivo model characterisation.
Preclinical cancer models are critical to the development of anti-cancer therapeutics and advancing our understanding of cancer biology. They are consistently used as a platform to investigate therapeutic mechanism of action and identify potential biomarkers prior to clinical trials in which similar exploration is often complicated, unethical and expensive. Key to the continued relevance of preclinical models in early drug development is the provision of more detailed information on their molecular characteristics leading to deeper understanding of disease drivers, insight into the alignment of these models to human disease segments and provision of a richer pool of potential biomarkers. For this reason, an increasing number of pharmaceutical companies, academic institutions and model providers are turning to next generation sequencing (NGS) to profile both their in vitro and in vivo preclinical cancer models. NGS (also known as massively parallel nucleotide sequencing or second generation sequencing) involves high-throughput generation of gigabases of sequence data at a relatively low cost per residue. A variety of platforms exist, but all rely on the generation of a large number of relatively short sequences known as ‘reads’ that can then be aligned to a target database, or assembled de novo into contiguous sequences. As a result, the sequencing of whole transcriptomes, exomes and genomes has now become feasible causing a shift from more focused approaches such as capillary-based (Sanger) sequencing and DNA/RNA microarrays to comprehensive genome-wide analysis.
Whilst NGS has significantly impacted molecular biology in general, it has profound implications for understanding cancer and efforts to improve diagnosis and treatment. This is because the majority of cancers are triggered by an accumulation of genomic alterations such as single nucleotide variations (SNVs), copy number changes (amplifications / deletions), and chromosomal rearrangements (inversions / translocations). The high sequence coverage (repeated sequencing of the genomic locations under study) commonly achieved by NGS makes it particularly applicable to the detection of low frequency genomic alterations prevalent in heterogeneous cancer samples, novel chromosomal rearrangements and copy number alterations at high resolution.
NGS technologies for preclinical cancer model characterisation
Several NGS approaches are available to profile cancer models and the choice of technology will depend on the type and scope of question being asked, the anticipated lifespan of the data and the budget. A common practice is to use a combination of two or more approaches, exploiting the strengths of each to capture the most information at the lowest cost. Three of the most popular NGS technologies for cancer model profiling are briefly discussed below and summarised in Table 1.
Table 1: Common options for NGS characterisation of preclinical cancer models
Characteristics compared: coding SNVs/small indels; non-coding SNVs/small indels; structural variants and other rearrangements; splice isoform usage pattern; cost (example entry: £ / ££).
Table footnotes:
1 Refer to caveats in main text
2 Possible if it affects targeted exons and comparative data are available
3 If captured by structural rearrangement
4 Approximate costs per sample: £ = 200-500 GBP, ££ = 700-1,200 GBP, £££ = >2,000 GBP
Whole genome sequencing
Assuming sufficient coverage, whole genome sequencing provides the most complete profile of a cancer genome allowing the detection of SNVs, copy number changes and chromosome structure rearrangements in a single sequencing run. Since sequencing is not simply restricted to coding regions, whole genome sequencing allows discovery of mutations in regulatory regions such as promoters and enhancers, other non-coding regions such as microRNAs, as well as previously unexplored loci. Copy number changes can also be detected at high resolution with clear breakpoint definition, removing the need for an additional array-based copy number detection experiment. Despite these benefits, whole genome sequencing can be expensive compared to more targeted sequencing approaches due to the amount of sequencing required to achieve robust statistical confidence in aberration calls, especially for cancer genomes in which sample heterogeneity and ploidy need to be accounted for. In common with other sequencing approaches, whole genome sequencing can also suffer from potential sequence bias at GC-rich regions, and repetitive sequences are problematic due to reduced probability of achieving unique read alignment at these loci.
Targeted sequencing
Targeted sequencing offers increased sequence coverage at regions of interest at lower cost than whole genome sequencing. Most methods involve a capture step in which DNA or RNA baits hybridise to, and enrich for, specific regions of interest in the total pool of nucleic acids. These regions are then amplified and undergo massively parallel sequencing. Whilst any fraction of the genome can be targeted, including non-coding regions, the most common approaches target either a small panel of genes of specific interest (targeted deep sequencing) or the exome (whole exome sequencing).
Targeted deep sequencing provides extremely high sampling of a small fraction of the genome resulting in statistically robust aberration calls and low frequency mutation detection albeit across a limited number of regions of interest. In theory, any location can be targeted, including non-coding loci, and fusion detection is possible if the breakpoint is known. More common is the use of a standard cancer panel such as Illumina’s TruSeq Amplicon Cancer Panel, or Life Technologies’ Ion AmpliSeq Cancer Hotspot Panel, designed to cover mutational hotspots across 48 and 50 oncogenes and tumour suppressor genes respectively. If budget is limited, such panels are usually sufficient to provide a high confidence set of somatic mutation calls across an established cancer associated gene set. Recent innovations such as Agilent Technologies’ Haloplex allow custom design of larger panels comprising 200-500 genes and, whilst generally more expensive than standard panels, offer a lower cost alternative to exome sequencing for more hypothesis-led exploration.
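The link between sequencing depth and low frequency mutation detection can be illustrated with a simple binomial model. The following is a minimal sketch rather than a description of any particular mutation caller: it assumes that each read samples the variant allele independently, ignores sequencing error, and uses a hypothetical calling rule that requires at least `min_alt_reads` supporting reads. The function name, the default threshold and the example depths are illustrative assumptions.

```python
from math import comb


def detection_probability(depth: int, allele_fraction: float, min_alt_reads: int = 5) -> float:
    """Probability of observing at least `min_alt_reads` variant reads at a site.

    Assumes reads sample the variant allele independently (binomial model)
    and ignores sequencing error; both are simplifications.
    """
    if not 0.0 <= allele_fraction <= 1.0:
        raise ValueError("allele_fraction must be between 0 and 1")
    # P(X >= k) = 1 - P(X <= k - 1) for X ~ Binomial(depth, allele_fraction)
    p_below = sum(
        comb(depth, i) * allele_fraction**i * (1.0 - allele_fraction) ** (depth - i)
        for i in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_below


if __name__ == "__main__":
    # Illustrative depths only, chosen to contrast shallow and deep sampling.
    for depth in (30, 100, 1000):
        p = detection_probability(depth, allele_fraction=0.05)
        print(f"depth {depth:>4}: P(detect 5% variant) = {p:.3f}")
```

Under these assumptions, a variant present in five per cent of reads is rarely supported by five or more reads at a depth of 30, but is almost always detected at a depth of 1,000, which is the regime targeted panels are designed to reach.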
Exome sequencing offers a broader targeted approach and since exons comprise only one per cent of the genome, uses considerably less raw sequence than whole genome sequencing to achieve equivalent coverage at lower cost. It therefore represents a cost effective alternative to whole genome sequencing for mutation detection across coding regions and is proving a popular option for model characterisation. Current limitations include potential inefficiency in the targeting process that can result in missed exons although this is expected to improve as the technology matures.
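The saving in raw sequence can be made concrete with some simple arithmetic: mean coverage is approximately the number of sequenced bases divided by the size of the target. The sketch below assumes a human genome of roughly 3.1 gigabases and uses the one per cent exome fraction quoted above; the `on_target_rate` parameter is a hypothetical way of representing capture inefficiency and is not a figure from this article.

```python
GENOME_SIZE_BP = 3_100_000_000  # approximate human genome size (assumption)
EXOME_FRACTION = 0.01           # "one per cent of the genome", as quoted in the text


def bases_required(target_size_bp: float, mean_coverage: float, on_target_rate: float = 1.0) -> float:
    """Raw bases needed for the target to reach `mean_coverage` on average.

    `on_target_rate` is the fraction of sequenced bases that fall on the
    target; 1.0 means perfect capture, which is an idealisation.
    """
    if not 0.0 < on_target_rate <= 1.0:
        raise ValueError("on_target_rate must be in (0, 1]")
    return target_size_bp * mean_coverage / on_target_rate


if __name__ == "__main__":
    genome = bases_required(GENOME_SIZE_BP, mean_coverage=30)
    exome = bases_required(GENOME_SIZE_BP * EXOME_FRACTION, mean_coverage=30, on_target_rate=0.6)
    print(f"Whole genome at 30x: {genome / 1e9:.1f} Gb of raw sequence")
    print(f"Exome at 30x (60% on target, hypothetical): {exome / 1e9:.2f} Gb of raw sequence")
```

Even with an imperfect capture step, the exome requires a small fraction of the raw sequence needed for the whole genome at the same mean coverage, and the difference can instead be spent on greater depth.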
Transcriptome sequencing
Whilst whole genome and targeted sequencing approaches sequence genomic DNA, transcriptome sequencing (also known as RNA-Seq) sequences cDNA derived from RNA species such as mRNA or miRNA. In transcriptome sequencing, the set of reads generated during a sequencing run is treated as an unbiased sampling of the total nucleotide complement of the cells, making it possible to use the number of reads aligning to a given transcript as a measure of its expression level. RNA-Seq offers several technical advantages over arrays, including greater sensitivity and dynamic range and the avoidance of probe effects. In addition to RNA quantification, RNA-Seq can be used to detect expressed transcript variants including splice isoforms and gene fusions. RNA-Seq may also be used as an alternative to exome sequencing for mutation calling, but this carries a number of caveats including lack of statistical power in calls from genes expressed at low levels, missed mutation calls in genes with undetectable expression, and false positive calls resulting from reverse transcriptase errors and RNA editing. Nevertheless, RNA-Seq offers rich information content at costs becoming increasingly competitive with array platforms.
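To show how read counts become an expression measure, the sketch below applies two widely used normalisations: counts per million (CPM), which corrects for sequencing depth, and transcripts per million (TPM), which additionally corrects for transcript length. The gene names, counts and lengths are hypothetical, and the article itself does not prescribe a particular normalisation.

```python
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def transcripts_per_million(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Length-normalise counts first, then scale the rates to sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("no reads were counted")
    return {gene: rate * 1e6 / total_rate for gene, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical genes: GENE_B attracts three times the reads of GENE_A
    # only because its transcript is three times as long.
    counts = {"GENE_A": 100, "GENE_B": 300}
    lengths = {"GENE_A": 1000, "GENE_B": 3000}
    print(counts_per_million(counts))
    print(transcripts_per_million(counts, lengths))
```

In this example CPM reports GENE_B as three times more abundant than GENE_A, whereas TPM assigns both the same value, illustrating why transcript length matters when comparing expression between genes.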
Considerations for NGS mutation calls across preclinical cancer models
Inherent in all NGS technologies is the potential for false positive and false negative outputs. For example, sequencing errors and read misalignments can result in false positive mutation calls, whereas false negative calls can result from insufficient coverage, particularly in more heterogeneous cancer samples. Both can be ameliorated by increased sequencing depth and, with appropriate parameters, most mutation calling software has some capability to distinguish sequencing errors from genuine mutations. Once a set of high confidence mutation calls has been established, a further challenge is to distinguish somatic from germline mutations. With clinical samples, the majority of germline mutations can be detected by comparing tumour and matched normal mutation calls. However, this is not usually possible with preclinical models, particularly cell lines. In these cases, one option is to compare the predicted mutations against public single-nucleotide polymorphism (SNP) databases such as dbSNP [1] or the 1000 Genomes Project [2] to remove previously described variants that occur naturally in the human population. However, these databases are becoming increasingly populated with somatic mutations, so some cancer-related mutations could be incorrectly discarded. An alternative approach is to consider allele frequency alongside the database searches. Within a sequenced sample, germline mutations have expected allele frequencies of either 50 per cent for heterozygous events or 100 per cent for homozygous events, whereas somatic mutation allele frequencies are influenced by tumour heterogeneity, ploidy and local copy number, resulting in allele frequencies anywhere between 0 and 100 per cent. Therefore, by removing only common SNPs, that is, those with a population minor allele frequency greater than one per cent, many germline mutations can be filtered out with minimal loss of somatic variants.
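The filtering rule described above, removing only common SNPs with a minor allele frequency greater than one per cent, can be sketched in a few lines. This is an illustrative sketch, not a production pipeline: the `VariantCall` record, its field names and the example variants are hypothetical, and it assumes that each call has already been annotated with its minor allele frequency from a SNP database (or `None` if it is absent from the database).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VariantCall:
    chrom: str
    pos: int
    ref: str
    alt: str
    population_maf: Optional[float]  # None when the variant is absent from the SNP databases


def filter_common_snps(calls: List[VariantCall], maf_threshold: float = 0.01) -> List[VariantCall]:
    """Discard calls matching a common SNP; keep rare and undescribed variants."""
    kept = []
    for call in calls:
        is_common = call.population_maf is not None and call.population_maf > maf_threshold
        if not is_common:
            kept.append(call)
    return kept


if __name__ == "__main__":
    # Hypothetical calls with made-up coordinates.
    calls = [
        VariantCall("chr1", 1000, "A", "G", population_maf=0.25),   # common SNP: removed
        VariantCall("chr2", 2000, "C", "T", population_maf=0.001),  # rare database entry: kept
        VariantCall("chr3", 3000, "G", "A", population_maf=None),   # not in database: kept
    ]
    for call in filter_common_snps(calls):
        print(call)
```

Keeping rare database entries is the point of the threshold: a variant that appears in a SNP database at very low frequency may be a somatic mutation that has been deposited there, so it is retained for further review rather than discarded.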
Finally, a useful method of establishing the clinical relevance of a somatic mutation detected in a preclinical model is to compare it with known mutations across clinical samples in databases such as The Cancer Genome Atlas [3].
Next generation sequencing of in vitro cancer cell line models
Cultured cancer cells remain the most commonly used preclinical models despite limitations such as sub-optimal modelling of the in vivo tumour microenvironment and the inability to study the effects of the body on drug distribution and metabolism. Large panels of cell lines have therefore been the subject of several profiling initiatives, each providing a comprehensive collection of information at the level of RNA and DNA together with drug-sensitivity profiles across hundreds of cell lines covering a range of cancer types. Whilst these initiatives have generated extensive array-based datasets, NGS is increasingly being exploited to supplement this already valuable information. For example, the Cancer Cell Line Encyclopedia [4] project used hybrid capture followed by targeted deep sequencing to detect mutations across a panel of 1651 genes in ~1000 cell lines. Whole exome sequencing data have been released across the NCI-60 [5] panel of 59 cell lines from nine different tissues and will soon be available across the Sanger cell line panel [6]. So far, mainly untreated cell lines have been characterised through these initiatives, although the number of smaller scale studies that have used NGS to detect dynamic markers of response to compound treatment and understand therapeutic mode of action continues to grow. Examples of these can be found in public databases such as the Gene Expression Omnibus [7] and the European Nucleotide Archive [8].
Next generation sequencing of in vivo cancer models
In vivo models such as xenografts established from either cancer cell lines or patient-derived tumour tissue are commonly used to model response to targeted therapeutics, and the intrinsic or acquired resistance mechanisms that can limit therapeutic benefit. Both offer several advantages over cell line cultures, such as more accurate modelling of the tumour microenvironment, drug metabolism and distribution. Patient-derived tumour models (or explants) provide additional benefits since these are not grown on plastic or adapted to culture conditions at any stage. As a consequence, many of the original tumour characteristics are retained, such as heterogeneity, clinical molecular signature and architecture, and as such they better represent the patient population. Furthermore, many explants can be established for disease segments not represented by cell lines. However, explant establishment is costly and often dependent upon histological type, therefore a bias exists towards high grade tumours and untreated patient samples. The genetic background of many explant models is poorly characterised and profiling efforts are hampered by samples containing a mixture of human tumour and surrounding mouse host tissue. To address the former, many explant providers are using NGS approaches to improve characterisation of their models, most commonly using targeted deep sequencing of a small panel of genes to identify driver mutations. Accurate separation of tumour and host has recently been demonstrated by RNA-Seq [9], making feasible the study of agents that impact both the tumour and stroma in a single sequencing run without the need for specialist experimental protocols to separate human and mouse genomic material. Differentiating the effects on the tumour and its surrounding tissue is critical to the development of a clinically relevant understanding of new therapeutic activity.
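One way to picture computational separation of tumour and host signal is as a per-read classification. The sketch below is an illustration under stated assumptions and not necessarily the method used in the study cited above: it assumes each read has been aligned separately to a human and a mouse reference, that each alignment yields a score where higher is better (`None` if the read did not align), and that a hypothetical `min_margin` decides when the difference is large enough to assign a species.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def assign_species(human_score: Optional[int], mouse_score: Optional[int], min_margin: int = 1) -> str:
    """Assign a read to the species whose reference it aligns to best."""
    if human_score is None and mouse_score is None:
        return "unaligned"
    if mouse_score is None:
        return "human"
    if human_score is None:
        return "mouse"
    if human_score - mouse_score >= min_margin:
        return "human"
    if mouse_score - human_score >= min_margin:
        return "mouse"
    return "ambiguous"  # too similar to call, e.g. a highly conserved region


def summarise(scores: Dict[str, Tuple[Optional[int], Optional[int]]]) -> Counter:
    """Count reads per category from a mapping of read ID to (human, mouse) scores."""
    return Counter(assign_species(human, mouse) for human, mouse in scores.values())


if __name__ == "__main__":
    # Hypothetical read IDs and alignment scores.
    scores = {
        "read1": (100, 60),     # clearly human (tumour)
        "read2": (55, 98),      # clearly mouse (host)
        "read3": (90, 90),      # equally good in both species
        "read4": (None, None),  # aligned to neither reference
    }
    print(summarise(scores))
```

Reads assigned to each species can then be counted separately, so that expression changes in the human tumour and the mouse stroma are quantified from the same sequencing run.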
The application of NGS to preclinical cancer model characterisation is still in its infancy with many more studies and innovations anticipated. For example, in addition to the technologies described above, NGS also offers the capability to characterise methylation, histone packaging and regulatory protein binding positions. Therefore, an important goal will be the systematic integration of such a broad spectrum of data from different NGS technologies enabling more accurate evaluation of a model’s clinical relevance through comparison with clinical samples that have undergone similar analyses. While sequencing costs continue to decrease, NGS profiling of preclinical cancer models is likely to become increasingly routine. This in turn presents a major challenge in the provision of sufficient computational infrastructure and domain knowledge to process and interpret the wealth of data, and build a more complete understanding of preclinical models which ultimately translates into therapeutic benefit for the cancer patient.
Thank you to Hedley Carr (AstraZeneca) for permission to use the original concept for Table 1.
1. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311
2. The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65
3. The Cancer Genome Atlas Research Network (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068
4. Barretina J. et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607
5. Abaan et al. (2013) The Exomes of the NCI-60 Panel: A Genomic Resource for Cancer Biology and Systems Pharmacology. Cancer Res. 73, 4372
6. Garnett M. J. et al. (2012) Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570–575
7. Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res. 41, D991–D995
9. Bradford et al. (2013) RNA-Seq differentiates tumour and host mRNA expression changes induced by treatment of human tumour xenografts with the VEGFR tyrosine kinase inhibitor cediranib. PLOS One, 10.1371/journal.pone.0066003
James Bradford gained a PhD from the University of Leeds in 2001 in developing novel approaches to study protein-protein interactions. He continued in Leeds as a post-doctoral researcher shifting focus to machine learning applications in gene function prediction motivated by data generation in genomics and proteomics. James then moved to the Paterson Institute for Cancer Research, Manchester where one of his primary roles was developing and implementing Next Generation Sequencing workflows leading to publication of the first RNA-Seq/Exon array platform comparison study. Since 2011, he has been a Senior Oncology Bioinformatics Scientist at Jameast driving new target research, and preclinical model and Next Generation Sequencing informatics capability builds.