Monday, 17 October 2016

SIRVs: RNA-seq controls from @Lexogen

This article was commissioned by Lexogen GmbH.

My lab has been performing RNA-seq for many years, and is currently building new services around single-cell RNA-seq. Fluidigm’s C1, academic efforts such as Drop-seq and inDrop, and commercial platforms from 10X Genomics, Dolomite Bio, Wafergen, Illumina/BioRad, RainDance and others makes establishing the technology in your lab relatively simple. However the data being generated can be difficult to analyse and so we’ve been looking carefully at the controls we use, or should be using, for single-cell, and standard, RNA-seq experiments. The three platforms I’m considering are the Lexogen SIRVs (Spike-In RNA Variants), or SEQUINs, or ERCC 2.0 (External RNA Controls Consortium) controls. All are based on synthetically produced RNAs that aim to mimic complexities of the transcriptome: Lexogen’s SIRVs are the only controls that are currently available commercially; ERCC 2.0 is a developing standard (Lexogen is one of the groups contributing to the discussion), and SEQUINs for RNA and DNA were only recently published in Nature Methods.

You can win a free lane of HiSeq 2500 sequencing of your own RNA-seq libraries (with SIRVs of course) by applying for the Lexogen Research Award

Lexogen’s SIRVs are probably the most complex controls available on the market today as they are designed to assess alternative splicing, alternative transcription start and end sites, overlapping genes, and antisense transcription. They consist of seven artificial genes in-vitro transcribed as multiple (6-18) isoforms to generate a total of 69 transcripts. Each has a 5’triphosphate and a 30nt poly(A)-tail, enabling both mRNA-Seq and TotalRNA-seq methods. Transcripts vary from 191 to 2528nt long and have variable (30-50%) GC-content.

Want to know more: Lexogen are hosting a webinar to describe SIRVs in more detail on October 19th: Controlling RNA-seq experiments using spike-in RNA variants. They have also uploaded a manuscript to BioRxiv that describes the evaluation of SIRVs and provides links to the underlying RNA-Seq data. As a Bioinformatician you might want to download this data set and evaluate the SIRV reads yourself. Or read about how SIRVs are being used in single-cell RNA seq in the latest paper from Sarah Teichmann’s group at EBI/Sanger.

Before diving into a more in-depth description of the Lexogen SIRVs, and how we might be using them in our standard and/or single-cell RNA-seq studies, I thought I’d start with a bit of a historical overview of how RNA controls came about...and that means going back to the days when microarrays were the tool of choice and NGS had yet to be invented!

RNA quality control – MAQC: The use of controls is recommended in any experiment, and the lack of them is one of the oft cited reasons for the current reproducibility crises. Nearly everyone who’s worked on differential gene expression in the last fifteen years has heard of the MAQC (MicroArray Quality Control) study. Although four sources of RNA were evaluated Stratagene’s Universal Human Reference RNA and Ambion’s Human Brain RNA samples were chosen because of the number of genes expressed at a detectable level, and the size of the fold changes between the two samples. These two control samples were used to evaluate five microarray platforms, in an international project involving 137 participants from 51 organisations (see Nat Biotech 2006). Labs like mine adopted, and continue to use the MAQC controls in our differential gene expression pipelines, which today are almost all based on RNA-seq methods. We used them in my lab to show how detection sensitivity drops as RNA inputs are reduced to under 100ng (something I keep meaning to repeat with RNA-seq).

The move to RNA-seq has had a dramatic impact on our ability to perform complex experiments. We are no longer limited to asking questions about the differential expression of genes where we have sequence information available to make an array. RNA-seq allows us to analyse the whole transcriptome; to assess differential gene expression (oligo-dT enriched mRNA-seq is the most widely used method), as well as differential splicing, allele specific expression, polyA tail length, transcription initiation and termination, microRNA, lincRNA, etc, etc, etc (see my "wish list" for controls at the bottom of this post).

The MAQC controls we used are simply not up to the more complex job that RNA-seq presents. Both the ABRF and SEQC papers used MAQC samples, which are admixtures of multiple individuals (I discussed these limitations in a 2014 post), but both included the ERCC controls as well.

Newer, more carefully designed and manufactured controls are available that can better serve the needs to biologists; and this is where SIRVs come in.

The SIRV workflow: from sample to answer
RNA quality control – Lexogen and beyond: SIRVs are designed to represent much of, but not all of, the complexity of Eukaryotic transcriptomes e.g. differential gene expression, differential splicing, polyA tail length variation, GC content, etc. SIRVs are designed to be added to samples before RNA extraction, or starting the RNA-seq library prep. They should allow an objective assessment of the technical biases in library preparation, sequencing and analysis; and ultimately should improve our ability to make biological insights from comparison of experimental conditions. They are a huge leap forward from the MAQC controls, and a significant step ahead of the ERCC1.0 controls, which are restricted to single-exon transcripts.

How are SIRVs made: SIRVs were designed to be similar to Human gene structures with overlapping multi-exon genes that are transcribed in both sense and antisense, with alternative splicing and alternative transcription start and end sites. Genes are in-vitro transcribed from linearized plasmids to produce full-length transcripts which are subject to very careful quality control and quantitation. This includes spectrophotometric, molecular weight, and Agilent Bioanalyser analyses. After QC and QT SIRV transcripts are mixed at equimolar concentrations (E0), or at 8-fold (E1) or 128-fold (E2) variations.

Designing SIRVs: A comparison of SIRV1 and KLK5

How are SIRVs used: Spiking SIRVs into your samples requires some careful consideration of how you’ll use the data they provide in downstream assessment. Today the most important control in my lab is simply whether the library prep has worked, or more importantly where it did not work whether it was the lab or the sample that was the cause of the failure. Our use of MACQ controls on a plate of samples is great, but extending this to an internal control in every sample is going to be better. However I don’t want controls to dominate the experiment or they’ll add too much to the costs of library preparation and sequencing.

SIRVs themselves don’t need much data to generate useful results and around 1% of your sequencing reads should be sufficient for most labs. However determining how much SIRV mix to add to your samples before extraction, or your RNA before library prep can require some empirical testing as the amount of RNA in a sample or a cell differs so much. As a rule of thumb 95% of RNA is ribosomal RNA’s, and the other 5% is mRNA (and non-coding RNAs). For an experiment starting with 100ng of TotalRNA in an mRNA-seq workflow approximately 50pg would represent 1% of the 5ng of mRNA present.

SIRVs are available in three configurations E0, E1 & E2 that mix the  in vitro transcribed RNAs at equimolar (mix E0), up to 8-fold (mix E1), or up to 128-fold (mix E2), variation in concentration. Importantly SIRVs are built in a modular format and should be compatible to other spike in controls like the ERCC. Additional modules should address transcript lengths, polyA tail length variation, etc.

Coinciding with the webinar on October 19th, Lexogen will release the “SIRVs suite” (see "How are SIRVs analysed" below) for analysis of spike-in data. This will also include an "Experiment Designer" tool to calculate recommended spike-in ratios based on known or expected input for the RNA content, mRNA ratio, and type and efficiency of the workflow.

SIRVs in bulk RNA-seq: Bulk RNA-seq experiments can use SIRVs as process controls in place of the MAQC Brain and UHRR samples allowing a full 96 samples to be run on each plate. Assuming the 100ng TotalRNA input then just 50pg of SIRVs are needed per sample, with 5ng added to the oligo-dT master-mix used in the enrichment step. The use of SIRV E0 is recommended for process QC, but E1 and E2 may be useful when evaluating new methods for accuracy and precision of differential transcript detection and quantitation.

SIRVs in scRNA-seq: Single-cell RNA-seq has quickly adopted spike-in controls with Hashimshony et al presenting their use of ERCC spikes in the CEL­Seq protocol. Both Wu et al 2013 and Truetlein et al 2014 used the ERCC mixes at a 1:40,000 dilution spiked into the cell lysis mix of the Fluidigm C1 protocol. And Svensson et al use the ERCC and SIRV spike­in's to assess sensitivity and accuracy of various protocols across a standard analysis pipeline. This demonstrates the utility of using RNA control spike-ins, but also the requirement for careful dilution to avoid swamping single-cell RNA-seq experiments with control data, or not having enough to QC data before interpreting results. Assuming each single cell has around 20pg of TotalRNA then just 200fg of SIRVs are needed per sample, the amount of SIRV added, and exactly where to add it the protocol is highly dependent on the single-cell RNA-seq protocol being used.

How are SIRVs analysed: Lexogen will release the Galaxy-based “SIRVs suite” for uploading, evaluating and comparing spike-in data. This will allow SIRV users to compare results from their experiments to anonymised data, and should help determine if your own experiment is any good. Back in 2003/4 I developed rptDB: a tool to compare QC data between Affymetrix arrays. This had over 3500 samples submitted to it, and allowed a quick easy call on whether your data was "good" or "bad" - highly context dependant of courrse! As a user if I had received data from a core lab or service provider, or were downloading RNA-seq data for meta-analysis, then being able to select only data where SIRV, or other, controls had been used, and where results were shown to be high-quality, would most likely save me considerable time in cleaning up data before starting.

SIRVS are not designed to be used as a normalisation tool. Whilst spike-ins have been considered they are not really reliable enough for standard normalisation procedures. The development of novel normalisation algorithms appears to offer hope for the future (see Risso 2014), and approaches like this might be applicable to SIRVs. I suspect this will be an active area of algorithm development over the next couple of years because of the huge interest in single-cell RNA-seq.

The competition: alternative RNA-seq controls
Sequins: ‘Sequins’ (sequencing spike-ins) were developed by the Garvan Institute and recently published in Nature Methods. Sequins are conceptually similar to SIRVs. They are a set of synthetic RNA isoforms that align to an artificial in silico chromosome, with no homology to known genomes. They represent full-length spliced mRNA isoforms, at a range of concentrations. They can be used to assess differential gene expression and alternative splicing pipelines. The authors state that sequins can by used for normalisation, and refer to the same Nature Biotech as I did above. In their Nature Methods paper they do show some very nice results from scaling normalisation using sequins and I hope these results will ultimately be achieveable with any well-designed spike-in series.

In the back-to-back Nature Methods publications the team at Garvan show how sequins can be used in RNA-seq and DNA-seq experiments to asses biases and determine the limits of detection, quantitation and analytical methods. Sequin genes are mixed in a two-fold serial dilution, with a minimum three genes per dilution, to span an ~106-fold range. The team also developed 24 Sequins to represent cancer fusion genes and used these to assess fusion gene detection and quantitation. They also reported that split reads significantly outperformed read-pairs in their correlation with Sequin concentration – this has a significant impact on the sequencing format as many groups today use paired-end reads where longer single-end reads may be more sensitive, and would also be around 40% cheaper.

ERCC 2.0: the original ERCC1.0 controls are a mix of 92 relatively simple single-exon transcripts of varying length and GC content. They are used in a mix at known concentrations spiked­into samples before library preparation. ERCC2.0 aims to update the spikes to better represent the complexity of the transcriptome, and to provide FFPE derived controls. Again they are are conceptually similar to SIRVs and Lexogen were one of 9 groups invited to present at the 2014 NIST ERCC2.0 workshop at Stanford University.

Conclusions: The use of controls in RNA-seq experiments is an absolute requirement if you want to get the best out of your experiments. Bulk RNA-seq can benefit from a relatively simple data QC of the controls before moving onto more complex differential gene expression and splicing analyses. And including spike-in controls may allow easier comparison of longitudinal data sets, or between labs. Single-cell RNA-seq has shown an absolute requirement to include spike-ins, although the very latest papers suggest that spiked-in transcripts may not truly mirror Human mRNAs in the protocols used, due to much shorter poly-A tails (30 vs 200+bp), and that they may underestimate detection sensitivity by up to ten-fold.

SIRVs, more recently SEQUINs, and soon ERCC2.0 controls can be further enhanced and manufacturers should not be consider their job complete! With protocols like Pacific Bioscience’s ISO-seq and the advent of Oxford Nanopores direct RNA-sequencing longer and longer transcripts could be assessed and this will need to be controlled. Phased sequencing, possibly from long RNA molecules on 10X Genomics, is likely to need controls where variants are phased. Additionally PacBio and Nanopore sequencing also offer the ability to detect and quantify RNA base modifications. All of this shows how far the controls we might use still have to go.

My RNA controls wish list:
  1. differential gene expression normalisation
  2. differential splicing
  3. allele specific expression
  4. transcript and polyA tail length variation
  5. GC content
  6. transcription initiation and termination
  7. non poly-adenylated RNAs e.g. microRNA, lincRNA
  8. pseudogene mapping
  9. limits of detection
  10. RNA variant detection at different MAF
  11. High-quality and degraded FFPE RNA
  12. Spike-in's with corresponding baits for in-solution capture
  13. Spike-in RNA encapsulated in synthetic cells
  14. Phased variants on long RNAs
  15. RNA base modifications

Please let me know what you’d like to add by leaving a comment below.


  1. The Sequins paper should have used TPM instead of RLE as expression levels.

  2. Appreciation is the key to doing more that is why I have took some time out to thank some one who cured me of my 2 years HEPATITIS B problem. It became a major problem to me as it was affecting my marital life and I was no longer comfortable so I decided to look for a solution and I came across a post of Dr Ariba and how he has been helping people of the same problem I contacted him and told him all I have been facing in my life. He told me how to get his product and how to take it after every thing I find out that all was now okay with me and that my HEPATITIS B problem was gone that is why I have come out today to say thank you to him . if you need cure for HIV/AIDS, Herpes, cancer or diabetes of all types you can contact his email OR
    Whatsapp: +2348140439497