RNA-Seq a hot topic at last month’s ISMB conference

Poster – A50
Gene expression analyses of RNA-Sequencing data across multiple cancers to identify Basal-like cancer subtypes
Kevin Thompson, Mayo Clinic, United States
Xiaojia Tang, Mayo Clinic, United States
Jason Sinwell, Mayo Clinic, United States
Peter Vedell, Mayo Clinic, United States
Travis Dockter, Mayo Clinic, United States
Vera Suman, Mayo Clinic, United States
James Ingle, Mayo Clinic, United States
Richard Weinshilboum, Mayo Clinic, United States
Judy Boughey, Mayo Clinic, United States
Liewei Wang, Mayo Clinic, United States
Matthew Goetz, Mayo Clinic, United States
Krishna Kalari, Mayo Clinic, United States
Short Abstract: Introduction:
Basal-like breast cancer is identified using expression profiling techniques, such as the PAM50 intrinsic signature model. Basal-like breast cancer is a subgroup of breast cancer associated with poor prognosis, defects in homologous recombination, sensitivity to platinum and PARP inhibitors, and resistance to standard chemotherapy. Recently, the PAM50 model was applied to 5 non-breast cancers, and identified basal-like populations. Characterizing basal-like core gene signatures could offer novel therapies for multiple populations of cancer patients.
Methods:
We obtained RNA-Seq gene expression data from The Cancer Genome Atlas data portal: 488 lung adenocarcinoma, 483 lung squamous, 262 ovarian, 584 breast, 497 head and neck, and 386 colon samples. The PAM50 intrinsic gene signature was used to elucidate basal-like cancers, which were substantial in head and neck (96%), squamous lung (88%), and ovarian (73%) cancers, in addition to the basal-like breast cancer (20%). Consensus clustering and cluster validation analysis were performed on each cancer cohort using the most variable genes.
Results and Future Work:
We have identified 1,214 basal-like cancers, representing 45% of the cancers and confirming that the intrinsic PAM50 gene signature elucidates basal-like cancers in non-breast cancer samples. The basal-like ovarian cancer cohort consisted of 2 clusters, while the other basal-like tumors had 3 clusters each. Correlations between cluster centroid prediction models remain to be established, and core gene signatures remain to be elucidated. These studies may identify non-breast cancers in which regimens with known efficacy in basal breast cancer could be tested.
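The 45% figure can be checked against the cohort sizes quoted in the abstract. The short sketch below only re-does that arithmetic; the variable names are illustrative and do not come from the poster.

```python
# Cohort sizes as quoted in the abstract (TCGA RNA-Seq samples).
cohort_sizes = {
    "lung adenocarcinoma": 488,
    "lung squamous": 483,
    "ovarian": 262,
    "breast": 584,
    "head and neck": 497,
    "colon": 386,
}

basal_like_total = 1214  # basal-like cancers reported in the abstract

total_samples = sum(cohort_sizes.values())
basal_fraction = basal_like_total / total_samples

print(f"total samples: {total_samples}")
print(f"basal-like fraction: {basal_fraction:.1%}")
```

The six cohorts sum to 2,700 samples, so 1,214 basal-like cancers correspond to roughly 45% of all samples, in line with the abstract.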

Poster – A70
Legacy Microarray Data in the RNA-Seq Era – A Biomarker Investigation
Zhenqiang Su, National Center for Toxicological Research of US FDA, United States
Hong Fang, NCTR/FDA, United States
Huixiao Hong, NCTR/FDA, United States
Leming Shi, Fudan University, China
Binsheng Gong, FDA/NCTR, United States
Joe Meehan, FDA/NCTR, United States
Joshua Xu, FDA/NCTR, United States
Weigong Ge, FDA/NCTR, United States
Roger Perkins, FDA/NCTR, United States
Weida Tong, FDA/NCTR, United States
Short Abstract: We systematically evaluated the transferability of predictive models and gene signatures between microarray and RNA-Seq using a large clinical data set. We demonstrated that predictive models and gene signatures are mutually transferable between microarray and RNA-Seq. The results suggest that existing microarray data can be used synergistically with RNA-Seq data.

Poster – B31
Correction of Expression Irregularity in RNA-Seq
Ehsan Tabari, University of North Carolina at Charlotte, United States
Zhengchang Su, University of North Carolina at Charlotte, United States
Short Abstract: High-throughput sequencing of RNA, RNA-Seq, provides unprecedented insight into transcriptome complexity. It has replaced earlier methods for measuring gene expression, is widely used to investigate non-coding RNA, and plays a major role in revealing tissue- and condition-specific alternative splicing in eukaryotes and alternative operons in prokaryotes. Most of the existing RNA-Seq analysis pipelines assume that RNA reads are uniformly distributed along a transcribed region. However, recent work has demonstrated that this assumption does not hold, since a variety of sources introduce bias in read distribution across sequencing protocols and species. Local GC content, cleavage, priming and adapter ligation preferences, and RNA secondary structures are possible causes of such bias. It has been shown that such biases drastically affect the transcriptome landscape, and that correcting for them produces better expression level correlation between replicate experiments. However, only a few methods have been introduced to address this issue, among which Cufflinks, mseq and Genominator are noteworthy. Here, we introduce a new computational model that detects and corrects the biases introduced in the experimental steps independently. We show that this multistep model outperforms existing approaches and improves downstream RNA-Seq analysis.

Poster – B40
RNA-QC-Chain: Comprehensive and fast quality control for RNA-Seq data
Kang Ning, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, China
Qian Zhou, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, China
Xiaoquan Su, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, China
Gongchao Jing, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, China
Short Abstract: RNA-Seq has become one of the most widely used applications based on next-generation sequencing. Quality control (QC) is the critical first step to ensure that downstream analysis yields reliable RNA-Seq results. Here we report RNA-QC-Chain, a parallel and complete QC solution specifically designed for RNA-Seq data. RNA-QC-Chain performs data QC on three levels (modules): (1) read-quality assessment and trimming; (2) detection and filtration of rRNA reads and possible contamination; (3) alignment quality assessment (including read number, alignment coverage, sequencing depth, alignment region and paired-end read mapping statistics). RNA-QC-Chain is very fast, since most of the QC procedures are optimized for parallel computation.

Poster – C21
Transcriptome Profiling of Rattus norvegicus Embryonic Stem Cells by RNA-sequencing
Nathan Johnson, University of Missouri, United States
Nripesh Prasad, University of Alabama/Hudson Alpha Institute of Biotechnology, United States
Andi Dhroso, University of Missouri/MU Informatics Institute, United States
Elizabeth Bryda, University of Missouri/College of Veterinary Medicine/Rat Resource Center, United States
Dmitry Korkin, University of Missouri/MU Informatics Institute/Life Science Center, United States
Shawn Levy, Hudson Alpha Institute of Biotechnology, United States
William Spollen, University of Missouri DNA Core, United States
Short Abstract: The first Rattus norvegicus (rat) embryonic stem cells (rESCs) were isolated in 2008 and they promise to become an important tool for producing genetically engineered rat models for biomedical research. Despite their usefulness, little characterization of rESCs has been done and the transcriptome has not been defined. Deep RNA sequencing (RNA-seq) analysis was performed on mRNA from DAc8, the first male germline-competent rat ESC line to be described and the first to be used to generate a knockout rat model. Furthermore, Homo sapiens and Mus musculus ESC transcriptomes were determined to gain insight into ESC expression patterns across species. The Avadis and Tuxedo pipelines were used to quantify all three species’ transcriptomes. Using orthologues of all three species, ToppFun was used to determine significant gene ontology (GO) terms and pathways. Oct4 expression has been demonstrated to be imperative for maintaining the ESC state. Using an experimentally determined mESC Oct4 interaction network, all three species’ transcriptomes were compared. To understand the species’ differences in the Oct4 interaction network, DOMMINO was used to explore protein binding differences. The gene expression profile of these rESCs was determined, novel isoforms were identified, and expression profiles for human, mouse, and rat ESCs were compared. In summary, mouse and human ESCs expressed ~50% more transcripts than rESCs, and several genes associated with maintaining the ESC state were not expressed in rESCs.

Poster – F05
Boosting RNA-Seq analysis of alternative splicing via high-precision exon junction detection
Alberto Gatto, Centro Nacional de Investigaciones Cardiovasculares, Spain
Carlos Torroja-Fungairiño, Centro Nacional de Investigaciones Cardiovasculares, Spain
Fatima Sánchez-Cabo, Centro Nacional de Investigaciones Cardiovasculares, Spain
Enrique Lara-Pezzi, Centro Nacional de Investigaciones Cardiovasculares, Spain
Short Abstract: Analysis of alternative splicing is one of the most challenging applications of RNA-Seq. In spite of the large number of available tools, systematic benchmarking efforts showed that splice junctions false discovery rates, annotation usage and multi-mapping reads remain critical issues in spliced reads alignment. We aimed at dissecting the problem by assessing the splice-site mapping, detection and quantification performance under a variety of experimental designs, alignment strategies and annotation sets. Based on the results, we propose a pipeline to bridge the gap between the superior mapping accuracy of transcriptome-first alignment and the low false-positive rates of intron-centric approaches. Our strategy constrains the detection problem at the post-processing stage, by coupling TopHat2 to a novel method, FineSplice, that allows identifying unreliable gapped alignments and filtering out false-positive junctions via semi-supervised logistic regression. We further show how this strategy can benefit RNA-Seq analysis of alternative splicing and improve isoform-level quantification.

Poster – F30
An RNA-Seq transcript quantification method that is robust to sequencing biases
Bo Li, University of California at Berkeley, United States
Lior Pachter, University of California at Berkeley, United States
Short Abstract: RNA sequencing (RNA-Seq) enables the study of transcriptomes in an unprecedented way. One of its important applications is transcript abundance quantification. The most common assumption made by existing tools is that the RNA-Seq reads are sequenced uniformly across the transcriptome. However, in real data this assumption is often violated due to various biases introduced in sequencing library preparation steps, such as PCR amplification and reverse transcription. Most available bias correction methods are designed for a specific type of bias and therefore are not general enough. Building off of existing quantification methods, we present a novel method that is robust to sequencing biases. Unlike many other approaches addressing the issue of bias, our approach can handle multi-mapping reads, which is important for RNA-Seq transcript quantification. Furthermore, it does not rely on strong modeling assumptions. Using simulated data sets, we showed that our approach outperforms existing quantification tools significantly when strong sequencing biases are present.

Poster – F45
Discriminating alternative transcripts in quantitative expression profiling by RNA-Seq and high-density microarrays – insights from a controlled multi-site cross-platform study by the FDA SEQC/MAQC-III consortium
David Kreil, Boku University Vienna, Austria
Pawel Labaj, Boku University Vienna, Austria
The SEQC Consortium, US FDA and others,
Short Abstract: A major promise of RNA-seq is the extension of expression profiling to the discovery and quantification of alternative transcripts. For transcript-specific profiling, however, no large-scale expression data from other technologies are available as an external reference point. We here present results from a multi-site cross-platform comparison of transcript-specific measurements.
We focused on a test set of 782 genes with multiple alternative transcripts of varying complexity and specifically selected to represent the full subset of spliced genes annotated in the AceView database. Covering 5,691 alternative transcripts, this test set allows a first comparison of transcript-specific expression level estimates from RNA-seq and high-resolution transcript-level microarray data. To this end, we combined multiple metrics for a robust characterization of platforms, sites, and data processing options. This is necessary because each metric shows a different and platform specific response to signal strength. For RNA-seq the response increases with transcript expression level and read depth. The read depth at which average RNA-seq performance meets or exceeds that of another platform thus directly depends on the examined metric and the distribution of expression strength and differential signal in the samples measured.
We found that efficient transcript-specific measurements with good precision on microarrays for quantitative expression profiling could complement the power of RNA-seq in the discovery and identification of new alternative transcripts. In other words, the novel transcripts found by RNA-seq can lead to efficient measurements with good precision on microarrays, which can in turn aid in the confirmation and functional study of new transcript variants.

Poster – F46
Improving the reliability of RNA-Seq differential expression analysis – a controlled multi-site cross-platform study by the FDA SEQC/MAQC-III consortium
Pawel Labaj, Boku University Vienna, Austria
David Kreil, Boku University Vienna, Austria
Short Abstract: In the US FDA-led SEQC (i.e., MAQC-III) project, different sequencing platforms were tested across more than ten sites using well-established reference RNA samples with built-in truths in order to assess the discovery and expression-profiling performances of platforms and analysis pipelines.
Studies on microarrays have shown that results of typical statistical differential expression tests thresholded by p-value need to be filtered and sorted by effect strength (fold-change) in order to attain results that are robust across platforms and sites. We have shown that in RNA-seq studies a similar approach is also required. For RNA-seq, removing small fold-changes as well as excluding low-expression measurements reduced the false discovery rate considerably and, in general, gave an improvement over microarrays at similar sensitivity. These filters also achieved good inter-site agreement of lists of differentially expressed genes, with the performance of several (but not all) RNA-seq pipelines becoming comparable to that of microarrays. Even though a direct comparison of absolute expression levels across platforms was not possible, the filters yielded good agreement of differential expression calls between platforms (for example, A vs B on HiSeq 2000 compared to A vs B on SOLiD), suggesting that differential expression analyses from different platforms could be combined – for example, to extend existing studies with additional samples.
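The filtering strategy described above (p-value threshold, then removal of small fold-changes and low-expression measurements, then sorting by effect strength) might be sketched as follows. This is a minimal illustration, not the consortium's pipeline: the `GeneResult` type, the `filter_calls` function, the pseudocount, and all threshold values are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str
    p_value: float
    mean_a: float  # mean expression in condition A (arbitrary units)
    mean_b: float  # mean expression in condition B


def filter_calls(results, max_p=0.05, min_abs_log2_fc=1.0, min_expr=10.0):
    """Keep genes passing the p-value, fold-change, and expression filters,
    sorted by decreasing absolute log2 fold-change.

    All cutoffs are hypothetical defaults, not values from the study.
    """
    kept = []
    for r in results:
        # Exclude low-expression measurements (both conditions below cutoff).
        if max(r.mean_a, r.mean_b) < min_expr:
            continue
        # A pseudocount of 1 avoids division by zero for unexpressed genes.
        log2_fc = math.log2((r.mean_b + 1.0) / (r.mean_a + 1.0))
        if r.p_value <= max_p and abs(log2_fc) >= min_abs_log2_fc:
            kept.append((r.gene, log2_fc))
    # Sort by effect strength rather than by p-value.
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


results = [
    GeneResult("g1", 0.001, 100.0, 450.0),  # significant, large effect
    GeneResult("g2", 0.001, 100.0, 110.0),  # significant, tiny effect
    GeneResult("g3", 0.001, 1.0, 6.0),      # large ratio, low expression
    GeneResult("g4", 0.200, 100.0, 400.0),  # large effect, not significant
    GeneResult("g5", 0.010, 800.0, 90.0),   # significant, strong decrease
]
calls = filter_calls(results)
print([gene for gene, _ in calls])
```

In this toy input only `g1` and `g5` survive all three filters, and `g5` is ranked first because its absolute log2 fold-change is larger.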

Poster – F47
Sequencing Quality Control (SEQC) – An FDA-led consortium effort for assessing RNA-Seq
Weida Tong, NCTR/FDA, United States
Short Abstract: Emerging methodologies such as next-generation sequencing contribute to our understanding of disease and health. Rapid progress over the last few years has moved these technologies from an exploratory to an applied stage, and an increasing amount of data derived from such approaches is received by regulatory agencies as supporting evidence for the safety and efficacy of new medical products. This realization has spawned a number of FDA efforts to utilize these technologies through integrated bioinformatics within inter-center and cross-community collaborations. This presentation discusses how the FDA-led, community-wide MicroArray Quality Control (MAQC) project attempts to address the technical performance issues for these emerging biomarker technologies. Specifically, the third phase of MAQC, also known as the SEquencing Quality Control (SEQC) project, developed a comprehensive plan to assess the power and limitations of NGS, with a substantial effort to compare RNA-Seq with microarrays (a mature transcriptomic technology). The project involved >200 participants from >80 organizations. Importantly, the project generated large RNA-Seq data sets covering a broad range of biological samples (human, rat and reference samples). Many critical issues in applying RNA-Seq to clinical and safety evaluation were evaluated and discussed with these datasets. This presentation will provide an overview and the main conclusions of the SEQC project.

Poster – F48
SEQC Evaluation of the Performance of Microarrays and RNA-seq
Wenzhong Xiao, Stanford/MGH, United States
Weihong Xu, Stanford, United States
Anthony Schweitzer, Affymetrix, United States
Leming Shi, Fudan U, China
Short Abstract:
Weihong Xu (1), Anthony Schweitzer (2), Leming Shi (3), SEQC consortium, Wenzhong Xiao (1, 4). (1) Stanford Genome Technology Center, (2) Affymetrix Inc, (3) Fudan University, (4) Massachusetts General Hospital

The goal of the SEQC consortium is to assess the technical performance of both platforms by generating benchmark datasets with reference samples and to evaluate advantages and limitations of various bioinformatics strategies in RNA and DNA analyses. Here we utilized a comprehensive RNA-Seq dataset of four titration pools from two human reference RNA samples generated by the SEQC consortium and systematically evaluated RNA-Seq and several commercially available microarrays (Affymetrix Hu133plus2, PrimeView, HuGene2.0) in terms of reproducibility, accuracy and detection power for both gene- and exon-level analyses. We found that different microarrays are comparable to RNA-Seq at different read depths, contingent on the performance metric. While with sufficient read depth RNA-Seq slightly outperforms microarrays for absolute quantification, both platforms are comparable for relative quantification. RNA-Seq shows a stronger dependence on expression level, while microarrays are generally reproducible across their whole dynamic range. Further analyses of the titration order and the linear relationship among mixture samples suggest that microarrays can recover the ground truth correctly. The new exon-junction array HTA2.0 shows competitive strength for exon-level analysis, matching RNA-Seq at 1-2 HiSeq lanes per sample.

Poster – G09
MITIE: Simultaneous RNA-Seq-based Transcript Inference and Quantification in Multiple Samples
Jonas Behr, Swiss Federal Institute of Technology Zurich, Switzerland
Andre Kahles, Memorial Sloan-Kettering Cancer Center, United States
Gunnar Rätsch, Memorial Sloan-Kettering Cancer Center, United States
Short Abstract: High-throughput sequencing of mRNA (RNA-Seq) has raised expectations of tremendous improvements in the detection of expressed genes and transcripts. However, the immense dynamic range of gene expression, biases from sequencing, library preparation and read mapping, and the unexpected complexity of the transcriptional landscape cause profound computational challenges. The latter can lead to a combinatorial explosion of the number of potential transcripts that can qualitatively explain the observed read data. To find the correct set of transcripts, long-range dependencies have to be resolved. Based on simple toy examples we can show that state-of-the-art tools fail to resolve these dependencies even if sufficient information is provided.
By treating the transcript recognition problem as a combinatorial optimization problem we open up a great arsenal of techniques that cannot be applied in continuous optimization. Firstly, a set of up to k transcripts which gives the optimal quantitative explanation for the observed RNA-Seq reads can be computed without enumerating all possible transcripts. Secondly, sparsity can be enforced by penalizing the number of transcripts needed to quantitatively explain the reads. Thirdly, we can share information among multiple RNA-Seq samples and thereby provably increase the power to resolve long-range dependencies.
These conceptual improvements translate to substantial gains in transcript recognition performance, which we show on carefully simulated reads for the human genome and in a retrospective study on the Drosophila modENCODE data set consisting of 53 RNA-Seq samples.

Poster – H03
Detection of fusion genes in RNA-seq data
Vladan Arsenijevic, Seven Bridges Genomics, United States
Nemanja Ilic, Seven Bridges Genomics, United States
Federica Torri, Seven Bridges Genomics,
Short Abstract: Recent advances in genomics have shown that gene fusions, also known as chimeras, play an important role in cancer development. In this work a thorough analysis was carried out to address questions that might help improve the robustness, sensitivity and overall performance of current fusion gene detection software, as well as how the detected chimeras can be represented. Different bioinformatics tools were tested using several RNA-seq samples, yielding, in some cases, significant discrepancies between the results. These differences have been attributed to different gene annotations used in the downsampling analysis.
Suggestions are made to alert the scientific community to these particular issues, which may lead to inconsistencies between the fusion genes identified by different tools.

Poster – H19
Extensive trans and cis-QTLs revealed by large scale cancer genome analysis of The Cancer Genome Atlas RNA-seq, WGS-seq and WXS-seq data
Kjong-Van Lehmann, Memorial Sloan Kettering Cancer Institute, United States
Andre Kahles, Memorial Sloan Kettering Cancer Institute, United States
Cyriac Kandoth, Memorial Sloan Kettering Cancer Institute, United States
William Lee, Memorial Sloan Kettering Cancer Institute, United States
Nikolaus Schultz, Memorial Sloan Kettering Cancer Institute, United States
Robert Klein, Memorial Sloan Kettering Cancer Institute, United States
Oliver Stegle, European Bioinformatics Institute, United Kingdom
Gunnar Rätsch, Memorial Sloan Kettering Cancer Institute, United States
Short Abstract: While population structure can be one of the most severe confounding factors in QTL analysis, tumor samples open up many new additional challenges. Tumor specific somatic mutations and recurrence patterns are known to explain large amounts of the observed transcriptome variation and sample heterogeneity can lead to spurious associations. Thus, we have developed a new strategy to perform a common variant association study (CVAS) using mixed models on tumor samples which enables us to account for tumor specific genotypic and phenotypic heterogeneity as well as population structure. We apply this strategy to investigate the relationship between germline and somatic variants as well as splicing patterns and expression changes in order to discover determinants of transcriptome variation. Due to sample size constraints, many QTL studies have been limited to the analysis of cis-associated variants. We use whole genome, exome and RNA-seq data from the TCGA project to overcome this limitation and discover trans-associated variants as well. We also investigate the effect of rare somatic variants that may have a significant effect on transcriptional and post transcriptional regulation. A rare variant association study (RVAS) using variants from whole genome and exome sequencing data is being utilized to investigate the basis of rare mutations. A decomposition of genomic covariances into trans and cis effects elucidates the importance of such factors across different cancer types which will not only improve our understanding of the molecular basis of cancer but may also provide new treatment targets.

Poster – N13
Context-based mapping of RNA-seq data with ContextMap 2.0
Thomas Bonfert, Institute for Informatics, Germany
Gergely Csaba, Institute for Informatics, Germany
Ralf Zimmer, Institute for Informatics, Germany
Caroline C. Friedel, Institute for Informatics, Germany
Short Abstract: Sequencing of RNA (RNA-seq) using next generation sequencing technology has effectively become the standard approach for profiling the transcriptomic state of a cell. This requires mapping of millions of sequencing reads to determine their transcriptomic origin. Recently, we developed a context-based mapping approach, ContextMap, which determines the most likely origin of a read by evaluating the context of the read in terms of alignments of other reads to the same genomic region. While the original implementation of ContextMap focused on improving mappings provided by other RNA-seq mapping tools, we recently extended this into a standalone version using a modification of the Bowtie short read aligner.
Here, we present ContextMap 2.0, an extension of the original ContextMap method, which now allows the use of alternative short read aligners without modification. Currently, ContextMap 2.0 explicitly supports Bowtie, Bowtie2 and BWA, but other short read alignment programs can be easily included in the ContextMap workflow. This allows the accuracy of RNA-seq mapping to be improved in a straightforward way by replacing the internal alignment program with improved short read alignment approaches.
While the initial ContextMap version was already very accurate compared to other state-of-the-art approaches, we now additionally improved accuracy by adding new mapping strategies to significantly reduce false discovery rates. Furthermore, sensitivity was increased by implementing novel methods to detect reads spanning over an arbitrary number of exons or containing insertions or deletions. Finally, the design of ContextMap 2.0 allows for massively parallelized data processing, resulting in reasonable running times despite the higher complexity of the context-based approach.

Poster – N20
Benchmark Analysis of Algorithms for Reconstructing Full Splice Forms from RNA-Seq
Katharina Hayer, University of Pennsylvania, United States
Angel Pizarro, University of Pennsylvania, United States
Gregory Grant, University of Pennsylvania, United States
John Hogenesch, University of Pennsylvania, United States
Short Abstract: A serious difficulty of RNA-Sequencing analysis derives from the highly fragmented nature of the data. It’s straightforward to quantify exons and junctions; however, combining this local information to assess expression of full-length splice forms is highly complex. There are published algorithms offering partial solutions, one of which, Cufflinks, is currently widely used. The accuracy of these methods is difficult to assess, due to a lack of benchmarks. However, simulated data can provide upper bounds on the accuracy. We have developed a simulator (BEERS), which is highly effective at assessing transcript reconstruction applications. The simulator mimics the discrete operations that produce paired-end reads. We have assessed the accuracy of Cufflinks and the other algorithms as functions of the number of expressed splice forms for a given gene. We first provide a 100% accurate alignment, to establish an upper bound on the accuracy, as well as alignments with TopHat, RUM and GSNAP. We present our findings as false-positive and false-negative rates with regard to transcript structure, and as assessments of the error of the estimated FPKM values. When there is only one splice form that is correctly annotated, abundantly expressed, sequenced without error, and perfectly aligned to a non-polymorphic genome, the algorithms are capable of detecting it. However, when factors are introduced such as incomplete annotation, multiple splice forms, or alignment artifacts, the accuracy drops precipitously. We conclude that the current published algorithms are probably not effective enough to be practical, underscoring the need for further algorithmic development and funding.
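One way to score reconstructed splice forms against simulated truth is sketched below. It is only an illustration of the idea, not the BEERS evaluation code: a splice form is represented as a tuple of exon `(start, end)` intervals, a prediction counts as correct only if it matches a true form exactly, and the two rates are defined as stated in the comments. Those definitions and the name `structure_error_rates` are assumptions made for this example.

```python
def structure_error_rates(true_forms, predicted_forms):
    """Compare predicted splice forms with the simulated truth.

    Assumed definitions for this sketch:
      false-positive rate = predicted forms absent from the truth / all predicted
      false-negative rate = true forms that were not predicted / all true forms
    """
    true_set = set(true_forms)
    pred_set = set(predicted_forms)
    false_pos = pred_set - true_set
    false_neg = true_set - pred_set
    fp_rate = len(false_pos) / len(pred_set) if pred_set else 0.0
    fn_rate = len(false_neg) / len(true_set) if true_set else 0.0
    return fp_rate, fn_rate


# Toy gene with two true splice forms: one full-length, one skipping exon 2.
truth = [
    ((100, 200), (300, 400), (500, 600)),
    ((100, 200), (500, 600)),
]
# A reconstruction that recovers the full-length form, misses the skipped
# form, and reports one form with a wrong exon boundary.
predicted = [
    ((100, 200), (300, 400), (500, 600)),
    ((100, 200), (300, 450), (500, 600)),
]

fp_rate, fn_rate = structure_error_rates(truth, predicted)
print(fp_rate, fn_rate)
```

Exact matching is deliberately strict; a real benchmark might instead compare intron chains or tolerate small differences at transcript ends.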

Poster – N43
Classify RNA-seq runs as origin organs or other features by using machine learning
Yasunobu Okamura, Tohoku University, Japan
Takeshi Obayashi, Tohoku University, Japan
Kengo Kinoshita, Tohoku University, Japan
Short Abstract: Published RNA-seq data are increasing rapidly. Although many RNA-seq and other short-read data sets are registered in SRA, their annotations are typically insufficient for re-analysis. Descriptions of runs, samples, experiments and studies are usually written in natural language, often in abbreviated form, rather than in a machine-friendly form. To perform large-scale re-analyses, such as meta-analysis or gene-coexpression analysis, comprehensive machine-friendly annotations are required.
In this study, we automatically classify RNA-seq runs, based on per-gene read counts, by features such as 1) organ, 2) cell line or 3) tumor status. Using a support vector machine on 574 runs from 17 organs, we predicted the organ correctly for 85.9% of runs. We applied a similar procedure to cell lines and tumor status. The original cell lines of 95.2% of 228 runs were correctly predicted, and cell status (tumor or normal) was correctly predicted in 90.1% of 111 runs. Our results are useful for large-scale re-analysis and annotation of RNA-seq data. For example, gene-coexpression in a single organ will be useful for understanding pathways and systems in that organ. We will also report the contribution of individual genes to the classification of the runs.

Poster – N49
Estimation of Isoform-specific and Allele-specific Expression from RNA-seq Data of Genetically Diverse Population
Kwangbom Choi, The Jackson Laboratory, United States
Gary Churchill, The Jackson Laboratory, United States
Narayanan Raghupathy, The Jackson Laboratory, United States
Steven Munger, The Jackson Laboratory, United States
Daniel Gatti, The Jackson Laboratory, United States
Short Abstract: The Diversity Outbred (DO) mouse population is a new heterogeneous stock derived from the eight Collaborative Cross (CC) founder strains. The DO mice have uniformly high levels of heterozygosity and genetic diversity, and thus provide a high-resolution mapping resource for identifying key genetic factors underlying complex traits and disease.

As each DO is a unique animal with a large number of SNPs and indels, aligning its RNA-seq reads to the reference is problematic. Misalignment due to unaccounted strain variation is common, and furthermore, it is difficult to derive accurate estimates of allele specific expression (ASE) from a single reference alignment strategy. For more accurate estimation of ASE in the DO, we align each read against one search space that includes custom transcript sequences from all eight CC founder strains. For each read, all and only the best alignments will be reported. In our EM-framework, we gradually reach the best posterior probability that describes where each read originates, based on the alignment profile of each read as well as the summary on how all reads align globally.
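The EM idea described above, in which reads with several equally good alignments are apportioned according to how all reads align globally, can be illustrated with a deliberately simplified sketch. It is not the authors' model: it ignores transcript length, alignment quality, and the isoform/allele hierarchy, and the target names `tx1_strainA` and `tx1_strainB` are hypothetical.

```python
def em_assign(read_alignments, targets, n_iter=50):
    """Estimate relative abundances of targets from reads that may have
    several equally good alignments (simplified EM, no length correction)."""
    # Start from a uniform abundance estimate.
    theta = {t: 1.0 / len(targets) for t in targets}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in targets}
        # E-step: split each read across its aligned targets in proportion
        # to the current abundance estimates.
        for hits in read_alignments:
            norm = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / norm
        # M-step: re-estimate abundances from the expected read counts.
        total = sum(counts.values())
        theta = {t: counts[t] / total for t in targets}
    return theta


targets = ["tx1_strainA", "tx1_strainB"]
reads = (
    [("tx1_strainA",)] * 30                   # reads covering a strain-A variant
    + [("tx1_strainB",)] * 10                 # reads covering a strain-B variant
    + [("tx1_strainA", "tx1_strainB")] * 60   # reads identical in both strains
)
theta = em_assign(reads, targets)
print({t: round(v, 3) for t, v in theta.items()})
```

In this toy case the allele-informative reads split 3:1, and the EM iterations distribute the 60 ambiguous reads in the same ratio, so the estimate settles near 0.75 for strain A and 0.25 for strain B.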

Just as we borrow information from across genes in many transcriptome analyses for a better estimation of expression mean and variance, we also borrow information on each transcript from across samples. The main idea is to balance individual and population-level estimates using shrinkage estimators. The degree of shrinkage is lower when dispersion is higher, since the data then suggest that each individual is less likely to follow a single population-level distribution.
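A generic shrinkage estimator with this behaviour can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the estimator used on the poster: the per-individual sampling variance is assumed to be known, and the dispersion is estimated as the excess of the observed variance over that sampling variance.

```python
from statistics import mean, pvariance


def shrink_estimates(individual, sampling_var):
    """Shrink per-individual estimates toward the population mean.

    `sampling_var` is the assumed (known, positive) noise variance of each
    individual estimate. The shrinkage weight falls as dispersion rises.
    """
    if sampling_var <= 0:
        raise ValueError("sampling_var must be positive")
    pop_mean = mean(individual)
    # Dispersion among individuals beyond what sampling noise explains.
    dispersion = max(pvariance(individual) - sampling_var, 0.0)
    # Weight on the population mean: 1 means full shrinkage, 0 means none.
    weight = sampling_var / (sampling_var + dispersion)
    shrunk = [weight * pop_mean + (1.0 - weight) * x for x in individual]
    return weight, shrunk


# Low dispersion: individuals look like draws from one population.
w_low, shrunk_low = shrink_estimates([10.0, 11.0, 9.0, 10.0], sampling_var=2.0)
# High dispersion: individuals clearly differ from one another.
w_high, shrunk_high = shrink_estimates([2.0, 18.0, 5.0, 15.0], sampling_var=2.0)

print(w_low, shrunk_low)
print(round(w_high, 3), [round(x, 2) for x in shrunk_high])
```

With low dispersion the estimates collapse onto the population mean, while with high dispersion the weight on the population mean is small and the individual estimates are left almost unchanged.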

Poster – N57
Normalization of RNA-Seq Data for Differential Expression Analysis
Gregory Grant, University of Pennsylvania, United States
John Hogenesch, University of Pennsylvania, United States
Katharina Hayer, University of Pennsylvania, United States
Emanuela Ricciotti, University of Pennsylvania, United States
Eun Ji Kim, University of Pennsylvania, United States
Elisabetta Manduchi, University of Pennsylvania, United States
Tilo Grosser, University of Pennsylvania, United States
Short Abstract: A “normalization problem” in RNA-Seq data analysis is any effect in the data that adversely affects the Type I or Type II error rate and that can be mitigated algorithmically. Due to the nature of sequencing data, as one feature contributes more reads, the rest of the features necessarily contribute fewer. Thus variability in any given feature introduces variability globally. The common normalization known as the FPKM only attempts to normalize for depth-of-coverage and feature length. For the application of differential expression analysis, depth-of-coverage is the only factor that’s being addressed by the FPKM normalization. We have identified seven additional factors that introduce considerable global variability into feature quantifications. These are: ribosomal content, mitochondrial content, the balance of exonic and non-exonic signal, the balance between intronic and intergenic signal, the fragment length distribution, the balance between unique and multi-mappers, and finally the 3′ biasedness of the signal. We further propose methods to normalize for all of these factors based on read resampling. We demonstrate, using a data set with six experimental conditions, each with eight biological replicates, the extent to which these factors contribute significantly to variation. We compare our method with the standard TopHat/Cufflinks/Cuffdiff pipeline and demonstrate a considerable gain in the power of the statistical analysis. Software to perform the analysis is freely available through a GitHub repository.
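The point that variability in one feature propagates globally can be seen directly in the standard FPKM formula (fragments per kilobase of feature per million mapped fragments). The sketch below uses made-up counts and lengths and a hypothetical helper name, and it does not implement the resampling method proposed in the poster.

```python
def fpkm(counts, lengths_bp):
    """FPKM = fragments * 1e9 / (feature length in bp * total fragments)."""
    total = sum(counts)
    return [c * 1e9 / (length * total)
            for c, length in zip(counts, lengths_bp)]


lengths = [1000, 2000]

# Baseline sample: feature 0 has 100 fragments, feature 1 has 300.
baseline = fpkm([100, 300], lengths)

# Same fragment count for feature 0, but feature 1 (think of ribosomal or
# mitochondrial signal) is ten times higher in this sample.
shifted = fpkm([100, 3000], lengths)
```

Feature 0 has the same raw count in both samples, yet its FPKM is much lower in the second one, purely because another feature changed the denominator.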

Poster – O30
Guidance for RNA-seq co-expression network construction and analysis: safety in numbers
Sara Ballouz, Cold Spring Harbor Laboratory, United States
Short Abstract: RNA-seq offers profound biological and technical advantages over microarray technologies, most usefully being able to detect the whole transcriptome. Although differential expression analysis is a more common means for interpreting transcriptomic data, co-expression analysis is far more routine in the context of meta-analysis, with thousands of expression profiles aggregated to generate robust signatures using repurposed data. Co-expression methods are already available to allow the meta-analysis of disparate datasets with quite different properties, subsuming most of the ambiguities that still exist in analyzing RNA-seq data.
Co-expression analysis is typically based on the correlation (or similar) of expression levels from microarray data sets and we applied a parallel approach to characterizing the extant public RNA-seq data. We used 50 separate RNA-seq experiments across 1,970 individual samples and a union of 30,705 RNA species (20,027 coding and 10,678 non-coding) to generate a reference co-expression network. Each node of the network represents an RNA species, and the edges are weighted by their correlation of expression.
We demonstrate that the network generated from RNA-seq encodes known biology (as captured by GO, KEGG and Reactome) through systematic recapitulation of functional connectivity. Perhaps surprisingly, we find the known dependency in microarray co-expression on sample sizes is almost identical in RNA-seq, suggesting hundreds to thousands of samples are necessary to obtain strong co-expression network performance. Further to this, we also find a high dependence of co-expression on the read depth per sample.
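The network construction described in this abstract (nodes are RNA species, edges are weighted by correlation of expression) can be sketched in a few lines. The version below assumes Pearson correlation on a tiny, made-up expression table with hypothetical gene names; the poster does not specify the correlation measure, and it aggregates networks across many experiments, which this sketch does not attempt.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expr):
    """Return {(gene1, gene2): correlation} for every pair of genes.

    expr maps a gene name to its expression values across samples.
    """
    return {(g1, g2): pearson(expr[g1], expr[g2])
            for g1, g2 in combinations(sorted(expr), 2)}


# Hypothetical expression values for three genes across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = coexpression_edges(expr)
```

With only four samples the correlations are trivially extreme, which is a small-scale reminder of the abstract's point that sample size strongly affects co-expression estimates.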

Poster – O55
Network-based Transcript Quantification with RNA-Seq Data
Wei Zhang, University of Minnesota Twin Cities, United States
Rui Kuang, University of Minnesota Twin Cities, United States
Jeremy Chien, University of Kansas Medical Center, United States
Baolin Wu, University of Minnesota Twin Cities, United States
Kay Minn, University of Kansas Medical Center, United States
Hui Zheng, Guangzhou Institutes of Biomedicine and Health, China
Lilong Lin, Guangzhou Institutes of Biomedicine and Health, China
Short Abstract: High-throughput mRNA sequencing (RNA-Seq) provides valuable information for accurate transcript quantification. In this project, we introduce a Network-based method for RNA-Seq-based Transcript Quantification (Net-RSTQ) to integrate protein domain-domain interaction information with short read alignment for transcript abundance estimation. Based on the observation that the abundances of the neighboring transcripts by domain-domain interactions in the network are positively correlated, Net-RSTQ models the expression of the neighboring transcripts as Dirichlet priors on the likelihood of the observed read alignments against the transcripts in one gene. The transcript abundances of all the genes are then jointly estimated with a heuristic alternating optimization algorithm. We demonstrate in the experiments that (1) qRT-PCR confirmed that Net-RSTQ achieves better transcript quantification accuracy with RNA-Seq data from a stem cell line and an ovarian cancer cell line compared with the models without using transcript network; and (2) the transcript abundances estimated by Net-RSTQ are more informative for patient sample classification tested on the RNA-Seq data of ovarian cancer, breast cancer and lung cancer in The Cancer Genome Atlas (TCGA). Availability: http://arxiv-web3.library.cornell.edu/abs/1403.5029

Poster – P09
Identifying Differential Expression in RNA-Seq Studies with Unknown Conditions
Thomas Unterthiner, Johannes Kepler University, Austria
Günter Klambauer, Johannes Kepler University, Austria
Sepp Hochreiter, Johannes Kepler University, Austria
Short Abstract: Detection of differential expression in RNA-Seq data is currently limited to studies in which two or more sample conditions are known a priori. However, these biological conditions are typically unknown in cohort, cross-sectional and nonrandomized controlled studies such as the HapMap, the ENCODE or the 1000 Genomes project. We present DEXUS for detecting differential expression in RNA-Seq data for which the sample conditions are unknown. DEXUS models read counts as a finite mixture of negative binomial distributions in which each mixture component corresponds to a condition. A transcript is considered differentially expressed if modeling of its read counts requires more than one condition. DEXUS decomposes read count variation into variation due to noise and variation due to differential expression. Evidence of differential expression is measured by the informative/noninformative (I/NI) value, which allows differentially expressed transcripts to be extracted at a desired specificity (significance level) or sensitivity (power). DEXUS performed excellently in identifying differentially expressed transcripts in data with unknown conditions. On 2400 simulated data sets, I/NI value thresholds of 0.025, 0.05 and 0.1 yielded average specificities of 92, 97 and 99% at sensitivities of 76, 61 and 38%, respectively. On real-world data sets, DEXUS was able to detect differentially expressed transcripts related to sex, species, tissue, structural variants or quantitative trait loci.
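The core modeling idea, that read counts needing more than one negative binomial component suggest more than one condition, can be illustrated by comparing likelihoods. The sketch below is not DEXUS and does not compute the I/NI value: it fixes the component means, weights, and size parameter by hand (all assumed for illustration) instead of fitting them, and it uses the mean/size parameterization of the negative binomial.

```python
from math import exp, lgamma, log


def nb_logpmf(k, mu, size):
    """Log-probability of count k under a negative binomial
    with mean mu and size (inverse dispersion) parameter size."""
    return (lgamma(k + size) - lgamma(size) - lgamma(k + 1)
            + size * log(size / (size + mu))
            + k * log(mu / (size + mu)))


def mixture_loglik(counts, weights, means, size):
    """Log-likelihood of counts under a finite negative binomial mixture."""
    total = 0.0
    for k in counts:
        total += log(sum(w * exp(nb_logpmf(k, mu, size))
                         for w, mu in zip(weights, means)))
    return total


# Made-up read counts for one transcript across eight samples whose
# conditions are unknown; four samples look low, four look high.
counts = [5, 6, 4, 5, 50, 52, 48, 51]

# One condition: a single component at the overall mean.
one_condition = mixture_loglik(counts, [1.0], [sum(counts) / len(counts)], 20.0)

# Two conditions: equally weighted components at assumed means 5 and 50.
two_conditions = mixture_loglik(counts, [0.5, 0.5], [5.0, 50.0], 20.0)
```

Here the two-component mixture explains the counts far better than a single component. A real method would also have to guard against overfitting, since adding components never lowers the maximized likelihood.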
