The surprising observation that virtually the entire human genome is transcribed means we know little about the function of many emerging classes of RNAs, except their astounding diversities. Traditional RNA function prediction methods rely on sequence or alignment information, which are limited in their abilities to classify the various collections of non-coding RNAs (ncRNAs). To address this, researchers from the University of Pennsylvania developed Classification of RNAs by Analysis of Length (CoRAL), a machine learning-based approach for classification of RNA molecules. CoRAL uses biologically interpretable features including fragment length and cleavage specificity to distinguish between different ncRNA populations. They evaluated CoRAL using genome-wide small RNA sequencing data sets from four human tissue types and were able to classify six different types of RNAs with ∼80% cross-validation accuracy. Analysis by CoRAL revealed that microRNAs, small nucleolar and transposon-derived RNAs are highly discernible and consistent across all human tissue types assessed, whereas long intergenic ncRNAs, small cytoplasmic RNAs and small nuclear RNAs show less consistent patterns. The ability to reliably annotate loci across tissue types demonstrates the potential of CoRAL to characterize ncRNAs using small RNA sequencing data in less well-characterized organisms.

Availability – The CoRAL source code, required genome annotation files, and prediction results are available at http://wanglab.pcbi.upenn.edu/coral.

  • Leung YY, Ryvkin P, Ungar LH, Gregory BD, Wang LS. (2013) CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res [Epub ahead of print]. [article]

Graphite web is a novel web tool for pathway analysis and network visualization of gene expression data from both microarray and RNA-seq experiments. Several pathway analyses have been proposed, either in the univariate or in the global and multivariate context, to tackle the complexity and the interpretation of expression results. These methods can be further divided into ‘topological’ and ‘non-topological’ methods according to their ability to gain power from pathway topology. Biological pathways are, in fact, not only gene lists but can be represented as a network in which genes are nodes and their connections are edges. To this day, the most widely used approaches are non-topological and univariate, although they miss the relationships among genes. In contrast, topological and multivariate approaches are more powerful, but difficult for researchers without bioinformatics skills to use.

Here, researchers from the University of Padova, Italy present Graphite web, the first public web server for pathway analysis on gene expression data that combines topological and multivariate pathway analyses with an efficient system of interactive network visualizations for easy results interpretation. Specifically, Graphite web implements five different gene set analyses on three model organisms and two pathway databases.

Availability – Graphite Web is freely available at http://graphiteweb.bio.unipd.it/.

  • Sales G, Calura E, Martini P, Romualdi C. (2013) Graphite Web: web tool for gene set analysis exploiting pathway topology. Nucleic Acids Res [Epub ahead of print]. [article]

Voom: variance modelling at the observation-level

In the past few years, RNA-seq has emerged as a revolutionary new technology for expression profiling. RNA-seq expression data consists of read counts, and many recent publications have argued therefore that RNA-seq data should be analysed by statistical methods designed specifically for counts. Yet all the statistical methods developed for RNA-seq counts rely on approximations of various kinds.

This article revisits the idea of applying normal-based microarray-like statistical methods to RNA-seq read counts, with the idea that it is more important to model the mean-variance relationship correctly than it is to specify the exact probabilistic distribution of the counts. Log-counts per million are used as expression values. The voom method estimates the mean-variance relationship robustly and generates a precision weight for each individual normalized observation. The normalized log-counts per million and associated precision weights are then entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision weight aware. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays, allowing RNA-seq and microarray data to be analysed in closely comparable ways. The performance of voom and related limma-based pipelines is compared to that of edgeR, DESeq, baySeq, TSPM, PoissonSeq, and DSS. Simulation studies show that voom out-performs previous RNA-seq methods even when the data is generated according to the assumptions of the earlier methods. This is especially true when the sequence depths vary between RNA samples. Several data sets are also analysed to demonstrate how voom can handle heterogeneous data and complex experiments as well as facilitating pathway analysis and gene set testing methods.
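To make the idea of observation-level precision weights concrete, here is a minimal Python sketch. It is not limma's voom implementation: the function names are hypothetical, a plain least-squares line stands in for the robust trend that voom fits, and the trend is evaluated at each observed log-CPM value rather than at fitted values from a linear model. It needs at least two samples per gene.

```python
import math

def log_cpm(counts, lib_sizes):
    """log2 counts per million, with small offsets so zero counts stay finite."""
    return [
        [math.log2((c + 0.5) / (n + 1.0) * 1e6) for c, n in zip(row, lib_sizes)]
        for row in counts
    ]

def fit_line(xs, ys):
    """Ordinary least-squares line (assumption: stands in for a robust trend)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def precision_weights(counts, lib_sizes):
    """Return (log-CPM matrix, weight matrix) for a genes x samples count table."""
    y = log_cpm(counts, lib_sizes)
    means, sqrt_sds = [], []
    for row in y:
        m = sum(row) / len(row)
        var = sum((v - m) ** 2 for v in row) / (len(row) - 1)
        means.append(m)
        sqrt_sds.append(math.sqrt(math.sqrt(var)))  # square root of the std. dev.
    slope, intercept = fit_line(means, sqrt_sds)
    weights = []
    for row in y:
        w_row = []
        for v in row:
            # Predicted sqrt-sd at this observation, floored to stay positive.
            pred = max(slope * v + intercept, 1e-3)
            w_row.append(1.0 / pred ** 4)  # inverse of the predicted variance
        weights.append(w_row)
    return y, weights

counts = [[10, 12, 9], [100, 120, 95], [1000, 1100, 980]]
lib_sizes = [1_000_000, 1_100_000, 900_000]
expr, weights = precision_weights(counts, lib_sizes)
```

The resulting weights could then be passed to any weighted least-squares routine, which is the sense in which the log-CPM values become usable by precision-weight-aware pipelines.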

With the availability of gene expression data by RNA-seq, powerful statistical approaches for grouping similar gene expression profiles across different environments have become increasingly important. A team led by researchers at Penn State University describe and assess a computational model for clustering genes into distinct groups based on the pattern of gene expression in response to changing environment. The model capitalizes on the Poisson distribution to capture the count property of RNA-seq data. A two-stage hierarchical expectation-maximization (EM) algorithm is implemented to estimate an optimal number of groups and mean expression amounts of each group across two environments. A procedure is formulated to test whether and how a given group shows a plastic response to environmental changes. The impact of gene-environment interactions on the phenotypic plasticity of the organism can also be visualized and characterized. The model was used to analyse an RNA-seq dataset measured from two cell lines of breast cancer that respond differently to an anti-cancer drug, from which genes associated with the resistance and sensitivity of the cell lines are identified. They performed simulation studies to validate the statistical behaviour of the model. The model provides a useful tool for clustering gene expression data by RNA-seq, facilitating understanding of gene functions and networks.
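As a rough illustration of clustering count profiles with a Poisson mixture, the sketch below runs a plain EM algorithm over genes measured in two environments. It is not the authors' two-stage hierarchical procedure: the number of groups is fixed by the caller instead of being estimated, and the function names and toy data are hypothetical.

```python
import math

def poisson_logpmf(x, lam):
    """Log of the Poisson probability of observing count x with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def em_poisson_mixture(data, init_means, n_iter=50):
    """Cluster count vectors; init_means holds one mean vector per group."""
    k = len(init_means)
    means = [list(m) for m in init_means]
    props = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each group for each gene.
        resp = []
        for gene in data:
            logp = [
                math.log(props[j])
                + sum(poisson_logpmf(x, lam) for x, lam in zip(gene, means[j]))
                for j in range(k)
            ]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: update mixing proportions and per-environment means.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            props[j] = max(nj / len(data), 1e-12)
            for e in range(len(means[j])):
                weighted = sum(r[j] * g[e] for r, g in zip(resp, data))
                means[j][e] = max(weighted / max(nj, 1e-12), 1e-9)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return props, means, labels

# Toy data: (count in environment 1, count in environment 2) per gene.
data = [(5, 50), (6, 48), (4, 55), (60, 5), (58, 6), (62, 4)]
props, means, labels = em_poisson_mixture(data, init_means=[(10, 40), (40, 10)])
```

A group whose two estimated means differ strongly, as both do in this toy data, is the kind of pattern the post describes as a plastic response to the environment.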

  • Wang N, Wang Y, Hao H, Wang L, Wang Z, Wang J, Wu R. (2013) A bi-Poisson model for clustering gene expression profiles by RNA-seq. Brief Bioinform [Epub ahead of print]. [abstract]

RNA-seq has shown huge potential for phylogenomic inferences in non-model organisms. However, error, incompleteness, and redundant assembled transcripts for each gene in de novo assembly of short reads cause noise in analyses and a large amount of missing data in the aligned matrix. To address these problems, we compare de novo assemblies of paired end 90 bp RNA-seq reads using Oases, Trinity, Trans-ABySS and SOAPdenovo-Trans to transcripts from genome annotation of the model plant Ricinus communis. By doing so we evaluate strategies for optimizing total gene coverage and minimizing assembly chimeras and redundancy.

Researchers at the University of Michigan found that the frequency and structure of chimeras vary dramatically among different software packages. The differences were largely due to the number of trans-self chimeras that contain repeats in the opposite direction. More than half of the total chimeras in Oases and Trinity were trans-self chimeras. Within each package, they found a trade-off between maximizing reference coverage and minimizing redundancy and chimera rate.

In order to reduce redundancy, they investigated three methods: Read more

The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled, or previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analysis such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example upon the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses.

Researchers at UC Berkeley have developed a novel efficient algorithm for updating abundance estimates from RNA-Seq experiments upon re-annotation that does not require re-analysis of the entire dataset. Their approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. They demonstrate the effectiveness of their methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, they provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised.
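One way to picture the partitioning step is as a connected-components problem: transcripts that share multi-mapping reads are linked, and only the components touching an added or deleted isoform need re-estimation. The Python sketch below illustrates that idea with a union-find structure. It is an assumption about the general approach, not ReXpress's actual algorithm, and the function name and input format are hypothetical.

```python
def affected_transcripts(read_alignments, changed):
    """Return transcripts linked, via shared reads, to any changed transcript.

    read_alignments: iterable of collections of transcript IDs, one per read.
    changed: transcript IDs that were added to or removed from the annotation.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for targets in read_alignments:
        targets = list(targets)
        for t in targets:
            find(t)  # register every transcript that has a read
        for other in targets[1:]:
            union(targets[0], other)  # a shared read links the transcripts

    roots = {find(t) for t in changed}
    return {t for t in list(parent) if find(t) in roots}

reads = [{"A", "B"}, {"B", "C"}, {"D"}, {"D", "E"}]
to_update = affected_transcripts(reads, changed={"C"})
```

In this toy example only A, B and C would need their abundances re-estimated; D and E share no reads with the changed transcript and keep their existing estimates.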

Availability – ReXpress is freely available, together with source code, at http://bio.math.berkeley.edu/ReXpress/

Contact: lpachter@math.berkeley.edu

  • Roberts A, Schaeffer L, Pachter L. (2013) Updating RNA-Seq analyses after re-annotation. Bioinformatics [Epub ahead of print]. [abstract]

DNA sequencing technology is becoming more accessible to a variety of researchers as costs continue to decline. As researchers begin to sequence novel transcriptomes, most of these datasets lack a reference genome and will have to rely on de novo assemblers. Making comparisons across assemblies can be difficult: each program has its strengths and weaknesses and no tool exists to comparatively evaluate these datasets.

Now, a team led by researchers at the University of Rhode Island have developed software in R, called Sequence Comparative Analysis using Networks (SCAN) to perform statistical comparisons between distinct assemblies. SCAN uses a reference dataset to identify the most accurate de novo assembly and the ‘good’ transcripts in the user’s data. They tested SCAN on 3 publicly available transcriptomes, each assembled using 3 assembly programs. Moreover, they sequenced the transcriptome of the oomycete Achlya hypogyna and compared de novo assemblies from Velvet, ABySS, and the CLC Genomics Workbench assembly algorithms. One thousand one hundred and twenty eight (1,128) of the CLC transcripts were statistically similar to the reference, compared to 49 of the Velvet transcripts and 937 of the ABySS transcripts. SCAN’s strength is providing statistical support for transcript assemblies in a biological context. However, SCAN is designed to compare distinct node sets in networks, therefore it can also easily be extended to perform statistical comparisons on any network graph regardless of what the nodes represent.

Availability – Two versions of SCAN were developed: “SCAN” and “SCAN stringent,” that can run either in single or multiprocessor nodes, and are available from http://evol-net.fr .

  • Misner I, Bicep C, Lopez P, Halary S, Bapteste E, Lane CE. (2013) Sequence Comparative Analysis using Networks (SCAN): software for evaluating de novo transcript assembly from next generation sequencing. Mol Biol Evol [Epub ahead of print]. [abstract]

From flybase.org

FlyBase is extending its initial gene-level analyses of RNA-seq throughput data from modENCODE and others. The algorithm for RPKM (reads per kilobase per million mapped reads) has been refined, additional datasets have been analyzed, and these data are now available for bulk download.

In order to summarize this type of data at the gene level, it is necessary first to determine a single value for the expression level of each gene for each RNA-seq sample. RNA-seq coverage data are intersected with FlyBase exons, based on the gene model annotations of the current release, to calculate a single value reflecting average coverage per kb per gene. Each gene data point is then classified into one of eight expression level bins, and the graphical and text summaries are produced from the binned values. A more detailed explanation may be found at FBrf0221009.
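For readers unfamiliar with the measure, the following Python sketch shows the standard RPKM definition and a simple binning step. It is only an illustration: FlyBase's refined algorithm works from coverage data and may differ in detail, and the bin edges used here are hypothetical placeholders, not FlyBase's published thresholds.

```python
from bisect import bisect_right

def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

def expression_bin(value, edges):
    """Index of the bin containing value; len(edges) + 1 bins in total."""
    return bisect_right(edges, value)

# Seven hypothetical edges give eight bins (0 = lowest, 7 = highest).
EDGES = [1, 4, 11, 26, 51, 101, 1001]

value = rpkm(read_count=500, exon_length_bp=2000, total_mapped_reads=10_000_000)
level = expression_bin(value, EDGES)
```

Here 500 reads on 2 kb of exon in a library of 10 million mapped reads gives an RPKM of 25, which falls in bin 3 under the placeholder edges.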

Bulk data files can be accessed from the Precomputed Data Files page (menu: Files → Current Release). Look in the Genes section; the item line is ‘RNA-Seq RPKM values’. You can download the file directly by clicking here.

Simple and combinatorial queries of RPKM expression data can be conducted using the ‘RNA-Seq Search’ option found under the ‘Expression’ tab in the Quick Search tool.

Isoform reconstruction is a key step in RNA-Seq analysis. Tools such as CEM, iReckon, NSMAP, and MonteBello use maximum likelihood for isoform reconstruction. The maximum likelihood approach has been observed to be computationally expensive. Here, researchers from Tsinghua University, China show that isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard.

  • Li T, Jiang R, Zhang X. (2013) Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard. arXiv:1305.0916 [q-bio.QM]. [article]

QualitySNPng is a new software tool developed at Wageningen University, The Netherlands, for the detection and interactive visualization of single-nucleotide polymorphisms (SNPs). It uses a haplotype-based strategy to identify reliable SNPs. It is optimized for the analysis of current RNA-Seq data, but it can also be used on genomic DNA sequences derived from next-generation sequencing experiments. QualitySNPng does not require a sequenced reference genome and delivers reliable SNPs for diploid as well as polyploid species. The tool features a user-friendly interface, multiple filtering options to handle typical sequencing errors, support for SAM and ACE files, and interactive visualization. QualitySNPng produces high-quality SNP information that can be used directly in genotyping-by-sequencing approaches for application in QTL and genome-wide association mapping, as well as to populate SNP arrays.

Availability – The software can be used as a stand-alone application with a graphical user interface or as part of a pipeline system like Galaxy. Versions for Windows, Mac OS X and Linux, as well as the source code, are available from http://www.bioinformatics.nl/QualitySNPng.

  • Nijveen H, van Kaauwen M, Esselink DG, Hoegen B, Vosman B. (2013) QualitySNPng: a user-friendly SNP detection and visualization tool. Nucleic Acids Res [Epub ahead of print]. [article]

MicroRNAs (miRNAs) are a class of small RNAs that post-transcriptionally regulate gene expression in animals and plants. The recent rapid advancement in miRNA biology, including high-throughput sequencing of small RNA libraries, inspired the development of a bioinformatics tool, miRAuto, which computationally predicts putative miRNAs in model plant genomes. Furthermore, miRAuto enables users to identify miRNAs in non-model plant species whose genomes have yet to be fully sequenced. miRAuto analyzes the expression of the 5′-end position of mapped small RNAs in reference sequences to prevent mRNA fragments from being included as candidate miRNAs.

Researchers at Seoul National University validated the utility of miRAuto on a small RNA dataset, and the results were compared to other publicly available miRNA prediction programs. In conclusion, miRAuto is a fully automated user-friendly tool for predicting miRNAs from small RNA sequencing data in both model and non-model plant species.

Availability – miRAuto is available at http://nature.snu.ac.kr/software/miRAuto.htm .

  • Lee J, Kim DI, Park JH, Choi IY, Shin C. (2013) miRAuto: An automated user-friendly MicroRNA prediction tool utilizing plant small RNA sequencing data. Mol Cells 35(4), 342-7. [abstract]

From the Institute of Genetic Medicine at Johns Hopkins University

TopHat, a popular spliced aligner for RNA-seq experiments, has now been succeeded by TopHat2, which incorporates many significant enhancements. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which occur after genomic translocations. TopHat2 combines the ability to discover novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.

Availability: TopHat2 is available at http://ccb.jhu.edu/software/tophat.

  • Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4), R36. [Epub ahead of print]. [abstract]

The power of deep sequencing technology to reliably detect single RNA reads leads to a paradoxical problem of high sensitivity. In hybridization or PCR based methods for RNA quantification, the concern is low sensitivity, i.e., the problem that the signal from truly expressed genes might not be distinguishable from noise. In contrast, the problem with RNA-seq is that it is not clear whether very low read counts come from lowly expressed genes or merely from transcriptional noise. The frequency distribution of read counts does not show a clear separation into two classes of genes, which makes the decision whether a gene is to be considered expressed or not seemingly arbitrary.

Here, researchers from Yale University address this problem by suggesting a statistical model that considers the number of transcripts detected in an RNA-Seq study as a mixture of two distributions: an exponential distribution for transcripts from inactive genes and a negative binomial distribution for actively transcribed genes. They apply this model to a number of RNA-Seq data sets and find that the model fits the data very well. The calculated criterion for distinguishing between expressed and non-expressed genes is remarkably consistent among data sets, suggesting that genes with more than two transcripts per million transcripts (TPM) are highly likely to be actively transcribed. The regression model correctly identifies the class of genes that are not actively expressed and thus provides an operational criterion for classifying genes into expressed and non-expressed sets, facilitating the interpretation of RNA-Seq data.
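The classification idea can be sketched as a posterior probability under a two-component mixture. The Python below is an illustration only, not the authors' fitting procedure: it assumes integer abundance values, discretizes the exponential component so both components are probability mass functions, and uses made-up parameter values that would in practice be estimated from the data.

```python
import math

def discretized_exp_pmf(x, rate):
    """P(x <= T < x + 1) for an exponential variable T (assumed discretization)."""
    return math.exp(-rate * x) - math.exp(-rate * (x + 1))

def neg_binom_pmf(x, size, mean):
    """Negative binomial probability, parameterized by size and mean."""
    p = size / (size + mean)
    log_pmf = (
        math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
        + size * math.log(p) + x * math.log(1.0 - p)
    )
    return math.exp(log_pmf)

def prob_expressed(x, frac_active, rate, size, mean):
    """Posterior probability that a value x comes from the active component."""
    active = frac_active * neg_binom_pmf(x, size, mean)
    inactive = (1.0 - frac_active) * discretized_exp_pmf(x, rate)
    return active / (active + inactive)

# Hypothetical parameters, chosen only to make the example run.
PARAMS = dict(frac_active=0.6, rate=1.0, size=1.5, mean=50.0)

p_zero = prob_expressed(0, **PARAMS)
p_twenty = prob_expressed(20, **PARAMS)
```

With parameters like these, a value of 0 is assigned almost entirely to the inactive component while a value of 20 is assigned almost entirely to the active one. A threshold such as the 2 TPM figure reported in the post corresponds to the abundance at which this posterior crosses a chosen cutoff.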

  • Wagner GP, Kin K, Lynch VJ. (2013) A model based criterion for gene expression calls using RNA-seq data. Theory Biosci [Epub ahead of print]. [abstract]
