May
24
CoRAL: predicting non-coding RNAs from small RNA-sequencing data
Filed Under Analysis Pipelines, Annotation, Other Tools
The surprising observation that virtually the entire human genome is transcribed means we know little about the function of many emerging classes of RNAs, except their astounding diversities. Traditional RNA function prediction methods rely on sequence or alignment information, which are limited in their abilities to classify the various collections of non-coding RNAs (ncRNAs). To address this, researchers from the University of Pennsylvania developed Classification of RNAs by Analysis of Length (CoRAL), a machine learning-based approach for classification of RNA molecules. CoRAL uses biologically interpretable features including fragment length and cleavage specificity to distinguish between different ncRNA populations. They evaluated CoRAL using genome-wide small RNA sequencing data sets from four human tissue types and were able to classify six different types of RNAs with ∼80% cross-validation accuracy. Analysis by CoRAL revealed that microRNAs, small nucleolar and transposon-derived RNAs are highly discernible and consistent across all human tissue types assessed, whereas long intergenic ncRNAs, small cytoplasmic RNAs and small nuclear RNAs show less consistent patterns. The ability to reliably annotate loci across tissue types demonstrates the potential of CoRAL to characterize ncRNAs using small RNA sequencing data in less well-characterized organisms.
Availability – The CoRAL source code, required genome annotation files, and prediction results are available at http://wanglab.pcbi.upenn.edu/coral.
- Leung YY, Ryvkin P, Ungar LH, Gregory BD, Wang LS. (2013) CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res [Epub ahead of print]. [article]
May
22
Graphite Web: web tool for gene set analysis exploiting pathway topology
Filed Under Pathway Analysis
Graphite Web is a novel web tool for pathway analyses and network visualization for gene expression data of both microarray and RNA-seq experiments. Several pathway analyses have been proposed either in the univariate or in the global and multivariate context to tackle the complexity and the interpretation of expression results. These methods can be further divided into ‘topological’ and ‘non-topological’ methods according to their ability to gain power from pathway topology. Biological pathways are, in fact, not only gene lists but can be represented through a network where genes and connections are, respectively, nodes and edges. To this day, the most used approaches are non-topological and univariate, although they miss the relationships among genes. By contrast, topological and multivariate approaches are more powerful but difficult for researchers without bioinformatic skills to use.
Here, researchers from the University of Padova, Italy, present Graphite Web, the first public web server for pathway analysis on gene expression data that combines topological and multivariate pathway analyses with an efficient system of interactive network visualizations for easy interpretation of results. Specifically, Graphite Web implements five different gene set analyses on three model organisms and two pathway databases.

Availability – Graphite Web is freely available at http://graphiteweb.bio.unipd.it/.
- Sales G, Calura E, Martini P, Romualdi C. (2013) Graphite Web: web tool for gene set analysis exploiting pathway topology. Nucleic Acids Res [Epub ahead of print]. [article]
May
21
Voom! Precision weights unlock linear model analysis tools for RNA-Seq read counts
Filed Under Analysis Pipelines, Expression and Quantification
Voom: variance modelling at the observation-level
In the past few years, RNA-seq has emerged as a revolutionary new technology for expression profiling. RNA-seq expression data consists of read counts, and many recent publications have argued therefore that RNA-seq data should be analysed by statistical methods designed specifically for counts. Yet all the statistical methods developed for RNA-seq counts rely on approximations of various kinds.
This article revisits the idea of applying normal-based microarray-like statistical methods to RNA-seq read counts, with the idea that it is more important to model the mean-variance relationship correctly than it is to specify the exact probabilistic distribution of the counts. Log-counts per million are used as expression values. The voom method estimates the mean-variance relationship robustly and generates a precision weight for each individual normalized observation. The normalized log-counts per million and associated precision weights are then entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision weight aware. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays, allowing RNA-seq and microarray data to be analysed in closely comparable ways. The performance of voom and related limma-based pipelines is compared to that of edgeR, DESeq, baySeq, TSPM, PoissonSeq, and DSS. Simulation studies show that voom out-performs previous RNA-seq methods even when the data is generated according to the assumptions of the earlier methods. This is especially true when the sequence depths vary between RNA samples. Several data sets are also analysed to demonstrate how voom can handle heterogeneous data and complex experiments as well as facilitating pathway analysis and gene set testing methods.
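As a rough illustration of the expression values described above, the sketch below computes log2 counts per million for a small count matrix. The 0.5 and 1 offsets follow the convention used by voom to avoid taking the log of zero. The function name `log_cpm` and the toy counts are hypothetical, and the mean-variance trend and precision-weight estimation are not shown here.

```python
import math

def log_cpm(counts, lib_sizes=None):
    """Return log2 counts per million.

    counts: list of genes, each a list of per-sample read counts.
    lib_sizes: optional per-sample library sizes; defaults to column sums.
    """
    n_samples = len(counts[0])
    if lib_sizes is None:
        lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    return [
        [
            # Offsets keep the value finite when a count is zero.
            math.log2((gene[j] + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
            for j in range(n_samples)
        ]
        for gene in counts
    ]

# Hypothetical toy data: 3 genes x 2 samples with unequal sequencing depth.
toy_counts = [[10, 20], [0, 5], [90, 175]]
values = log_cpm(toy_counts)
```

In a real analysis the weights would come from limma's `voom` function in R; this Python fragment only mirrors the transformation step.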
May
20
A bi-Poisson model for clustering gene expression profiles by RNA-seq
Filed Under Analysis Pipelines, Expression and Quantification
With the availability of gene expression data by RNA-seq, powerful statistical approaches for grouping similar gene expression profiles across different environments have become increasingly important. A team led by researchers at Penn State University describe and assess a computational model for clustering genes into distinct groups based on the pattern of gene expression in response to changing environment. The model capitalizes on the Poisson distribution to capture the count property of RNA-seq data. A two-stage hierarchical expectation-maximization (EM) algorithm is implemented to estimate an optimal number of groups and mean expression amounts of each group across two environments. A procedure is formulated to test whether and how a given group shows a plastic response to environmental changes. The impact of gene-environment interactions on the phenotypic plasticity of the organism can also be visualized and characterized. The model was used to analyse an RNA-seq dataset measured from two cell lines of breast cancer that respond differently to an anti-cancer drug, from which genes associated with the resistance and sensitivity of the cell lines are identified. They performed simulation studies to validate the statistical behaviour of the model. The model provides a useful tool for clustering gene expression data by RNA-seq, facilitating understanding of gene functions and networks.
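To make the clustering idea above more concrete, here is a minimal sketch of EM for a mixture of Poisson distributions over two environments. It is not the authors' two-stage hierarchical algorithm: the number of groups is fixed in advance rather than estimated, no plasticity test is performed, and the function names, starting means and toy counts are all hypothetical.

```python
import math

def poisson_logpmf(x, lam):
    """Log probability of count x under a Poisson with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def em_bipoisson(data, init_means, n_iter=50):
    """Fit a K-group Poisson mixture to (env1, env2) count pairs.

    data: list of (count_env1, count_env2) tuples, one per gene.
    init_means: list of (mean_env1, mean_env2) starting values, one per group.
    """
    k = len(init_means)
    means = [list(m) for m in init_means]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each group for each gene.
        resp = []
        for x1, x2 in data:
            logs = [
                math.log(weights[c])
                + poisson_logpmf(x1, means[c][0])
                + poisson_logpmf(x2, means[c][1])
                for c in range(k)
            ]
            top = max(logs)  # subtract the max for numerical stability
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: update mixing weights and per-environment means.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = max(mass / len(data), 1e-12)
            for e in range(2):
                weighted = sum(r[c] * x[e] for r, x in zip(resp, data))
                means[c][e] = max(weighted / max(mass, 1e-12), 1e-6)
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return weights, means, labels

# Hypothetical toy genes: three stable across environments, three that change.
genes = [(5, 6), (4, 5), (6, 4), (50, 5), (55, 4), (48, 6)]
weights, means, labels = em_bipoisson(genes, init_means=[(5, 5), (40, 5)])
```

A group whose two fitted means differ markedly, like the second group here, is the kind of pattern the post describes as a plastic response; deciding whether the difference is significant would need the testing procedure from the paper.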

- Wang N, Wang Y, Hao H, Wang L, Wang Z, Wang J, Wu R. (2013) A bi-Poisson model for clustering gene expression profiles by RNA-seq. Brief Bioinform [Epub ahead of print]. [abstract]
May
20
Optimizing de novo assembly of short-read RNA-seq data for phylogenomics
Filed Under Analysis Pipelines, Other Tools
RNA-seq has shown huge potential for phylogenomic inferences in non-model organisms. However, errors, incompleteness, and redundant assembled transcripts for each gene in de novo assembly of short reads cause noise in analyses and a large amount of missing data in the aligned matrix. To address these problems, researchers at the University of Michigan compared de novo assemblies of paired-end 90 bp RNA-seq reads using Oases, Trinity, Trans-ABySS and SOAPdenovo-Trans to transcripts from the genome annotation of the model plant Ricinus communis. By doing so, they evaluated strategies for optimizing total gene coverage and minimizing assembly chimeras and redundancy.
They found that the frequency and structure of chimeras vary dramatically among different software packages. The differences were largely due to the number of trans-self chimeras that contain repeats in the opposite direction. More than half of the total chimeras in Oases and Trinity were trans-self chimeras. Within each package, they found a trade-off between maximizing reference coverage and minimizing redundancy and chimera rate.
In order to reduce redundancy, they investigated three methods.
May
16
ReXpress – for updating abundance estimates from RNA-Seq experiments upon re-annotation
Filed Under Expression and Quantification, Other Tools
The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or a previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analyses such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example upon the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses.
Researchers at UC Berkeley have developed a novel, efficient algorithm for updating abundance estimates from RNA-Seq experiments upon re-annotation that does not require re-analysis of the entire dataset. Their approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. They demonstrate the effectiveness of their methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, they provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised.
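One plausible way to picture the partitioning step is sketched below. It assumes, and this is only an assumption for illustration, that two transcripts can influence each other's abundance when some read maps to both, so the transcripts affected by a changed isoform are those in the same connected component. This is not the ReXpress implementation; the function name and the toy alignments are hypothetical.

```python
def affected_transcripts(read_alignments, changed):
    """Return transcripts linked to any changed isoform through shared reads.

    read_alignments: list of sets, each holding the transcript ids one read maps to.
    changed: set of transcript ids added or deleted by re-annotation.
    """
    parent = {}

    def find(t):
        # Union-find lookup with path halving.
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for targets in read_alignments:
        ordered = sorted(targets)
        for t in ordered:
            find(t)  # register transcripts hit by uniquely mapping reads too
        for other in ordered[1:]:
            union(ordered[0], other)  # a multi-mapping read links its targets

    roots = {find(t) for t in changed}
    return {t for t in list(parent) if find(t) in roots}

# Hypothetical toy alignments: A, B and C are linked; D and E are linked.
reads = [{"A", "B"}, {"B", "C"}, {"D"}, {"D", "E"}]
to_update = affected_transcripts(reads, changed={"C"})
```

Under this assumption only the component containing the changed isoform would need re-estimation, which is the intuition behind avoiding a full rerun.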

Availability – ReXpress is freely available, together with source code, at http://bio.math.berkeley.edu/ReXpress/
Contact: lpachter@math.berkeley.edu
- Roberts A, Schaeffer L, Pachter L. (2013) Updating RNA-Seq analyses after re-annotation. Bioinformatics [Epub ahead of print]. [abstract]
May
15
Sequence Comparative Analysis using Networks (SCAN) – software for evaluating de novo transcript assembly from RNA-Seq Data
Filed Under Transcriptome Assembly Tools
DNA sequencing technology is becoming more accessible to a variety of researchers as costs continue to decline. As researchers begin to sequence novel transcriptomes, most of these datasets lack a reference genome and will have to rely on de novo assemblers. Making comparisons across assemblies can be difficult: each program has its strengths and weaknesses and no tool exists to comparatively evaluate these datasets.
Now, a team led by researchers at the University of Rhode Island have developed software in R, called Sequence Comparative Analysis using Networks (SCAN) to perform statistical comparisons between distinct assemblies. SCAN uses a reference dataset to identify the most accurate de novo assembly and the ‘good’ transcripts in the user’s data. They tested SCAN on 3 publicly available transcriptomes, each assembled using 3 assembly programs. Moreover, they sequenced the transcriptome of the oomycete Achlya hypogyna and compared de novo assemblies from Velvet, ABySS, and the CLC Genomics Workbench assembly algorithms. One thousand one hundred and twenty eight (1,128) of the CLC transcripts were statistically similar to the reference, compared to 49 of the Velvet transcripts and 937 of the ABySS transcripts. SCAN’s strength is providing statistical support for transcript assemblies in a biological context. However, SCAN is designed to compare distinct node sets in networks, therefore it can also easily be extended to perform statistical comparisons on any network graph regardless of what the nodes represent.

Availability – Two versions of SCAN were developed, “SCAN” and “SCAN stringent,” which can run on either single or multiprocessor nodes and are available from http://evol-net.fr.
- Misner I, Bicep C, Lopez P, Halary S, Bapteste E, Lane CE. (2013) Sequence Comparative Analysis using Networks (SCAN): software for evaluating de novo transcript assembly from next generation sequencing. Mol Biol Evol [Epub ahead of print]. [abstract]
May
13
FlyBase RNA-Seq RPKM data calculations available for bulk download
Filed Under Analysis Pipelines
from flybase.org
FlyBase is extending its initial gene-level analyses of RNA-seq throughput data from modENCODE and others. The algorithm for RPKM (reads per kilobase per million mapped reads) has been refined, additional datasets have been analyzed, and these data are now available for bulk download.
In order to summarize this type of data at the gene level, it is necessary first to determine a single value for the expression level of each gene for each RNA-seq sample. RNA-seq coverage data are intersected with FlyBase exons, based on the gene model annotations of the current release, to calculate a single value reflecting average coverage per kb per gene. Each gene data point is then classified into one of eight expression level bins, and the graphical and text summaries are produced from the binned values. A more detailed explanation may be found at FBrf0221009.
Bulk data files can be accessed from the Precomputed Data Files page (menu: Files → Current Release). Look in the Genes section; the item line is ‘RNA-Seq RPKM values’.
Simple and combinatorial queries of RPKM expression data can be conducted using the ‘RNA-Seq Search’ option found under the ‘Expression’ tab in the Quick Search tool.
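For readers unfamiliar with the unit, the generic RPKM definition given above (reads per kilobase per million mapped reads) can be written as a short sketch. This is the textbook formula only; FlyBase's refined algorithm and the thresholds of its eight expression bins are not described in this post and are not reproduced here. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) is the same as multiplying by 1e9.
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical example: 500 reads on a 2 kb gene in a 10-million-read sample.
example = rpkm(500, 2000, 10_000_000)
```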
May
10
Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard
Filed Under Transcriptome Assembly Tools
Isoform reconstruction is a key step in RNA-Seq analysis. Tools such as CEM, iReckon, NSMAP, and MonteBello use maximum likelihood for isoform reconstruction. The maximum likelihood approach has been observed to be computationally expensive. Here, researchers from Tsinghua University, China show that isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard.
- Li T, Jiang R, Zhang X. (2013) Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard. arXiv:1305.0916 [q-bio.QM]. [article]
May
3
QualitySNPng – for the detection and interactive visualization of SNPs without a sequenced reference genome
Filed Under SNP Detection
QualitySNPng is a new software tool developed at Wageningen University, The Netherlands, for the detection and interactive visualization of single-nucleotide polymorphisms (SNPs). It uses a haplotype-based strategy to identify reliable SNPs and is optimized for the analysis of current RNA-Seq data, but it can also be used on genomic DNA sequences derived from next-generation sequencing experiments. QualitySNPng does not require a sequenced reference genome and delivers reliable SNPs for diploid as well as polyploid species. The tool features a user-friendly interface, multiple filtering options to handle typical sequencing errors, support for SAM and ACE files, and interactive visualization. QualitySNPng produces high-quality SNP information that can be used directly in genotyping-by-sequencing approaches for application in QTL and genome-wide association mapping as well as to populate SNP arrays.
Availability – The software can be used as a stand-alone application with a graphical user interface or as part of a pipeline system like Galaxy. Versions for Windows, Mac OS X and Linux, as well as the source code, are available from http://www.bioinformatics.nl/QualitySNPng.
- Nijveen H, van Kaauwen M, Esselink DG, Hoegen B, Vosman B. (2013) QualitySNPng: a user-friendly SNP detection and visualization tool. Nucleic Acids Res [Epub ahead of print]. [article]
Apr
30
miRAuto: An automated user-friendly MicroRNA prediction tool utilizing plant small RNA sequencing data
Filed Under Other Tools
MicroRNAs (miRNAs) are a class of small RNAs that post-transcriptionally regulate gene expression in animals and plants. The recent rapid advancement in miRNA biology, including high-throughput sequencing of small RNA libraries, inspired the development of a bioinformatics software, miRAuto, which predicts putative miRNAs in model plant genomes computationally. Furthermore, miRAuto enables users to identify miRNAs in non-model plant species whose genomes have yet to be fully sequenced. miRAuto analyzes the expression of the 5′-end position of mapped small RNAs in reference sequences to prevent the possibility of mRNA fragments being included as candidate miRNAs.
Researchers at Seoul National University validated the utility of miRAuto on a small RNA dataset, and the results were compared to other publicly available miRNA prediction programs. In conclusion, miRAuto is a fully automated user-friendly tool for predicting miRNAs from small RNA sequencing data in both model and non-model plant species.
Availability – miRAuto is available at http://nature.snu.ac.kr/software/miRAuto.htm.
- Lee J, Kim DI, Park JH, Choi IY, Shin C. (2013) miRAuto: An automated user-friendly MicroRNA prediction tool utilizing plant small RNA sequencing data. Mol Cells 35(4), 342-7. [abstract]
Apr
29
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
Filed Under Splicing and Junction Mapping, Transcriptome Assembly Tools
TopHat, a popular spliced aligner for RNA-seq experiments, has now been succeeded by TopHat2, which incorporates many significant enhancements. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which occur after genomic translocations. TopHat2 combines the ability to discover novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.
Availability: TopHat2 is available at http://ccb.jhu.edu/software/tophat.
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4), R36. [Epub ahead of print]. [abstract]
Apr
26
Exponential-negative-Binomial model – for gene expression calls using RNA-Seq data
Filed Under Expression and Quantification
The power of deep sequencing technology to reliably detect single RNA reads leads to a paradoxical problem of high sensitivity. In hybridization or PCR based methods for RNA quantification, the concern is low sensitivity, i.e., the problem that the signal from truly expressed genes might not be distinguishable from noise. In contrast, the problem with RNA-seq is that it is not clear whether very low read counts come from genes expressed at a low level or merely from transcriptional noise. The frequency distribution of read counts does not show a clear separation into two classes of genes, which makes the decision whether a gene is to be considered expressed or not seemingly arbitrary.
Here, researchers from Yale University address this problem by suggesting a statistical model that considers the number of transcripts detected in an RNA-Seq study as a mixture of two distributions: an exponential distribution for transcripts from inactive genes and a negative binomial distribution for actively transcribed genes. They apply this model to a number of RNA-Seq data sets and find that the model fits the data very well. The calculated criterion for distinguishing between expressed and non-expressed genes is remarkably consistent among data sets, suggesting that genes with more than two transcripts per million transcripts (TPM) are highly likely to be actively transcribed. The regression model correctly identifies the class of genes that are not actively expressed and thus provides an operational criterion for classifying genes into expressed and non-expressed sets, facilitating the interpretation of RNA-Seq data.
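The operational criterion mentioned above, more than two TPM, is easy to apply once TPM values are available. The sketch below uses the common length-normalized definition of TPM and a simple threshold; it does not fit the exponential and negative binomial mixture itself, and the function names and toy numbers are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [r / total * 1e6 for r in rates]

def call_expressed(tpm_values, threshold=2.0):
    """Flag genes whose TPM exceeds the threshold (2 TPM, as reported in the post)."""
    return [v > threshold for v in tpm_values]

# Hypothetical toy inputs.
toy_tpm = tpm([1000, 1, 500], [1000, 2000, 500])
calls = call_expressed([0.3, 2.0, 2.5, 120.0])
```

The comparison is strict, so a gene at exactly 2 TPM is not called expressed, matching the phrase "more than two" in the summary.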
- Wagner GP, Kin K, Lynch VJ. (2013) A model based criterion for gene expression calls using RNA-seq data. Theory Biosci [Epub ahead of print]. [abstract]