Quality control checks are the first step in RNA-Sequencing analysis and enable the identification of common issues in the sequenced reads. Checks for sequence quality, contamination, and complexity are commonplace and allow users to implement downstream steps that account for these issues. The strand-specificity of reads, however, is frequently overlooked and is often unavailable even in published data; when unknown or incorrectly specified, it can have detrimental effects on the reproducibility and accuracy of downstream analyses.
To address this issue, researchers at the University of Tasmania and the University of Technology Sydney developed how_are_we_stranded_here, a Python library that quickly infers the strandedness of paired-end RNA-Sequencing data. Testing on both simulated and real RNA-Sequencing reads showed that it correctly measures strandedness, and that measures outside the normal range may indicate sample contamination.
Strandedness proportions in RNA-Seq data
Strandedness proportions were evaluated for 20 studies each of H. sapiens, S. cerevisiae, and A. thaliana using how_are_we_stranded_here while varying the number of input reads sampled. Results are not included where zero reads were pseudoaligned, and triangles denote results where the proportion of reads pseudoaligned is less than 0.1. Studies for which the strandedness proportion was between 0.6 and 0.9 (dashed lines), and those which do not match the reported strandedness, are highlighted. how_are_we_stranded_here was run using the full Ensembl cDNA annotation for each species.
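The 0.6 and 0.9 dashed lines in the figure can be read as a simple decision rule. The Python sketch below is only an illustration of that rule, not the library's own code or API: the function name `classify_strandedness`, the labels it returns, the use of forward and reverse read counts as input, and the inclusive handling of the boundaries are all assumptions made for this example.

```python
def classify_strandedness(forward_reads: int, reverse_reads: int,
                          lower: float = 0.6, upper: float = 0.9) -> str:
    """Label a sample from counts of reads explained by each orientation.

    Hypothetical helper: the thresholds mirror the dashed lines in the
    figure, but the labels and boundary handling are assumptions.
    """
    total = forward_reads + reverse_reads
    if total == 0:
        # Nothing to measure, e.g. when no reads were pseudoaligned.
        return "no data"

    # Proportion of reads supporting the majority orientation (0.5 to 1.0).
    proportion = max(forward_reads, reverse_reads) / total

    if proportion > upper:
        return "stranded"
    if proportion >= lower:
        # Between the dashed lines: worth a closer look, since such
        # values may indicate a problem such as sample contamination.
        return "ambiguous"
    return "unstranded"
```

For example, under these assumptions a sample with 75 forward and 25 reverse reads has a proportion of 0.75 and would be labelled "ambiguous", while a 95 to 5 split would be labelled "stranded".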
Availability – how_are_we_stranded_here is freely available at https://github.com/betsig/how_are_we_stranded_here.