Detecting sources of bias in transcriptomic data is essential for identifying signals of biological significance. Researchers from the University of London outline a novel method to detect sequence-specific bias in short-read next-generation sequencing data. The method is based on determining intra-exon correlations between specific motifs, and requires only the mild assumption that short reads sampled from specific regions of the same exon will be correlated with each other. It has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. In the wild-type Drosophila data set, variation due to motif GC content is more significant than variation due to exon GC content. The software is available online and could be applied to cross-experiment transcriptome data analysis in eukaryotes.
Overview of method for quantifying sequence-specific deviations in read distribution
Phase I, the distributed phase, comprises three map steps and a reduce step on Apache Spark, with intermediate data stored on HDFS. Phase II, the non-distributed counts analysis phase, uses the raw motif count and position data generated by Phase I, which are stored on the local file system.
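The split between a distributed counting phase and a local analysis phase can be illustrated with a minimal sketch in plain Python rather than Spark. It is not the published pipeline: the motif length K, the choice to take the motif from the start of each read, the toy read records, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from functools import reduce

K = 4  # hypothetical motif length, chosen only for this example


def map_read_to_motif(record):
    """Map step: turn one (exon_id, read) record into a one-entry motif count."""
    exon_id, read = record
    motif = read[:K]  # assumption: the motif is the first K bases of the read
    if len(motif) < K:
        return Counter()
    return Counter({(exon_id, motif): 1})


def merge_counts(left, right):
    """Reduce step: merge two partial count tables by summing shared keys."""
    return left + right


def gc_content(motif):
    """Fraction of G and C bases in a motif."""
    return sum(base in "GC" for base in motif) / len(motif)


def counts_by_gc(motif_counts):
    """Local analysis step: total motif counts per (exon, motif GC content)."""
    totals = Counter()
    for (exon_id, motif), n in motif_counts.items():
        totals[(exon_id, gc_content(motif))] += n
    return totals


# Toy input: (exon_id, read sequence) pairs, invented for this sketch.
reads = [
    ("exon1", "GCGCATTA"),
    ("exon1", "GCGCTTAA"),
    ("exon1", "ATATGGCC"),
    ("exon2", "GCGCAAAA"),
]

# "Phase I": map every read, then reduce the partial counts into one table.
motif_counts = reduce(merge_counts, map(map_read_to_motif, reads), Counter())

# "Phase II": analyse the raw counts locally.
gc_totals = counts_by_gc(motif_counts)
```

In a Spark setting the map and reduce functions would be applied to a distributed collection of reads instead of a Python list, while the final grouping by GC content could still run on a single machine, because the count table is far smaller than the read data.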
Availability – All of the software developed for this paper is available on request or can be downloaded directly from https://doi.org/10.5281/zenodo.801378.