MGcount – a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fulfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may produce ambiguous genomic alignments, which calls for flexible quantification approaches.

Researchers at Diagenode have developed Multi-Graph count (MGcount), a total-RNA-seq quantification tool that combines two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap at the same genomic position. Next, using a graph-based approach, MGcount aggregates RNA products with similar sequences to which reads systematically multi-map. MGcount outputs a transcriptomic count matrix compatible with downstream RNA-sequencing analysis pipelines, at both bulk and single-cell resolution, together with the graphs that model repeated transcript structures for different biotypes. The software can be used as a Python module or as a single-file executable program.
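The hierarchical assignment can be illustrated with a minimal sketch. This is not MGcount's code or API: the function name, the class labels and the input layout are all hypothetical, and the only thing taken from the published description is the priority order (small-RNA first, then long-RNA exons, then the remaining long-RNA introns).

```python
# Priority order of the assignment rounds, highest first.
# The labels are illustrative, not MGcount identifiers.
ROUNDS = ("small_rna", "long_rna_exon", "long_rna_intron")


def assign_hierarchically(overlaps):
    """Assign each read to the features of the first round in which it overlaps anything.

    overlaps: dict mapping read id -> list of (feature_id, feature_class) pairs.
    Returns a dict mapping read id -> (feature_class, sorted list of feature ids).
    Reads that overlap no feature are left out of the result.
    """
    assigned = {}
    for read_id, hits in overlaps.items():
        for feature_class in ROUNDS:
            features = sorted({f for f, c in hits if c == feature_class})
            if features:
                assigned[read_id] = (feature_class, features)
                break  # later rounds never see a read assigned earlier
    return assigned


# Hypothetical toy input: four reads and the features they overlap.
example = {
    "r1": [("S1", "small_rna"), ("G1", "long_rna_intron")],
    "r2": [("G1", "long_rna_exon"), ("G2", "long_rna_intron")],
    "r3": [("G2", "long_rna_intron")],
    "r4": [],
}
result = assign_hierarchically(example)
```

In this toy input, r1 goes to the small-RNA feature even though it also falls inside a long-RNA intron, which is the multi-overlap case the hierarchy is meant to resolve. A read can still end a round with several features of the same class; those are the multi-mappers handled by the graph step.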

MGcount strategy

Fig. 2

a MGcount takes a set of genomic alignments (BAM files) and a GTF file of RNA feature annotations as inputs. The algorithm assigns reads hierarchically and then models multi-mapping assignments in a graph using Rosvall's map equation. As output, MGcount provides an RNA expression count matrix (in which feature communities are collapsed into newly defined features), a feature metadata table and the graphs.

b Illustration of how the hierarchical assignment can resolve multi-overlappers: reads that map to both small-RNA and long-RNA features are assigned to small-RNA in the first round; reads that map to both long-RNA introns and long-RNA exons are assigned to long-RNA exons in the second round; the remaining reads are assigned in the last round.

c Illustration of how MGcount generates the multi-mapping graphs for small-RNA and long-RNA exons. Reads ri (i = 1 to 10) have been hierarchically assigned to S1, S2, S3, S4, S5 (small-RNA biotypes, yellow) and G1, G2 (long-RNA biotypes, blue). Each vertex in the directed multi-mapping graphs (right) corresponds to a feature and has a size proportional to the logarithm of its number of alignments. Edges connect vertices that share multi-mapping reads, with weights equal to the number of shared multi-mappers normalized by the total number of alignments of the source vertex. Hence, the weight of the edge from S1 to S2 is 3/4 (reads mapping to both S1 and S2 divided by reads aligned to S1). (CB: Cell Barcode, UMI: Unique Molecular Identifier)
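The edge-weight rule in panel c can be reproduced with a short sketch. It is not MGcount's implementation: the function name and the toy reads are hypothetical (the figure itself is not reproduced here), and it assumes each read contributes exactly one alignment to every feature it is assigned to.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations


def multimapping_edge_weights(read_to_features):
    """Compute directed edge weights between features that share reads.

    read_to_features: dict mapping read id -> list of assigned feature ids.
    Returns a dict mapping (source, target) -> Fraction, where the weight is
    the number of reads shared by both features divided by the number of
    alignments of the source feature.
    """
    alignments = defaultdict(int)  # feature -> number of alignments
    shared = defaultdict(int)      # (source, target) -> number of shared reads
    for features in read_to_features.values():
        unique = set(features)
        for feature in unique:
            alignments[feature] += 1
        # Every ordered pair of co-assigned features gets one shared read.
        for source, target in permutations(unique, 2):
            shared[(source, target)] += 1
    return {
        (source, target): Fraction(count, alignments[source])
        for (source, target), count in shared.items()
    }


# Hypothetical toy data: S1 has 4 alignments, 3 of them shared with S2;
# S2 has 5 alignments in total.
toy_reads = {
    "r1": ["S1", "S2"],
    "r2": ["S1", "S2"],
    "r3": ["S1", "S2"],
    "r4": ["S1"],
    "r5": ["S2"],
    "r6": ["S2"],
}
weights = multimapping_edge_weights(toy_reads)
```

With these toy reads the edge from S1 to S2 has weight 3/4, matching the caption's example, while the reverse edge from S2 to S1 has weight 3/5. The asymmetry is why the graph is directed: the same shared reads represent a different fraction of each feature's alignments.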

MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms.

Availability – Both source code and compiled software are available at https://github.com/hitaandrea/MGcount.

Hita A, Brocart G, Fernandez A, Rehmsmeier M, Alemany A, Schvartzman S. (2022) MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts. BMC Bioinformatics 23(1):39.
