RNA-Seq is currently used routinely, and it provides accurate information on gene transcription. However, the method cannot accurately estimate duplicated genes expression. Several strategies have been previously used (drop duplicated genes, distribute uniformly the reads, or estimate expression), but all of them provide biased results.
Researchers from MIAT INRA provide here a tool, called mmquant, for computing gene expression, included duplicated genes. If a read maps at different positions, the tool detects that the corresponding genes are duplicated; it merges the genes and creates a merged gene. The counts of ambiguous reads is then based on the input genes and the merged genes.
Ambiguous reads resolution
a: The read is included in both genes A and B. Here, the resolution cannot be solved, and the read will be attributed to gene A–B. b: The read is not totally included in gene A, neither in gene B. n Anucleotides of the read overlap with gene A, and n B overlap with gene B, and n A >n B . If n A ≫n B , we may attribute the read to gene A only. However, if n A ≈n B , the ambiguity cannot be resolved, and the read is attributed to A–B. The two following cases show the rules to resolve ambiguity. c: We suppose here that n A >n B +N, where N is a parameter set by the user (default: 30). In this case, mmquant will attribute the read to gene A only. d: We suppose here that n A >n B ×P, where P is given by the user (default: 2). The read will be attributed uniquely to gene A. e: Here, the single end read contains an intron. Exon-wise, the read can be attributed to gene A or B. In case of ambiguity, introns are compared. The intron of the read matches the intron of gene A, whereas gene B has no intron there. The read is thus attributed to A. f: The read is ambiguous exon-wise. We compute nA′ and nB′, the number of nucleotides shared by the intron of the read and the introns of genes A and B respectively. Ambiguity is solved using nA′, nB′ and the rules given in c and d. g: The read is paired-end. In case of ambiguity, n A and n B are computed as the sums of the overlapping bases between the two reads and the gene A and B respectively. The rules presented in c and d apply next
Availability – Project home page: https://bitbucket.org/mzytnicki/multi-mapping-counter