Researchers improve the annotation of disease-relevant genes in RNA-sequencing data

Growing evidence suggests that human gene annotation remains incomplete; however, it is unclear how this affects different tissues and our understanding of different disorders. A team led by researchers at University College London detected previously unannotated transcription in Genotype-Tissue Expression (GTEx) RNA-sequencing data across 41 human tissues. They connected this unannotated transcription to known genes, confirming that human gene annotation remains incomplete even among well-studied genes, including 63% of the Online Mendelian Inheritance in Man (OMIM)-morbid catalog and 317 neurodegeneration-associated genes. The researchers found the greatest abundance of unannotated transcription in brain, and genes highly expressed in brain were more likely to be reannotated. They explored examples of reannotated disease genes, such as SNCA, for which they experimentally validated a previously unidentified, brain-specific, potentially protein-coding exon. The researchers anticipate that this resource will facilitate more accurate genetic analysis, with the greatest impact on our understanding of Mendelian and complex neurogenetic disorders.

Optimization of the detection of transcription

(A) Transcription in the form of expressed regions (ERs) was detected in an annotation-agnostic manner across 41 human tissues. The mean coverage cutoff (MCC) is the number of reads supporting each base above which that base is considered transcribed, and the max region gap (MRG) is the maximum number of bases between ERs below which adjacent ERs are merged. MCC and MRG parameters were optimized for each tissue using the nonoverlapping exons from the Ensembl v92 reference annotation. (B) Line plot illustrating the selection of the MCC and MRG that minimized the difference between ER and exon definitions (median exon delta). (C) Line plot illustrating the selection of the MCC and MRG that maximized the number of ERs that precisely matched exon definitions (exon delta = 0). Cerebellum is plotted in (B) and (C) and is representative of the other GTEx tissues. Green and red lines indicate the optimal MCC (2.6) and MRG (70), respectively.
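To make the roles of the two parameters concrete, the following is a minimal Python sketch of how ERs could be called from per-base coverage. It is not the authors' pipeline: the function names (`call_expressed_regions`, `exon_delta`), the half-open coordinate convention, the choice to merge when the gap is less than or equal to the MRG, and the definition of exon delta as the summed absolute difference in start and end positions are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# A region is a half-open interval [start, end) in base coordinates (assumed convention).
Region = Tuple[int, int]


def call_expressed_regions(coverage: Sequence[float], mcc: float, mrg: int) -> List[Region]:
    """Hypothetical helper: call ERs from per-base coverage.

    A base counts as transcribed when its coverage is above `mcc`.
    Adjacent ERs are merged when the gap between them is at most `mrg`
    bases (the exact boundary condition is an assumption).
    """
    # Step 1: collect contiguous runs of bases whose coverage exceeds the MCC.
    regions: List[Region] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > mcc:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))

    # Step 2: merge neighbouring runs separated by a gap no larger than the MRG.
    merged: List[Region] = []
    for region in regions:
        if merged and region[0] - merged[-1][1] <= mrg:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)
    return merged


def exon_delta(er: Region, exon: Region) -> int:
    """Assumed definition: summed absolute difference between ER and exon boundaries."""
    return abs(er[0] - exon[0]) + abs(er[1] - exon[1])


# Toy coverage vector (made-up values, not GTEx data).
toy_coverage = [0, 3, 3, 0, 0, 3, 3, 0, 0, 0, 0, 3]
toy_ers = call_expressed_regions(toy_coverage, mcc=2.6, mrg=2)
```

In this toy vector, the two runs separated by a 2-base gap are merged into one ER, while the final base, 4 bases away, stays separate. Under the assumed definition, an ER whose boundaries coincide exactly with an annotated exon has an exon delta of 0, which is the quantity counted in panel (C).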

Availability – Code used to perform the analyses in this study is publicly available at https://github.com/dzhang32/ER_paper_2019_supp_code. All tissue-specific transcriptomes are available through vizER: http://rytenlab.com/browser/app/vizER.

Zhang D, Guelfi S, Garcia-Ruiz S, Costa B, Reynolds RH, D’Sa K, Liu W, Courtin T, Peterson A, Jaffe AE, Hardy J, Botía JA, Collado-Torres L, Ryten M. (2020) Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders. Sci Adv 6(24): eaay8299. [article]
