CAGEfightR – analysis of 5′-end data using R/Bioconductor

5′-end sequencing assays, and Cap Analysis of Gene Expression (CAGE) in particular, have been instrumental in studying transcriptional regulation. 5′-end methods provide genome-wide maps of transcription start sites (TSSs) at base-pair resolution. Because active enhancers often feature bidirectional TSSs, such data can also be used to predict enhancer candidates. However, mature and comprehensive computational tools for analyzing 5′-end data remain scarce, which hinders efficient analysis of new and existing datasets.

University of Copenhagen researchers present CAGEfightR, a framework for analysis of CAGE and other 5′-end data, implemented as an R/Bioconductor package. CAGEfightR can import data from BigWig files and allows for fast and memory-efficient prediction and analysis of TSSs and enhancers. Downstream analyses include quantification, normalization, annotation with transcript and gene models, TSS shape statistics, linking TSSs to enhancers via co-expression, identification of enhancer clusters, and genome-browser style visualization. Although CAGEfightR was built to analyze CAGE data, the researchers also demonstrate its utility in analyzing nascent RNA 5′-end data (PRO-Cap). CAGEfightR is implemented using standard Bioconductor classes, making it easy to learn, use and combine with other Bioconductor packages, for example popular differential expression tools such as limma, DESeq2 and edgeR.
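To make the quantification, normalization and TSS prediction steps more concrete, here is a minimal Python sketch of the underlying idea. It is not CAGEfightR's code or API (the package itself is written in R): the function names `tpm`, `pooled_signal` and `tag_clusters`, the `min_signal` and `merge_dist` parameters and the toy counts are all assumptions made for illustration. It scales each sample to tags per million (TPM), sums the samples into a pooled signal, and merges nearby same-strand positions into clusters.

```python
import numpy as np

def tpm(counts):
    """Scale each sample (column) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)
    return counts / totals * 1e6

def pooled_signal(counts):
    """Sum the per-sample TPM values at every position (row)."""
    return tpm(counts).sum(axis=1)

def tag_clusters(positions, pooled, min_signal=0.0, merge_dist=20):
    """Merge same-strand positions with signal above min_signal that lie
    within merge_dist bp of the previous kept position.

    positions must be sorted in ascending order. Both parameters are
    hypothetical names chosen for this sketch.
    """
    clusters = []
    for pos, sig in zip(positions, pooled):
        if sig <= min_signal:
            continue  # skip positions without sufficient support
        if clusters and pos - clusters[-1]["end"] <= merge_dist:
            current = clusters[-1]
            current["end"] = pos
            current["score"] += sig
            if sig > current["peak_signal"]:
                # track the position with the highest pooled signal
                current["peak"], current["peak_signal"] = pos, sig
        else:
            clusters.append({"start": pos, "end": pos, "peak": pos,
                             "peak_signal": sig, "score": sig})
    return clusters

# Toy data (made up): 5 positions on one strand, 2 samples (columns).
positions = [100, 103, 110, 500, 505]
counts = [[2, 4], [10, 20], [1, 2], [5, 10], [2, 4]]

pooled = pooled_signal(counts)
clusters = tag_clusters(positions, pooled)
for c in clusters:
    print(c["start"], c["end"], c["peak"])
```

With this toy input the five positions collapse into two clusters, one spanning 100 to 110 and one spanning 500 to 505, because the gap between positions 110 and 500 exceeds `merge_dist`.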

Introduction to CAGEfightR


a: Overview of CAGEfightR analysis steps. CAGEfightR can import CTSSs (the number of tag 5′-ends mapping to each bp position) and calculate a pooled CTSS signal across all samples (top). The pooled CTSS signal on the same strand can be used to identify unidirectional clusters, or Tag Clusters (TCs), which correspond to groups of nearby TSSs, or bidirectional clusters (BCs), which are candidate enhancers (middle). TCs can furthermore be assigned to genes using annotated gene models and summed to provide an estimate of gene expression (bottom). Each of these levels of analysis is associated with an expression matrix (right). The names of the CAGEfightR functions used for the respective analyses are highlighted.

b: Example of unidirectional clustering. The bottom track shows the pooled CTSS signal (pooled TPM) at each bp along the genome. The middle track shows a Tag Cluster (TC) based on the CTSS data below as a block, where the position with the highest pooled CTSS signal is indicated (TC peak). The top track shows UCSC transcript models (lines/thin blocks/thick blocks are intronic/UTR/CDS regions, respectively).

c: Example of bidirectional clustering to predict enhancers. The bottom track shows the pooled CTSS signal as in panel b, but with signal on both strands (red, negative bars indicate the minus strand and blue, positive bars indicate the plus strand). The middle track shows the balance score (Bhattacharyya coefficient, Additional file 1: Figure S1A) calculated along the genomic region. The top track shows the resulting Bidirectional Cluster (BC) as a block in pink, indicating lack of strand information, where the single bp with the highest balance score is indicated.
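The balance score in panel c can be illustrated with a small Python sketch. The exact definition is given in Additional file 1: Figure S1A, which is not reproduced here, so the following is only one plausible reading and not the package's implementation: it compares the observed split of signal around a candidate midpoint with an ideal divergent site, where all upstream signal is on the minus strand and all downstream signal is on the plus strand. The function name `balance_score`, the `window` parameter, the exclusion of the midpoint itself and the toy signals are assumptions.

```python
import math

def balance_score(plus, minus, center, window=200):
    """Bhattacharyya coefficient between the observed and an ideal
    divergent distribution of signal around `center`.

    plus, minus: dicts mapping bp position -> pooled signal per strand.
    window: bp examined on each side (assumed value; the position at
    `center` itself is ignored in this sketch).
    """
    def total(signal, lo, hi):
        return sum(v for p, v in signal.items() if lo <= p <= hi)

    up_lo, up_hi = center - window, center - 1
    down_lo, down_hi = center + 1, center + window
    observed = [
        total(minus, up_lo, up_hi),      # minus strand upstream (expected)
        total(plus, down_lo, down_hi),   # plus strand downstream (expected)
        total(plus, up_lo, up_hi),       # plus strand upstream (unexpected)
        total(minus, down_lo, down_hi),  # minus strand downstream (unexpected)
    ]
    ideal = [0.5, 0.5, 0.0, 0.0]
    norm = sum(observed)
    if norm == 0:
        return 0.0  # no signal in the window
    # Bhattacharyya coefficient: sum of sqrt(p_i * q_i) over the four bins
    return sum(math.sqrt(o / norm * q) for o, q in zip(observed, ideal))

# Toy signals (made up) around a midpoint at position 100.
divergent = balance_score(plus={110: 5.0}, minus={90: 5.0}, center=100)
one_sided = balance_score(plus={110: 5.0}, minus={}, center=100)
convergent = balance_score(plus={90: 5.0}, minus={110: 5.0}, center=100)
print(divergent, one_sided, convergent)
```

Under these assumptions a perfectly balanced divergent site scores 1, a site with signal on only one strand scores about 0.71, and a convergent arrangement scores 0, which is why a high score marks candidate enhancers.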

CAGEfightR provides a single, scalable and easy-to-use framework for comprehensive downstream analysis of 5′-end data. CAGEfightR is designed to be interoperable with other Bioconductor packages, thereby unlocking hundreds of mature transcriptomic analysis tools for 5′-end data.

Availability – CAGEfightR is freely available via Bioconductor.

Thodberg M, Thieffry A, Vitting-Seerup K, Andersson R, Sandelin A. (2019) CAGEfightR: analysis of 5′-end data using R/Bioconductor. BMC Bioinformatics 20(1):487. [article]
