Proteogenomic database construction driven from large scale RNA-seq data

The advent of inexpensive RNA-seq and other deep sequencing technologies for RNA promises to radically improve genomic annotation by providing information on transcribed regions and splicing events under a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, integrating large amounts of redundant RNA-seq data with mass spectrometry data is a challenging problem. Our manuscript addresses this by constructing a compact database that contains all useful information expressed in the RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to a 410 MB splice graph database written in FASTA format. This corresponds to a roughly 1000-fold compression of data size without loss of sensitivity. We performed a proteogenomics study with this custom database using a completely automated pipeline and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicings, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame-shifts, 1166 reverse-strands, and 42 translated UTRs. Our results highlight the usefulness of integrating transcriptomic and proteomic data for improved genome annotation.
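The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it reads the reported sizes as 496.2 GB and 410 MB, assumes decimal units (1 GB = 1000 MB), and uses variable names that are not part of the original work.

```python
# Sizes as reported in the abstract; decimal units are an assumption.
sam_size_mb = 496.2 * 1000   # aligned RNA-seq SAM files, in MB
db_size_mb = 410             # splice graph FASTA database, in MB
ratio = sam_size_mb / db_size_mb

# Novel event counts as listed in the abstract.
novel_events = {
    "novel genes": 215,
    "novel exons": 808,
    "alternative splicings": 12,
    "gene-boundary corrections": 618,
    "exon-boundary changes": 245,
    "frame-shifts": 938,
    "reverse-strands": 1166,
    "translated UTRs": 42,
}
total = sum(novel_events.values())

print(f"compression ratio: about {ratio:.0f}x")
print(f"total novel events: {total}")
```

Under these assumptions the ratio comes out somewhat above 1000, consistent with the order-of-magnitude compression stated in the text, and the per-category counts add up to the reported total of 4044.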


