RNA sequencing is a powerful method for building reference transcriptome assemblies and, from them, sample-specific protein databases for mass spectrometry-based analyses. This novel proteomics informed by transcriptomics (PIT) workflow improves sample-specific characterization of dynamic proteomes, and especially of non-model organism proteomes, and it also helps to identify novel gene products. As such proteogenomics applications grow in popularity, more researchers need analysis strategies that deliver high-quality results while remaining resource-friendly and easy to use. Most PIT applications so far rely on Trinity, the de novo assembly tool used when the workflow was first introduced.
To help potential users get started with PIT, researchers from the Max Planck Institute for Molecular Genetics compared the main performance criteria of Trinity and alternative RNA assembly tools known from the transcriptomics field, including Oases, SOAPdenovo-Trans, and Trans-ABySS. Using exemplary data sets and software-specific default parameters, Trinity and the alternative assemblers produced comparable, high-quality reference data for vertebrate transcriptomes/proteomes of varying complexity. However, Trinity required large computational resources and time. The researchers found that the alternative de novo assemblers, in particular SOAPdenovo-Trans but also Oases and Trans-ABySS, rapidly produced protein databases with far lower computational requirements. With several RNA assembly tools to choose from, proteomics researchers who are new to transcriptome assembly and who plan projects with high sample numbers can benefit from these alternative approaches to apply PIT efficiently.
Comparative evaluation of the performance of the RNA assemblers Trinity, Oases, SOAPdenovo-Trans (here abbreviated as SOAPdenovo) and Trans-ABySS for proteomics informed by transcriptomics (PIT).
A. Density distribution of the relative length of open reading frames (ORFs) predicted from RNA sequencing data using the four different transcriptome assemblers. ORF sequences were aligned to the UniProtKB reference proteome, and relative length was calculated as the ratio of query (ORF) length to target (UniProtKB reference sequence) length. B. Box plot of peptide candidates per precursor ion within a ±10 ppm mass tolerance, as an estimate of the search space in peptide spectrum matching. C. Bar plot showing tandem mass spectra (left bar) and distinct peptide sequences (right bar) identified by MaxQuant analysis at 1% FDR, using the differently generated protein sequence databases in target-decoy searches of publicly available discovery proteomics datasets. Identifications were compared to those obtained from searching the UniProtKB reference proteome; proportions indicate shared (black, coinciding), novel (grey) and differential (red, conflicting) results. Additionally, red numbers indicate the percentage of conflicting versus total identifications. D. Density distribution of the relative length of the subset of ORFs depicted in A for which protein-level evidence (C) was found.
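The two per-item quantities behind panels A and B can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the use of a pre-sorted list of peptide masses, and the example values are all assumptions made for illustration. The ppm window is computed here as precursor mass × tolerance × 10⁻⁶, which is the usual definition of a parts-per-million tolerance.

```python
from bisect import bisect_left, bisect_right


def relative_orf_length(orf_length, reference_length):
    """Panel A quantity: ratio of query (ORF) length to target (reference) length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return orf_length / reference_length


def candidates_within_ppm(sorted_peptide_masses, precursor_mass, tol_ppm=10.0):
    """Panel B quantity: number of database peptides whose mass lies within
    +/- tol_ppm of the precursor mass.

    sorted_peptide_masses must be sorted in ascending order (an assumption of
    this sketch) so that the window can be located by binary search.
    """
    delta = precursor_mass * tol_ppm * 1e-6  # absolute half-width of the window
    lo = bisect_left(sorted_peptide_masses, precursor_mass - delta)
    hi = bisect_right(sorted_peptide_masses, precursor_mass + delta)
    return hi - lo


# Hypothetical example values, not taken from the study.
example_masses = sorted([1000.000, 1000.005, 1000.020, 1200.000])
example_count = candidates_within_ppm(example_masses, 1000.0, tol_ppm=10.0)
example_ratio = relative_orf_length(150, 300)
```

A relative length near 1 suggests the predicted ORF covers the full reference sequence, while a larger candidate count per precursor indicates a larger search space for that database.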