Proteogenomics methods have identified many non-annotated protein-coding genes in the human genome. Many of these newly discovered genes encode peptides and small proteins, referred to collectively as microproteins. Microproteins are produced through ribosomal translation of small open reading frames (smORFs). The discovery of many smORFs reveals a blind spot for such genes in traditional gene-finding algorithms. Biological studies have found roles for microproteins in cell biology and physiology, and the possibility that additional bioactive microproteins exist drives interest in the detection and discovery of these molecules. A key step in any proteogenomics workflow is the assembly of RNA-Seq data into likely mRNA transcripts, which are then used to create a searchable protein database.
Researchers from the Salk Institute for Biological Studies demonstrate that specific features of the assembled transcriptome impact microprotein detection by shotgun proteomics. By tailoring transcript assembly for downstream mass spectrometry searching, they show that they can detect more than double the number of high-quality microprotein candidates. They also introduce a novel open-source mRNA assembler for proteogenomics (MAPS) that incorporates all of these features. By integrating their specialized assembler, MAPS, and a popular generalized assembler into their proteogenomics pipeline, the researchers detect 45 novel human microproteins from a high-quality proteogenomics dataset of a human cell line. They then characterize the features of the novel microproteins, identifying two classes of microproteins. This work highlights the importance of specialized transcriptome assembly upstream of proteomics validation when searching for short, potentially rare, and poorly conserved proteins.
Proteogenomics microprotein discovery pipeline
High-quality RNA-Seq data are assembled and three-frame translated to create an in silico custom proteomics database. The database is then used to search MS2 spectra to obtain peptide candidates. Short, unannotated peptides with high-quality MS2 spectra are manually curated to produce a list of confident novel microproteins. Since peptide discovery depends on the assembled transcriptome, we propose to optimize transcriptome assembly for peptide discovery.
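The three-frame translation step in the pipeline above can be illustrated with a minimal Python sketch. This is not the MAPS implementation. The function names (`translate`, `find_smorfs`), the ATG-start requirement, and the length cutoffs (`min_aa`, `max_aa`) are assumptions made for illustration only; the source does not specify them. The sketch translates an assembled transcript in its three forward frames using the standard genetic code and keeps short ORFs that begin with methionine and end at a stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1, or 2); '*' marks a stop."""
    residues = []
    for i in range(frame, len(seq) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) become 'X'.
        residues.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(residues)


def find_smorfs(transcript, min_aa=10, max_aa=150):
    """Return (frame, peptide) pairs for short ORFs in the three forward frames.

    Assumptions (hypothetical, not taken from the source): an ORF runs from
    the first ATG after the previous stop to the next in-frame stop codon,
    and min_aa / max_aa are illustrative length cutoffs.
    """
    seq = transcript.upper().replace("U", "T")
    hits = []
    for frame in range(3):
        segments = translate(seq, frame).split("*")
        # The final segment is not terminated by a stop codon, so skip it.
        for segment in segments[:-1]:
            start = segment.find("M")
            if start == -1:
                continue
            peptide = segment[start:]
            if min_aa <= len(peptide) <= max_aa:
                hits.append((frame, peptide))
    return hits


if __name__ == "__main__":
    # Toy transcript with one tiny ORF (ATG GCT TGG AAA TAA) in frame 2.
    toy = "CCATGGCTTGGAAATAAGG"
    for frame, peptide in find_smorfs(toy, min_aa=2):
        print(f"frame {frame}: {peptide}")
```

In a real workflow the resulting peptides would be written to a FASTA file and passed to a proteomics search engine to be matched against MS2 spectra. That step, and any filtering against annotated proteins, is omitted here.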
Availability: The full MAPS code, sample dataset, and instructions are available at http://www.bitbucket.org/shokhirev/MAPS