Utilizing RNA-Seq data to improve proteomics

Shotgun proteomics uses a database search strategy that compares detected mass spectra with a library of theoretical spectra derived from reference genome information. The robustness of proteomics results therefore depends on the completeness and accuracy of the gene annotation in the reference genome. For animal models of disease whose genomic annotation is incomplete, such as non-human primates, proteogenomic methods can improve protein detection by using transcriptional data from RNA-Seq to refine the search databases used for peptide spectral matching. Customized search databases derived from RNA-Seq data make it possible to identify unannotated genetic and splice variants while restricting the comparisons to transcripts actively expressed in the tissue.
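The idea of restricting a search database to expressed transcripts can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function name, the identifiers, the sequences, and the abundance cutoff are all hypothetical, and a real workflow would also append variant and novel splice-junction sequences.

```python
def build_sample_specific_db(reference_proteins, transcript_abundance, min_abundance=1.0):
    """Keep only reference entries whose transcript is expressed in this sample.

    reference_proteins: dict mapping an entry ID to its protein sequence.
    transcript_abundance: dict mapping an entry ID to an RNA-Seq abundance value.
    min_abundance: hypothetical expression cutoff (an assumption, not from the study).
    """
    return {
        entry_id: sequence
        for entry_id, sequence in reference_proteins.items()
        if transcript_abundance.get(entry_id, 0.0) >= min_abundance
    }


# Toy data, invented for illustration only.
reference_proteins = {
    "GENE_A": "MKTAYIAK",
    "GENE_B": "MSLLTEVE",
    "GENE_C": "MGDVEKGK",
}
transcript_abundance = {"GENE_A": 12.4, "GENE_B": 0.2}  # GENE_C was not detected

sample_db = build_sample_specific_db(reference_proteins, transcript_abundance)
print(sorted(sample_db))
```

Only entries with RNA-Seq support at or above the cutoff survive, so the search engine has fewer candidate sequences to compare against each spectrum.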

Texas Biomedical Research Institute scientists collected RNA-Seq and proteomic data from 10 vervet monkey liver samples and used the RNA-Seq data to curate sample-specific search databases, which were searched with the program Morpheus. They compared these results against those from a search database generated from the reference vervet genome. A total of 284 previously unannotated splice junctions were predicted by the RNA-Seq data, 92 of which were confirmed by peptide spectral matches. More than half (53/92) of these unannotated splice variants had orthologs in other non-human primates, suggesting that the failure to match these peptides in the reference analyses likely arose from incomplete gene model information. The sample-specific databases also identified 101 unique peptides containing single amino acid substitutions that were missed by the reference database. Because the sample-specific searches were restricted to actively expressed transcripts, the search databases were smaller, more computationally efficient, and identified more peptides at the empirically derived 1% false discovery rate.
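An empirically derived false discovery rate is commonly estimated with a target-decoy search, in which matches to deliberately incorrect (decoy) sequences indicate how many target matches are likely wrong. The sketch below assumes that approach; the post does not describe how the study computed its threshold, and the function name and toy scores are hypothetical.

```python
def psms_passing_fdr(psms, max_fdr=0.01):
    """Count target PSMs accepted at the most permissive score cutoff
    where the estimated FDR (decoys / targets) is still <= max_fdr.

    psms: iterable of (score, is_decoy) tuples; higher scores are better.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the target count at the last cutoff that satisfies the FDR limit.
        if targets > 0 and decoys / targets <= max_fdr:
            accepted = targets
    return accepted


# Toy score list, invented for illustration only.
psms = (
    [(1000 - i, False) for i in range(150)]   # 150 high-scoring target matches
    + [(500, True)]                           # first decoy match
    + [(400 - i, False) for i in range(50)]   # 50 more target matches
    + [(300, True), (299, True)]              # two more decoy matches
    + [(200 - i, False) for i in range(10)]   # low-scoring target matches
)

print(psms_passing_fdr(psms))
```

Walking down the ranked list, the cutoff is extended for as long as the decoy-to-target ratio stays within 1%, which is why a database that yields fewer random matches can accept more PSMs at the same threshold.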

| Sample | RNA-Seq: RNA-Seq reads | RNA-Seq: % reads aligned | SSdb Entries: Genes | SSdb Entries: Novel SJs | Mass Spectra | PSMs: REFdb | PSMs: SSdb | Peptide IDs: REFdb | Peptide IDs: SSdb |
|---|---|---|---|---|---|---|---|---|---|
| 1030 | 7,040,525 | 55.5 | 13,804 | 4069 | 80,003 | 26,525 | 26,680 | 9765 | 9702 |
| 1211 | 6,585,341 | 68.8 | 15,782 | 7171 | 79,381 | 27,288 | 27,673 | 10,532 | 10,527 |
| 1238 | 6,594,936 | 67.1 | 15,659 | 6595 | 78,444 | 19,600 | 19,898 | 9349 | 9354 |
| 1245 | 6,730,432 | 64.0 | 13,901 | 4089 | 80,281 | 29,143 | 29,193 | 10,503 | 10,463 |
| 1248 | 10,504,974 | 69.4 | 15,513 | 7429 | 80,221 | 22,205 | 22,479 | 9120 | 9162 |
| 1254 | 9,127,588 | 62.5 | 15,936 | 7641 | 79,675 | 23,655 | 23,738 | 9334 | 9385 |
| 1291 | 6,575,182 | 67.9 | 13,354 | 3653 | 79,960 | 30,623 | 30,722 | 11,478 | 11,652 |
| 1347 | 6,637,842 | 56.6 | 13,284 | 3147 | 78,791 | 17,284 | 17,037 | 8633 | 8575 |
| 1448 | 8,019,158 | 65.0 | 15,668 | 6419 | 71,853 | 15,612 | 15,561 | 7582 | 7593 |
| 1467 | 9,983,615 | 66.2 | 16,176 | 7305 | 78,781 | 20,101 | 20,162 | 9177 | 9223 |

Descriptive statistics for the RNA-Seq and mass spectrometry analyses utilizing the Vervet reference search database (REFdb, 19,255 gene entries) and the sample-specific databases (SSdb)
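As a quick arithmetic check on the PSM columns of the table, the following Python sketch computes the per-sample difference between the SSdb and REFdb searches. The counts are copied from the table; the variable and function names are illustrative only.

```python
# (sample, REFdb PSMs, SSdb PSMs), copied from the table above.
psm_counts = [
    ("1030", 26525, 26680),
    ("1211", 27288, 27673),
    ("1238", 19600, 19898),
    ("1245", 29143, 29193),
    ("1248", 22205, 22479),
    ("1254", 23655, 23738),
    ("1291", 30623, 30722),
    ("1347", 17284, 17037),
    ("1448", 15612, 15561),
    ("1467", 20101, 20162),
]


def summarize(counts):
    """Return per-sample SSdb minus REFdb differences, the samples that gained PSMs, and the net change."""
    diffs = {sample: ssdb - refdb for sample, refdb, ssdb in counts}
    gained = [sample for sample, diff in diffs.items() if diff > 0]
    return diffs, gained, sum(diffs.values())


diffs, gained, net_change = summarize(psm_counts)
for sample, diff in diffs.items():
    print(f"{sample}: {diff:+d} PSMs with SSdb")
print(f"{len(gained)} of {len(diffs)} samples gained PSMs; net change {net_change:+d}")
```

The same pattern can be applied to the Peptide IDs columns by swapping in those counts.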

Proteogenomic approaches are ideally suited to facilitate the discovery and annotation of proteins in less widely studied animal models such as non-human primates. The researchers expect that these approaches will help to improve existing genome annotations of non-human primate species such as vervet.

Proffitt JM, Glenn J, Cesnik AJ, Jadhav A, Shortreed MR, Smith LM, Kavanagh K, Cox LA, Olivier M. (2017) Proteomics in non-human primates: utilizing RNA-Seq data to improve protein identification by mass spectrometry in vervet monkeys. BMC Genomics 18(1):877.
