HIGH-THROUGHPUT METHODS FOR IN SILICO DISCOVERY OF PEPTIDES, PROTEINS, AND POST-TRANSLATIONAL MODIFICATIONS IN PROTEOMICS
Series: Final Public Oral Examinations
Location: Elgin Room (E-Quad A224)
Date/Time: Monday, September 24, 2012, 4:30 p.m. - 6:00 p.m.
The field of proteomics seeks to address a grand problem in biology: the large-scale determination of gene and cellular function in an organism, analyzed directly at the protein level. Over the last decade, liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has emerged as a prominent tool within the field because it supports high-throughput and high-sensitivity experimental designs. The output from LC-MS/MS systems often includes thousands of MS/MS spectra, each of which is a complex piece of data that must be analyzed to extract relevant information about the proteins contained in a cellular sample. These data sets are often noisy and therefore require sophisticated and robust tools that can process the information efficiently. This thesis presents several mathematical models and algorithms that address three major areas of open research problems in proteomics: (1) post-translational modification (PTM) identification at the peptide level, (2) unmodified and modified protein identification, and (3) determination of optimal biomarker combinations.
When conducting an LC-MS/MS experiment, the prime objective is the identification of a complete list of sample proteins along with all identified PTMs. This is a major challenge due to the vast increase in computational complexity that results from introducing over 900 modifications to a typical universe of 20 amino acids. Two novel algorithms were developed based on integer linear optimization for (1) the identification of a comprehensive list of all proteins and (2) the untargeted identification of all modifications along a template peptide sequence. Existing peptide identification algorithms are first used to determine all unmodified peptides, which are then input to the protein identification algorithm to determine the list of all sample proteins. An untargeted search for all modified amino acid sites within the protein list is then performed using a universal set of all PTMs. When benchmarked against existing state-of-the-art methods, these algorithms demonstrate superior accuracy on both small and large-scale data sets. The complete suite of algorithms was fully integrated into a webtool that was made freely available to the scientific community (http://pumpd.princeton.edu).
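As a rough illustration of the protein identification step, the sketch below treats it as a minimum set-cover problem: choose the fewest proteins such that every identified peptide maps to at least one chosen protein. This is only an assumed, simplified stand-in. The abstract does not give the thesis's actual integer linear formulation, the function name and toy data are hypothetical, and exhaustive enumeration replaces an optimization solver, so it is practical only for very small inputs.

```python
from itertools import combinations


def minimal_protein_set(peptide_to_proteins):
    """Return a smallest set of proteins that explains every observed peptide.

    peptide_to_proteins maps each identified peptide to the set of candidate
    proteins that contain it. Candidate sets are tried in order of increasing
    size, so the first covering set found has minimum size.
    """
    proteins = sorted({p for ps in peptide_to_proteins.values() for p in ps})
    for size in range(1, len(proteins) + 1):
        for subset in combinations(proteins, size):
            chosen = set(subset)
            # A subset is feasible if each peptide has a chosen parent protein.
            if all(chosen & set(ps) for ps in peptide_to_proteins.values()):
                return chosen
    return set()  # no peptides, or some peptide has no candidate protein


# Hypothetical toy input: peptide -> candidate parent proteins.
observed = {
    "PEPTIDEA": {"P1", "P2"},
    "PEPTIDEB": {"P2"},
    "PEPTIDEC": {"P2", "P3"},
    "PEPTIDED": {"P4"},
}
result = minimal_protein_set(observed)
print(sorted(result))
```

In this toy input, P2 alone accounts for three of the four peptides and P4 is needed for the last one, so the smallest explanatory list has two proteins. In an integer linear formulation of the same idea, each protein would get a binary selection variable and each peptide a covering constraint.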
Using the above algorithms, gingival crevicular fluid (GCF) samples were analyzed to identify novel biomarker combinations of proteins that could effectively diagnose individuals who are either periodontally healthy (PH) or afflicted with chronic periodontitis (CP). Analysis of a training set of 12 PH and 12 CP samples identified 432 human and 30 bacterial proteins, 150 of which had not previously been identified in large-scale proteomics analyses. GCF samples were obtained from 72 additional subjects, and a mixed-integer optimization model was developed to identify the optimal combination of biomarkers for diagnosis of PH or CP individuals. A thorough cross-validation of the model's capability was performed on a training set of 55 samples, and greater than 99% accuracy was consistently achieved. The model was then tested on two blind test sets, and using an optimal combination of 7 human proteins and 3 bacterial proteins, the model correctly predicted 40 out of 41 PH and CP samples.
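The idea of selecting a biomarker combination can be sketched on toy data as follows. This is not the thesis's mixed-integer optimization model, whose details the abstract does not state. It is an assumed, simplified substitute that exhaustively searches small marker panels and classifies a sample as CP when at least a threshold number of panel markers are present. The function name, the presence/absence encoding, and the data are all hypothetical.

```python
from itertools import combinations


def best_marker_panel(samples, labels, max_size):
    """Exhaustively search marker panels of up to max_size markers.

    samples: list of 0/1 presence vectors, one per sample.
    labels:  1 for CP, 0 for PH (hypothetical encoding).
    A sample is predicted CP when at least `threshold` panel markers are present.
    Returns (number of correct predictions, panel indices, threshold).
    """
    n_markers = len(samples[0])
    best = (-1, None, None)
    for size in range(1, max_size + 1):
        for panel in combinations(range(n_markers), size):
            for threshold in range(1, size + 1):
                correct = sum(
                    (sum(s[j] for j in panel) >= threshold) == bool(y)
                    for s, y in zip(samples, labels)
                )
                # Strict comparison keeps the first (smallest) best panel.
                if correct > best[0]:
                    best = (correct, panel, threshold)
    return best


# Hypothetical toy data: 6 samples, 4 candidate markers.
samples = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
labels = [1, 1, 1, 0, 0, 0]
n_correct, panel, threshold = best_marker_panel(samples, labels, max_size=3)
print(n_correct, panel, threshold)
```

On this toy data no single marker or pair separates the two groups, but the three-marker panel with a threshold of two classifies all six samples correctly. An optimization model would encode the same selection with binary variables instead of enumerating every panel, and any such panel would still need cross-validation and blind testing as described above.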