Friday, 25 May 2018

Proteomics Identifies Eye Disease Treatment with Existing Drugs

Protein Analysis Enables Treatment of Eye-Disease Symptoms with Existing Drugs

Demonstrating the potential of precision health, a team led by a researcher at the Stanford University School of Medicine has matched existing drugs to errant proteins expressed by patients with a rare eye disease.
“Analyzing fluid samples from the eye can totally change how we treat patients,” said Vinit Mahajan, MD, PhD, associate professor of ophthalmology.

The team employed proteomics, the large-scale study of proteins, in identifying four on-the-market drugs that successfully quelled symptoms triggered by several of the overabundant proteins. The findings, although not curative, demonstrate the potential of treating rare and complex eye conditions — and potentially other inflammatory diseases — by matching existing drugs with proteins that are abnormally expressed in the eyes of individual patients.

A paper describing the research was published online Dec. 21 in JCI Insight. Mahajan is the senior author. Gabriel Velez, an MD-PhD student at the University of Iowa and a visiting research scholar at Stanford, is the lead author.

“Patients with rare diseases often have few therapeutic options, and many conventional therapies fail them,” Velez said. “Proteomic profiling allows for clinicians to analyze a patient’s diseased tissue in real time and identify proteins that are targeted by already-approved drugs.”

The genetic eye disease is called neovascular inflammatory vitreoretinopathy, or NIV. It is a progressive, inflammatory condition that evades all current therapies. Patients with the disease eventually go blind. It is exceptionally rare and only affects a handful of families worldwide, including an Iowa family studied and treated by Mahajan when he was at the University of Iowa. Initial symptoms, which include internal inflammation of the eye, cataract development and vision changes, usually appear when individuals are in their 20s. By the time patients are in their 50s and 60s, their vision has usually deteriorated into blindness, Mahajan said. Sometimes, eyes become so atrophied and painful they need to be removed.

Although a team led by Mahajan identified the gene responsible for the disease in 2012, a cure has yet to be developed. In the past, clinicians treating NIV patients used a trial-and-error approach to treat each symptom as it appeared, Mahajan said.

“This constant uphill battle to save the vision of NIV patients made us determined to find the molecules active inside the eye that can lead us to better therapies,” Mahajan said.

Analyzing biopsies


Researchers performed liquid biopsies: A small amount of intraocular fluid was extracted from eight eyes of members of the same family at various stages of NIV and, as a control, four eyes with a noninflammatory eye disease. The researchers looked for 200 types of immune-signaling molecules called cytokines and identified 64 that were different, usually more abundant, in the eyes affected by NIV. In addition, they found that a protein called TNFalpha, which is usually abnormal in inflammatory conditions, was not elevated. Clinicians had previously assumed it was a culprit in NIV, yet infusions of an anti-TNFalpha drug called infliximab hadn’t alleviated symptoms for patients with the disease. Now they know why.

“That’s one of the important stories here,” Mahajan said. “There’s no target for the drug inside the eye, which totally explains why the drug didn’t work. One of the benefits of this precision health approach is to identify which drugs we shouldn’t use.”

The researchers worked through the symptoms of the disease, matching each one with abnormal levels of proteins from the biopsies, and then with existing drugs that could potentially provide relief.

One feature of NIV is the excessive formation of new blood vessels in the retina, which can lead to a blinding hemorrhage. Clinicians traditionally treat this type of hemorrhage with surgery. However, eye surgery prompts inflammation and scarring in NIV patients. The protein analyses revealed elevated levels of vascular endothelial growth factor, or VEGF, that correlated with moderate cases of the disease. The researchers turned to an anti-VEGF drug, bevacizumab, which had been approved to treat macular degeneration. Injections of bevacizumab were able to resolve hemorrhages in seven NIV eyes without surgery, the study said.

Drug reduces eye inflammation


Another symptom of NIV is cellular inflammation and leakage of proteins into the eye, which patients experience as cloudy vision, akin to looking through a fog, with occasional “floaters” or shadows, Mahajan explained. Through the protein analysis, researchers spotted several abnormally expressed molecules that drive these inflammatory symptoms. The molecules help determine the fate of developing immune cells known as T cells. After reviewing existing drugs, the researchers decided to try injections of methotrexate, which targets T cells in arthritis. It successfully reduced the number of inflammatory cells in the eye, the study said "One of the benefits of this precision health approach is to identify which drugs we shouldn’t use".

The protein analysis also revealed numerous inflammatory compounds that could be treated with steroids, so the research team decided to put steroid implants into three of the NIV eyes. The implants, which release steroids continuously over two years, reduced levels of 31 molecules related to inflammation and the eye’s abnormal immune response.

But the study participants eventually developed scarring around the implants, suggesting that some of the proteins contributing to inflammation weren’t effectively blocked by steroids. And even after the steroid implantations, these eyes still expressed elevated levels of the inflammation-related protein interleukin-6.

The researchers used this clue to solve another issue in NIV: Scar tissue grows inside the eye, causing the retina to detach from the back of the eye. Surgery is the only option for treating this complication, but in NIV the scarring is so extensive the retina detaches again after the surgery. The researchers gave one participant in the study, who had already lost the use of one eye, a drug called tocilizumab to block interleukin-6 after reattaching her retina. The surgery was effective, marking the first time a retina has remained attached in an NIV patient and suggesting the drug may be beneficial to others with scarring inside the eye, Mahajan said.

The researchers also used the protein analysis to supplement and confirm the classification of NIV into five stages of severity. Now, in addition to physical symptoms, they know which proteins are most active in each stage, Mahajan said. For future patients, a liquid biopsy from the eye could help provide more precise diagnoses

Wednesday, 23 May 2018

PROTEOMIC TESTING RECENTLY ADDED TO STANDARD TREATMENT GUIDELINES FOR NON-SMALL CELL LUNG CANCER

Proteomic Testing Recently Added to Standard Treatment Guidelines for Non-Small Cell Lung Cancer: What This and Other Recommended Tests Mean for Patients
If you have been searching for information about lung cancer treatment recently, it is likely that you discovered that doctors are performing genetic “biomarker” testing on tumor biopsies to test for changes in the genes EGFR and ALK. Patients with tumors that are positive for changes in EGFR or ALK (around 15-20% of patients with non-small cell lung cancer [NSCLC] in the United States) have been able to benefit from drugs that target the specific change.
Currently, the National Comprehensive Cancer Network Clinical Practice Guidelines in Oncology (NCCN Guidelines®) recommends that patients receive genetic testing for EGFR and ALK, which require a biopsy of the tumor tissue in order to test the tumor directly. However, 25 to 30% of lung cancer patients do not have sufficient biopsy material or cannot undergo procedures to get the biopsy material needed for these types of genetic tests. More recently, there are several blood-based tests where changes in either EGFR or ALK genes can be measured by isolating tumor-associated DNA from blood. These so called ‘liquid biopsies’ make genetic testing easier on patients since the test doesn’t require any surgery.  
Another breakthrough in blood-based testing is a proteomic test called VeriStrat®, which tests proteins instead of DNA, and has recently been added to the NCCN Guidelines.  This test can help determine whether patients entering the second line of treatment for advanced NSCLC could be candidates for the targeted drug erlotinib, which may have fewer side effects and greater convenience over standard chemotherapy. Scientists and the medical community are continuing to progress their understanding of the biology of cancer and are developing other tests to help guide treatment decisions in a personalized way.

Tests like VeriStrat and others will help the medical community find the right treatment for the right patient at the right time. These exciting advances are redefining patient care in lung cancer and in medicine generally.  Be sure to talk to your doctor about what tests might be right for you.



Proteomics: improving biomarker translation to modern medicine

Biomarkers are defined as 'measurable characteristics that reflect physiological, pharmacological, or disease processes' according to the European Medicines Agency The ideal platforms for biomarker discovery include genomic, transcriptomic, proteomic, metabonomic and imaging analyses. However, most biomarkers used in clinical studies are based on proteomic applications as the majority of current drug targets are proteins, such as G protein-coupled receptors, ion channels, enzymes and components of hormone signaling pathways Furthermore, linking the results of biomarker studies using protein-protein interaction approaches can assist in systems biology approaches and could lead to hypothesis generation and identification of new drug targets.
Proteomic-based approaches for biomarker investigation can be employed in different aspects of medicine, such as elucidation of pathways affected in disease, identification of individuals who are at a high risk of developing disease for prognosis and prediction of response, identification of individuals who are most likely to respond to specific therapeutic interventions, and prediction of which patients will develop specific side effects. In line with this, biomarkers can also be used for patient monitoring such as testing for 'normalization' of a biomarker signature in response to treatment or screening for re-appearance of a characteristic 'pathological' signature. All of this equates to improvement in patient care by using biomarkers in so-called personalized medicine approaches The progress and challenges in the translational application of proteomic technologies are highlighted in this new series, which features reviews written by leaders in the field on topics including post-translational modifications and protein-protein interactions in disease.
Currently, there are only a few molecular tests that can predict response to certain treatments and these are mainly restricted to the field of oncology. Perhaps the best example of this is human epidermal growth factor receptor 2 (HER2) expression in breast cancer cells. This cell surface receptor can be blocked by the antibody-based therapeutic Herceptin™ (trastuzumab). Such successes have raised hopes for discovery of biomarkers in other areas of medicine. However, in most cases, the claims for other novel biomarker candidates have not been proven in validation studies or in clinical trials. Potential reasons for this include deficiencies in design and analysis, the problem that drug targets and biomarkers may not be causal to the disease but rather a result of the disease process or a co-morbid effect, a lack of congruence of preclinical models with the human disease, or even because of factors such as the incorrect enrolment of patients in clinical trials who are too advanced in their disease stage to show any response to potential therapeutics. Nonetheless, a consensus has now been reached for testing biomarker candidates in the earliest stages of a disorder, as described recently for neurodegenerative conditions such as Alzheimer's disease.


The suggestion that biomarker research has not lived up to the initial hype comes from the fact that publicized multiple 'breakthrough' tests have still not reached the market. This has led to skepticism from clinicians, scientists and regulatory agencies, which might make the introduction of valid biomarkers into clinical diagnostics or the drug discovery industry even more difficult. This is due in part to the lack of a connection between biomarker discovery with technologies for validation and translation to platforms that provide accuracy and ease of use in a clinical setting. Apart from some biomarkers in the field of cancer research, most have not been validated and have now faded from the spotlight. Major cancer biomarkers that have received Food and Drug Administration (FDA) approval over the last few decades include prostate-specific antigen (PSA) for prostate cancer, carcinoembryonic antigen (CA)-125 for ovarian cancer and CA-19-9 for pancreatic cancer. However, apart from the possible exception of PSA, most of these have been used mainly for monitoring treatment response and are not suitable for early diagnosis.

It has been suggested that the best strategy for biomarker qualification is through their co-development with drugs. One of the best examples of this is the determination of the HER2 subtype of the epidermal growth factor receptor, combined with use of Herceptin™, as described above. In this case, patients who have high levels of HER2 are more likely to respond to Herceptin™ treatment. Thus, the use of scientifically and analytically validated biomarkers and rationally designed hypothesis-testing may lead to a paradigm shift in drug discovery and clinical trials. Researchers are now required to show that biomarkers are validated before they can be used in regulatory decision-making. According to the FDA, there are now three types of biomarkers based on proof of concept, validity and reproducibility. The last category, which requires accurate replication of the findings, is where most promising biomarker candidates have fallen short. Currently, only long-standing and well-established tests have been used for regulatory decision-making, such as fasting glucose tolerance or glucose clamping to monitor insulin sensitivity.
Three types of biomarkers for use in clinical studies
Biomarker type
Requirement
Exploratory biomarkers
Evidence for scientific proof of concept
Probable valid biomarkers
Measurement in an established analytical test system and evidence explaining the clinical significance of the results
Known valid biomarkers
Biomarker test results should be accurately replicated at different sites, laboratories or agencies in cross-validation experiments

It is clear that there is still a long way to go before the potential of proteomics can be entirely utilized in the preclinical and clinical fields, slowly progressing from bench to bedside and back again in an ongoing endeavor to improve patient outcomes. The tight regulations and concerted efforts outlined above will be essential in this journey. However, there is now optimism that further technological developments and interdisciplinary approaches will continue to advance the field of biomarkers so that its impact on modern medicine can be fully realized.

Tuesday, 22 May 2018

Simplifying Proteomics

PTMScan® Technology – Immunoaffinity enrichment for mass spectrometry-based proteomics


Mass spectrometry techniques can be used to profile post-translationally modified (PTM) peptides in a digested sample. However, PTM peptides are often present at very low abundance, which is a challenge when trying to study them using mass spectrometry techniques.
To address this challenge Cell Signaling Technology (CST™) has developed PTMScan® technology, a proprietary proteomic method that combines antibody enrichment of PTM-containing peptides with liquid chromatography mass spectrometry (LC-MS/MS). PTMScan technology allows identification and quantification of hundreds to thousands of even the lowest abundance peptides, and provides a more focused approach to peptide enrichment than other strategies (for example IMAC).
PTMScan technology can be used to:
  • Determine novel protein sites that are phosphorylated, ubiquitinated, acylated, methylated, or protease cleaved.
  • Identify and validate drug targets.
  • Discover biomarkers.
  • Explore the mechanism of action of drugs/chemical modulators.
  • Elucidate off-target drug effects

Sunday, 20 May 2018

Common proteomic technologies, applications, and their limitations.

Technology
Application
Strengths
Limitations
2DE
Protein separation
Relative quantitative
Poor separation of acidic, basis, hydrophobic and low abundant proteins.
Quantitative expression profiling
PTM information.

DIGE
Relative quantitative
Protein separation
PTM information
Proteins without lysine cannot be labelled Requires special equipment for visualization and fluorophores are very expensive
Quantitative expression profiling
High sensitivity
Reduction of intergel variability

ICAT
Chemical isotope labelling for quantitative proteomics
Sensitive and reproducible
Proteins without cysteine residues and acidic proteins are not detected
Detect peptides with low expression levels.

SILAC
Direct isotope labelling of cells
Degree of labelling is very high
SILAC labelling of tissue samples is not possible
Differential expression pattern
Quantitation is straightforward

iTRAQ
Isobaric tagging of peptides
Multiplex several samples
Increases sample complexity
Relative quantification High-throughput
Require fractionation of peptides before MS.

MUDPIT
Identification of protein-protein interactions
High separation
Not quantitative
Large protein complexes identification
Difficulty in analysing the huge data set
Deconvolve complex sets of proteins
Difficult to identify isoforms

Protein array
Quantitate specific proteins used in diagnostics (biomarkers or antibody detection) and discovery research
High-throughput
Limited protein production
Highly sensitive
Poor expression methods
Low sample consumption
Availability of the antibodies
Accessing very large numbers of affinity reagents.

Mass spectrometry
Primary tool for protein identification and characterization
High sensitivity and specificity. High-throughput. Qualitative and quantitative
No individual method to identify all proteins. Not sensitive enough to identify minor or weak spots. MALDI and ESI do not favour identification of hydrophobic peptides and basic peptides
PTM information

Bioinformatics
Analysis of qualitative and quantitative proteomic data
Functional analysis, data mining, and knowledge discovery from mass spectrometric data
No integrated pipeline for processing and analysis of complex data. Search engines do not yield identical results




Friday, 18 May 2018

Cardiovascular proteomics in the era of big data: experimental and computational advances

Abstract

Proteomics plays an increasingly important role in our quest to understand cardiovascular biology. Fueled by analytical and computational advances in the past decade, proteomics applications can now go beyond merely inventorying protein species, and address sophisticated questions on cardiac physiology. The advent of massive mass spectrometry datasets has in turn led to increasing intersection between proteomics and big data science. Here we review new frontiers in technological developments and their applications to cardiovascular medicine. The impact of big data science on cardiovascular proteomics investigations and translation to medicine is highlighted.

Keywords

Cardiovascular medicineClinical proteomicsShotgun proteomicsMass spectrometry

Background

The heart is in many ways an exceptional organ. Proteins at the sarcolemma, sarcomere, mitochondrion, and other cardiac organelles must orchestrate vital functions seamlessly on a beat-by-beat basis, while dynamically adjusting energetic and contractile outputs to environmental cues within seconds. Heart diseases including cardiac hypertrophy and failure are characterized by complex remodeling of various protein signaling networks and subcellular components, which often involve a multitude of collaborating proteins. Therefore, understanding how multiple protein species interact to carry out higher physiological phenotypes and regulation has been an important objective of cardiovascular research. The power of proteomics to simultaneously provide information on the panoply of expressed proteins has made it uniquely suitable for resolving complex signaling conundrums and revealing disease mechanisms in the heart.
Advances in genome sequencing are often celebrated to have outpaced even the vaunted Moore’s law of computing power [1]. Lesser known but equally impressive is the parallel surge in the capacity of mass spectrometry-based proteomics last decade. To wit, when the first draft of the human genome was published in 2001, a state-of-the-art two-dimensional electrophoresis technique could identify ~200 proteins in 3 days. Fast-forward to today, a modern mass spectrometer can generate over a million spectra per day and quantify ~4000 proteins in the course of 1 h [2]. This quantum leap is attributable to parallel advances in three areas: (1) analytical chemistry in sample processing and liquid chromatography tandem mass spectrometry (LC–MS/MS) instrumentation; (2) bioinformatics and computational tools in high-throughput data processing and analysis; and (3) completeness and accuracy of sequence and annotation databases. Riding on growing experimental capacity, there have been continued improvements to the experimental coverage of proteome analysis, the interpretability of data, the reliability of results, and the diversity of protein parameters that may be interrogated. New applications and experimental designs not possible a few years ago are now being exploited to explore new regulatory modalities in cardiac physiology.
Cardiovascular proteomics has grown rapidly in the intervening period, with >400 studies now being published yearly (Fig. 1) [3]. To put into context, we recall two landmark reviews of cardiovascular proteomics, in 2001 [4] and 2006 [5], which noted that although many enabling technologies were emerging, cardiovascular proteomics remained a field ‘on the threshold’ of future applications. Fast forward to the present and it is clear that proteomics has had a transformative impact on cardiovascular sciences, as recounted in recent review articles. We attempt to complement these reviews here with a concise overview on the lockstep improvements in the analytical (separation sciences and mass spectrometry) and computational (data science and algorithms) advances of the past 5 years that enabled landmark studies, as well as ongoing developments driving the next stage of applications.
Figure 1
Fig. 1
Trends in cardiovascular proteomics. Both (a) the volume of proteomics studies, and (b) the size of proteomics dataset have skyrocketed in the last decade. a The number of cardiovascular proteomics studies has increased approximately 400 % from 2004 to 2014, far outpacing the natural growth of the cardiovascular field, indicating increasingly common adoption of the technologies. bThe protein coverage of proteomics experiments in the same time period has experienced considerable growth also, quantified as the numbers of identifiable cardiac proteins in an experiment. The maximum number of cardiac proteins (dashed lines) is based on estimated significantly expressed loci in the mouse heart and does not take into account proteoforms such as resulting from alternative splicing. This increase is driven by parallel advances in hardware instrumentation and computational technology. Coinciding with the notion of “complete proteomics”, proteomics studies can now interrogate more proteins of interest such as chromatin remodeling factors and transcription factors that express at low copy numbers. Effective means to analyze big proteomics dataset are becoming a new frontier of growth in cardiovascular proteomics

Experimental and analytical advances

Improvements in analytical methods

An early hurdle that bedeviled cardiovascular proteomics was the limitation in the sensitivity and dynamic range of protein detection, which skewed results towards few high-abundance proteins (e.g., contractile proteins) and masked low-abundance species. This is due to the complexity of proteomes. The mammalian heart is known to express at least ~8000 genes at significant levels [6], and at least 8325 human proteins have been referenced in the ~1.4 million cardiac-related publications on PubMed [7]. Each human gene can encode multiple proteoforms, e.g., due to the average ~4 alternative splicing isoforms per human gene plus many more post-transcriptional and post-translational editing processes, resulting in at least ~106 proteolytic peptides per sample. With the addition of post-translational modifications (PTMs)—e.g., the four histone proteins alone have identified PTMs on at least 105 different residues in myriad combinations [8]—the total proteome complexity is likely orders of magnitude more complex still.
In the past decade, great strides have been made to improve proteome coverage, from how protein samples are extracted to mass spectrometry instrumentation. To perform proteomics analysis, it follows that the proteins must be effectively extracted and released from biological samples. This is typically achieved via mechanical homogenization or chemical surfactants. Protein solubilization techniques in early proteomics protocols were at times ineffectual in extracting hydrophobic or membrane proteins, which often aggregated out of the sample and led to their non-detection. Development in this area in recent years have led to commercially available, mass spectrometry compatible surfactants [9], size exclusion filter-mediated buffer exchange techniques [10], as well as empirically refined experimental protocols that are optimized for the analysis of various cardiac subproteomes [11]; all of which serve to expand the portion of the proteome that is open to mass spectrometry exploration. The incompleteness of proteolytic digestion was once found to be a limiting factor of the maximal peptide coverage of the experiment and contribute to batch-to-batch variations. The use of optimized proteolysis protocols including double lys-C/trypsin proteolysis is gaining traction [1].
Advances in separation sciences have had a particularly tremendous impact on reducing the complexity of peptides prior to mass spectrometry signal acquisition. High-performance shotgun proteomics using mass spectrometry has supplanted two-dimensional (2D) electrophoresis to become the de facto standard for large-scale analysis of cardiac proteins (see general workflow of shotgun proteomics in Fig. 2). Whereas the now-dethroned 2D electrophoresis was limited to detecting a few hundred proteins, contemporary LC–MS experiments can resolve peptides from >10,000 proteins to allow their identification and quantification. Since 2001, separation science has led in a relentless pursuit to increase protein coverage [1213], with the success of strong cation exchange-reversed phase based MudPIT approaches followed successively by other 2D-LC approaches including reversed-phase-reversed-phase separation [14] as well as very-long separation columns with high peak capacity, nano-scale microfluidic devices driven by ultra-high pressure LC systems [15] and capillary electrophoresis separation (see [1617] for reviews on separation science developments). By separating the peptide samples into smaller subset based on their chemistry, a simpler mixture of peptides is introduced into the mass spectrometer in any given time, which decreases ion competition and increases sensitivity.
Figure 2
Fig. 2
Analytical and computational overview in protein identification. 1 Cardiac samples are processed to extract the proteomes or subproteomes of interest, which may then be proteolyzed to obtain peptide digests. 2 The resulting peptides are desalted and subjected to LC–MS/MS analysis to acquire MS1 and MS2 spectra. 3 The peptide sequences that are present in the MS dataset can be identified using a database search approach, which uses a sequence database (e.g., UniProt) to generate theoretical peptide sequence and predict their fragmentation patterns in silico, then automatically find the best-match theoretical spectra to the experimental spectra for protein identification. Alternatively, the proteins can be identified using a spectral library search. The resulting protein datasets can be further analyzed to extract other biomedically meaningful information (see Fig. 4)
State-of-the-art instruments including hybrid Orbitraps and time-of-flight instruments achieve high performance by virtue of their high scan speed (allowing more peptides to be analyzed in the same analysis), sensitivity (allowing minute amounts of samples to be analyzed), and mass resolution (increasing power to differentiate similar peptide species). Recent proteome profiling experiments of the mammalian heart using the latest and greatest LC–MS combinations routinely achieve 5000 or more proteins identified in an experiment (Fig. 1b). In a recent survey we quantified the relative abundance of 8064 proteins in the mouse heart, covering more than 10 major organelles and 201 major cellular pathways [18]. As little as micrograms of proteins are sufficient for shotgun proteomics analysis. This amount may come from milligrams or less of cardiac biopsies, or ~105 cultured cardiac cells, opening the door of proteome analysis to more experimental and clinical designs where sample amounts may be limiting.
Taken together, these advances have helped solve a principal challenge to proteomics applications, namely how to successfully detect the maximal number of peptides inside the overwhelmingly complex mixture that is the cardiac proteome. Although each technological development is incremental, over time they accrued into a qualitative transformation on the power and utility of proteomics, when proteins of biomedical interest gradually became measurable and discoverable in large-scale experiments. Catalogs of so-called “complete proteomes” (i.e., here narrowly defined as the detection of one protein product of every expressed locus in the genome) have now been described for many human tissues and organs, including the heart [1920]. Therefore, although earlier cardiovascular proteomics studies were best equipped to discover changes in structural, contractile, or metabolic housekeeping proteins, contemporary studies can now easily interrogate regulatory proteins including membrane receptors, kinases, ubiquitin ligases, and chromatin remodeling factors, whereas the analyses of yet scarcer species such as transcription factors and cytokines are now on the cusp of routine applications.
Early applications were also plagued by the variability of quantification results, which limited power to discover significant changes between disease model and control samples. A major source of variability in proteomics quantification originated from the variable detectability of peptides with different amino acid compositions. Two equimolar peptide sequences can show rather different intensities in MS signals. Accurate prediction of peptide intensity based on sequence information remains an unsolved issue in computational proteomics due to the large number of combinatorial variables that contribute to signal behaviors. Several methods have been developed to normalize peptide intensity and achieve accurate quantification. Targeted MS methods, such as Multiple Reaction Monitoring (MRM) [21], allow users to program the mass spectrometer to scan for only targeted peptide ions for quantification. An advantage of targeted MS is the gain in reproducibility and sensitivity, which can avail the detection of low-abundance proteins at their native concentration. Targeted assays have been successfully developed such that very low amount of proteins in the sample can be quantified, as in the case of troponin I [22]. Isotope labeling methods including SILAC and Super-SILAC [2324] can also reduce variability in relative quantification by ensuring peptides from multiple samples are compared in identical experimental conditions, but require additional labeling steps.
With advances in data acquisition methods, non-targeted label-free techniques can also reliably deduce accurate protein intensity from shotgun experiments directly through bioinformatics analysis. Label-free quantification is analogous to deducing transcript abundance from read counts in next-generation sequencing. Existing approaches largely fall into two categories (Fig. 3). Spectral counting exploits the bias of shotgun proteomics towards abundant proteins, and calculates protein quantity from the stochastic sampling frequency of peptide ions, i.e., the higher the protein abundance, the more of its MS spectra are likely to be identified. A major advantage of spectral counting is that it quantifies directly from the identification output and thus is compatible with most workflows. Spectral counting algorithms tally the number of redundant spectra for each identifiable peptide, then sum the numbers of spectra for all peptides assigned to a protein. On the other hand, ion intensity approaches integrate the intensity mass-specific ion signals over time in the chromatographic space. This utilizes a feature detection step in data analysis to read raw MS files and integrate the corresponding areas-under-curve of each peptide ion over time. Both labeled and label-free methods provide a useful guide to differential protein expression, and can now be used to discover candidate disease protein that can then be validated by further studies.
Figure 3
Fig. 3
Common label-free quantification in proteomics studies. Two common label-free quantification approaches in use are based on spectral count (top) and ion intensity (bottom). Top: spectral counting methods leverage the fact that in stochastic shotgun profiling, the frequency of a protein being sampled by the instrument scales with its relative abundance in the sample. The numbers of spectra matched to an identical protein in healthy (green) versus diseased (red) samples can therefore be compared if appropriate normalization and bioinformatics workflows are implemented. Bottom: ion intensity methods integrate the total signal intensity of peptide ion signals in the mass spectrometer to infer protein quantity. Software is now available to automatically identify and quantify peptide ion signals from mass spectra

Improvements in software tools

The massive amount of MS data generated in proteomics experiments requires computational aid for effective data processing and analysis. A growing number of open-access computational tools concerning all steps of proteomics data analysis are now freely available to users, a subset of which are listed in Table 1.
Table 1
Selected open access software tools in proteomics
Open access tools
Language/framework
License
Publication
Website
Database search engine (untargeted proteomics)
Comet*,†
C++
Apache 2.0
[93]
[94]
MS-GF+*,†
Java
Custom/Academic
[31]
[95]
MSAmanda
C#/Mono
Custom/Academic
[96]
[97]
ProLuCID*,†
Java
Custom/Academic
[26]
[98]
X!Tandem*,†
C++
OSI Artistic
[99]
[100]
Targeted proteomics and/or data-independent acquisition
Skyline*
C#
Apache 2.0
[101]
[102]
OpenSWATH*,†
C++
BSD 3-Clause
[103]
[104]
Protein inference and/or search post-processing
Percolator*,†
C++
Apache 2.0
[34]
[105]
ProteinProphet*,†
C++
GNU LGPLv2
[106]
[107]
ProteinInferencer
Java
Custom/Academic
[35]
[98]
Protein quantification
MaxQuant
.NET
Custom/Academic
[108]
[109]
Census
Java
Custom/Academic
[110]
[98]
PLGEM*,†
R
GNU GPLv2
[111]
[112]
QPROT*,†
C
GNU GPLv3
[37]
[113]
Pipelines and toolkits
Perseus
.NET
Custom/Academic
[114]
[115]
Crux*
C++
Apache 2.0
[116]
[105]
OpenMS*
C++
BSD 3-Clause
[117]
[118]
TPP*
C++
GNU LGPLv2
[119]
[120]
Data access and reuse
PeptideShaker*
Java
Apache 2.0
[121]
[122]
PRIDE inspector*
Java
Apache 2.0
[123]
[124]
Proteomics software tools that provide open access to users. Many of these tools are also open source which potentially allows users to participate in the continual development of the tools
* Available open-source source code repository at the time of writing
Platform-independent (Windows, Linux, Mac)
A major computational task in shotgun proteomics is to efficiently interpret the mass and intensity information within mass spectral data to identify proteins. The computational task can be formulated thus: given a particular tandem mass spectrum, identify the peptide sequences most likely to have given rise to the set of observed parent molecular mass and fragment ion patterns in a reasonable time frame. A general solution to this problem is “database search”, which involves generating theoretical spectra based on in silico fragmentation of peptide sequences contained in a protein database, and then systematically comparing the experimental MS spectra against the theoretical spectra to find the best peptide-spectrum matches. The SEQUEST algorithm was the first proposed to solve this peptide spectrum matching problem in 1994 [25] and its variants (e.g., Comet, ProLuCID [26]) remain among the most widely utilized algorithms to-date for peptide identification. SEQUEST-style algorithms score peptide-spectrum matches in two steps, with the first step calculating a rough preliminary score which empirically restricts the number of sequences being analyzed, and the second step deriving a cross-correlation score to select the best peptide-spectrum match among the candidates. Recent descendants of the SEQUEST algorithms have focused on optimizing its searching speed as well as improving the statistical rigor of candidate sequence scoring, with some programs reporting ~30 % more peptides/proteins identifiable from identical MS datasets and better definition of true-/false-positive identifications [26272829]. Other search engines also exist that are commonly in use, including X!Tandem, which calculates the dot product between experimental and theoretical spectra, then derives the expectation value of the score being achieved in a random sequence match; MaxQuant/Andromeda, which considers fragment ion intensities and utilizes a probabilistic model for fragment observations [30], MS-GF+ [31], and others. Methods have also been developed to combine the unique strengths and biases of multiple search engines to improve total protein identifications [32].
Means to distinguish true and false positives are critical to all large-scale approaches. The “two-peptide rule” was once commonly adopted to decrease false positives at the protein level by requiring each protein to be identified by at least two independent peptides. However, this rather conservative rule could inflate false negatives, as some short or protease-incompatible proteins may only produce maximally one identifiable peptide. More recent conventions involve foregoing the two-peptide rule and instead estimating the false discovery rate (FDR) of identification through statistical models, often with the aid of decoy databases. The use of decoy databases/sequences (reversed or scrambled peptide sequences), allows a quick estimation of the number of false positive proteins, by assuming identical distribution in protein identification scores for false positive hits and the decoy hits. A maximum acceptable FDR can then be specified (conventionally 1–5 %) to determine which protein identifications are accepted in the final result. To explicitly reveal the posterior probability of any particular identification being correct (also called the local FDR), a mixture model has been used that assumes that the peptide identification result is a mixture of correct and incorrect peptides with two distinct Poisson distributions of identification scores [33]. Auxiliary determinants including the presence of other identified peptides from the same proteins can also be applied to infer overall likelihood of protein assignment [33]. Machine learning algorithms (e.g., Percolator) have been demonstrated to build classifiers that automatically distribute peptide spectrum matches into true and false positives [34]. New inference approaches have also been demonstrated that consider peptide and protein level information together to improve the confidence of identification [35].
With the increase in data size and multiplexity (number of sample compared) in proteomics experiments, statistical approaches to analyze data have also evolved to tackle high-dimensionality data. Whereas early studies utilized mostly confirmatory statistics, modern proteomics datasets typically contain thousands of features (e.g., protein expression) over a handful of observations, hence simply testing whether each protein is significantly altered across experimental conditions can result in under-analysis and failure to distinguish latent structures across multiple dimensions, e.g., whether there exists a subproteome of co-regulated proteins across multiple treatment categories. To gain biological insights, quantitative proteomics datasets are now routinely mined using statistical learning strategies that comprise feature selection (e.g., penalized regression methods), dimensionality reduction (e.g., principal component analysis), and both supervised and unsupervised learning (e.g., support vector machine and hierarchical models) to discern significant protein signatures, disease-implicated pathways, or interconnected co-expression networks (Fig. 4).
Figure 4
Fig. 4
Proteomics data mining and functional annotations. Common computation approaches to extract information from massive proteomics datasets include (1) unsupervised cluster analysis, class discovery and visualization; (2) motif analysis and annotation term enrichment; (3) statistical learning methods for disease signature extraction; (4) network analysis; and (5) annotations with other functional information including protein motifs and cardiac disease relevance
Improvements to computational methods that allow more robust results from label-free quantification are an area of active research. For example, recent works (e.g., QProt) have attempted to resolve the respective quantities of multiple proteins that share common peptide sequences in spectral counting, either using weighted average methods or more statistically motivated models [3637]. In ion intensity approaches, chromatographic features that correspond to peptide signals over mass- and retention time-dimensions are identified using image analysis or signal processing algorithms. Because LC gradients are seldom perfectly reproducible, nonlinear distortions in retention time may occur. To ensure identical ions are compared between experiments, automatic chromatographic alignment and clustering methods are used. Some software can identify small chromatographic features based on accurate mass/retention time alone, such that some peptides may be quantified even in experiments where they were not explicitly identified. These processes tend to become computationally expensive for large experimental files [38], thus faster solutions are continuously developed.
With the proliferation of inter-compatible tools, an ongoing trend is to daisy-chain individual tools into user-friendly pipelines that provide complete solutions to a set of related data analysis problems. An ideal proteomics pipeline may combine identification, quantification, and validation tools in a modular organization accessible from a single location. Computation may be performed on the cloud to avoid the need to repeatedly copy, transfer, and store large files. Researchers can carry out computational tasks remotely from the browser on any computer system, obviating the need for redundant infrastructure investments. Currently, the Trans-Proteomics Pipeline [39] and the Integrated Proteomics Pipeline [40] are two example “end-to-end” pipelines that connect raw MS proteomics data to analysis output, whereas comprehensive, open-access pipelines have also been demonstrated in other omics fields, including Galaxy for genomics/transcriptomics [41] and MetaboAnalyst for metabolomics [42]. In parallel, tools are also being federated into interoperable networks through open frameworks. A modular and open-source software development paradigm, where individual software functionalities can interoperate via common interfaces and standards, helps ensure that new software can dovetail with existing ones with ease, and that software development may continue following inactivity from the original research team. Examples of such frameworks include the GalaxyP proteomics extension [43], and the proteomics packages within the R/BioConductor framework [44].

Improvements in annotation resources

Not unlike other omics approaches, the success of proteomics experiments relies heavily on having adequate and up-to-date resources to analyze large-scale data. To wit, for protein database search to succeed, it follows that the protein being identified must first be documented on a sequence database. Fortunately, there have been tremendous advances on the completeness (number of true positive sequences recorded) and precision (removal of redundancy or artifacts) of sequence databases such as UniProt and RefSeq. Some databases are manually curated to contain precise information, whilst others strive to be more comprehensive; but many now list precise “complete proteomes” for commonly studied laboratory species. Databases for human and popular laboratory model organisms have seen particular progress in the completeness of annotation in recent years, such that proteomics studies can now be performed similarly well in mice, drosophila, rats, and other organisms to interrogate cardiac physiology. On the horizon, one can foresee an influx of sequence information on protein polymorphism and alternative splicing isoforms. Although current databases primarily originate from genomic translation or cDNA library of specific cell types or genetic backgrounds, “proteogenomics” efforts are underway to translate additional sequences for proteomics studies, which will expand the scope and precision of protein identification. For non-model organisms, alternative search methods have been devised, such as against a custom database generated by manual six-frame translation of genomic sequences.
Data interpretability problems arise when the complex results comprising changes of many proteins could not be easily digested and summarized in terms that are relatable and of value to biomedical and clinical researchers. The biological significance of the implicated protein targets may be interpreted and connected to the growing corpus of biomedical knowledge through functional annotations. Commonly used annotation resources include Gene Ontology for biological functions [45], Reactome or KEGG for curated biochemical and signaling pathways [46], PFAM for protein motifs and homology [47], PhosphoSitePlus for PTMs [48], and so on.
To map molecular data to curated annotations, a class discovery approach is commonly utilized, which looks for annotated properties that are preferentially shared amongst a subset of proteins with interesting quantitative features over the proteome-wide background, e.g., through sequence motif analysis or annotation term enrichment. The principle behind such analyses is to infer biologically relevant processes based on significant overlaps between data features and the data annotation classes. Computational and statistical approaches are used to determine whether particular annotations are over-represented in a particular subset results than would be expected by chance. This allows both bias-free discovery and specific questions to be asked of a dataset, e.g., whether the down-regulated proteins in a heart failure patient are preferentially involved in fatty acid metabolism, etc. Enrichment analyses can be easily carried out using online tools that perform binomial or hypergeometric tests on the over-representation of functional annotations, including NCBI DAVID, WebGestalt, and Pantherdb [495051]. For users conversant in statistical programming and data analysis environment, open-access packages dedicated to proteomics data operation have been made available in languages such as R and Python, including the RforProteomics package [52] for MS data visualization on the R/Bioconductor repository, and the Python Pyteomics library [53] for parsing and processing MS data. These packages allow users to connect upstream MS analysis to downstream functional annotation services that are commonly employed.
We note that many proteomics functional analysis strategies were originally developed for microarray datasets. Although the analytical goals between proteomics and microarray experiments often overlap, i.e., identify functional associations from numerical molecular expression data, several methodological differences merit considerations. The stochastic nature of shotgun proteomics can lead to missing values and high variability in the data for low-abundance proteins. To address this, several proteomics-focused enrichment analysis tools have recently been developed to address specific features of proteomics datasets, e.g., using weighted sampling, which should account for relative abundance and variability of observations and further improve quantification performance. Monte Carlo sampling approaches have also been used to counter the bias of annotations on high abundance proteins when calculating enrichment significance [54].
As in the case for protein identification, the success of functional analysis is contingent upon the completeness and accuracy of annotations in knowledgebases. Commonly used knowledgebases such as Uniprot and Reactome have steadily improved in size and richness of functional information. Nevertheless, it sometimes remains the case that some classes of annotations are more complete, either because they are easily computable from sequence information (e.g., do these phosphoproteins with increased phosphorylation in heart failure share putative kinase domains?) or can be derived from popular experimental designs (e.g., whether a particular cardiac protein localizes to the mitochondrion?), and as a result are more likely to turn up in functional enrichment analyses. On the other hand, annotations on higher-level pathophysiology and tissue-specific regulations are more challenging, because a majority of such information is hidden in unstructured free text in the literature. With >2 million cardiovascular related articles alone on PubMed [55], however, the volume of literature data being published dwarfs the capacity of human expert biocurations to translate them. Hence in recent years many approaches are being pursued to improve biocuration, including crowdsourcing initiatives to leverage public contributors through web-based micro-tasks, as well as text-mining algorithms that comb through the literature and automatically convert free texts into computable formats. Domain-specific knowledgebases (including organelle specific databases [56], and cardiovascular disease specific knowledgebases [57] have also been developed to provide richer cardiovascular contexts in data interpretation.

Examples and frontiers in cardiovascular applications

The aforementioned analytical and computational advances have enabled novel and noteworthy applications in cardiovascular proteomics. An exciting trend is to venture beyond simply inventorying which proteins are present in the heart or the blood, and into quantifying their dynamic and spatiotemporal properties. Protein complexity necessitates that many parameters are needed to sufficiently describe the overall proteome in a particular physiological state. New methodologies continually arise that enable new insights into protein–protein interactions [58], protein homeostasis [59], and spatial distributions [60]. Many molecular parameters are now known to be directly involved in disease pathogenesis thanks to proteomics studies, including the PTMs of proteins, their spatiotemporal distributions, and interacting partners.

Quantifying diverse post-translational modifications

With increased experimental power to detect peptides, rare and hard-to-detect peptides are increasingly analyzable, including many modified by PTMs. Because translational modifications are attached to proteins following synthesis, their chemical identity, position, and fractional quantity cannot be easily predicted from transcripts, necessitating proteomics studies. PTMs have been the focus of proteomics studies for over a decade and these efforts have increasingly begun to bear fruit in various biomedical investigations. In a recent notable example, Lee et al. [61] used a global approach was used to discern the roles of phosphodiesterase (PDE) in cyclic guanosine monophosphate (cGMP) degradation in cardiac signaling. Although once assumed to be a common pathway acting through a single secondary messenger, the subcomponents are modulated by two different enzymes PDE5A and PDE9A at different cellular locations. A global, high-throughput phosphoproteomics profiling approach was used to differentiate the downstream signaling targets of the two pathways, allowing their precise regulations to be elucidated and subclassified. Therapeutic decision and precision medicine may be informed by targeting PDE9A versus PDE5A with different pharmacological compounds.
Evidence suggests that the current investigations into PTMs have barely scratched the surface of their complexity. Over 380,000 PTM events are documented on the PhosphoSitePlus database [48], including acetylation, di-methylation, mono-methylation, O-GlcNAcylation, phosphorylation, sumoylation and ubiquitiylation, etc. on a variety of proteins. But many additional, unknown modifications likely lurk in acquired spectra that await identification, which are the subject of ongoing developments such as using cascade search, open search, or spectral clustering approaches [6263]. In addition to classical studies of phosphorylation and ubiquitination, improved methods to isolate and identify PTMs have fueled investigations into many different kinds of modifications including glycosylation, acetylation, sumoylation, and oxidative modifications that are now known to play critical and indispensable roles in the regulations of core aspects of cardiac physiology. Publications from multiple groups have led to increasing appreciation of the fine regulations of oxidative cysteine modifications (including disulfide bridge, S-nitrosylation, and S-glutathionylation) in cardiac redox regulation [64]. Examples include the discovery of S-nitrosylation at TRIM72 in regulating ischemic injury [65], and the unexpected “moonlighting” of GAPDH in the mitochondria to confer S-nitrosylation in the heart [66]. An increasing number of other examined modifications are likewise now implicated in important cardiac processes, at a rate that far exceeds that which would be attainable in traditional single-target, hypothesis-driven investigations. These include for instance the role of acetylation in the context of mitochondrial metabolism [6768]; sumoylation in the context of heart failure [69]; O-GlcNAcylation in the context of diabetic hearts [70], and lysine succinylation in the context of ischemic injury [71]. These studies represent a broadening of our observable universe, and are driven by both advances in specific purification or labeling strategies, as well as a general increase in MS instrumentation and data analysis maturity that together propel the experimental scope, scale and reproducibility past the threshold of informativeness.

Tracing protein spatiotemporal distributions

The function and functionality of a protein are modulated to a great extent by the spatial milieu in which it is situated, which in turn dictates the substrates and interacting partners with which it comes across. Ongoing studies into the spatial distributions of cardiac proteins are inventorying the protein compositions in different cardiac organelles, with particular successes in cardiac mitochondria and nucleus proteomes, as elucidated via targeted enrichment of specific organelles. More recent studies in other organs are suggesting the possibility that proteins are actively and constantly translocalizing between organelles in response to cues, which can be measured by combining differential centrifugation, isotope labeling, and machine learning techniques. Recent advances in data analysis have allowed such differential centrifugation techniques to be used for pan-organellar mapping. Instead of enriching only for a pure sample of a particular organelle type, a centrifugation gradient here is matched to a supervised classification algorithm to classify proteins based on their sedimentation behaviors. The average intracellular position of many proteins can therefore be discovered by their grouping with known organellar markers. These approaches can be readily adopted to understand dynamic protein translocalization from one organelle to another, using new analytical frameworks that can quantify protein translocation in differential centrifugation experiments [6072]. Through proteomics studies, it is also demonstrated that many proteins important in the heart can have multiple localizational isoforms that carry out different functions [73].
Secondly, the synthesis rates of proteins have also proven important to tracing the progression of cardiac hypertrophy preceding heart failure. Interests in classical physiology to understand skeletal and cardiac muscle mass gain have propelled technological and software developments aimed at understanding protein turnover, which can be applied to other fields. There have been particular developments in isotope labeling and kinetic modeling methods, which have elucidated the half-life of proteins in many various cellular compartments in the heart [74757677]. Developments in data analysis methods are particular important in this area, as although stable isotope administration and mass spectrometry approaches have been in use for decades, MS data measuring the rate of isotope incorporation cannot be efficiently analyzed on a very large scale without specialized software that can deconvolute isotope patterns and fit massive datasets to kinetic models [7879]. Following these advances, in vivo studies are revealing a previously unknown regulatory layer and architecture of the proteome in which functionally associated proteins share more synchronous turnover rates. During disease development, protein pathways have been found to deviate from physiological baseline via elevated protein replacement but not any apparent change in steady-state abundance, a result consistent with increased protein synthesis counterpoised by increased degradation [79]. Hence the measurement of half-life is also pursued to identify candidate disease proteins and from which to infer significantly dysregulated biological processes during pathogenesis.

Mapping protein–protein interaction networks

There have been marked improvements in experimental protocols in affinity purification, as well as statistical and data science methods to filter out false positives in pulldown experiments. Chiang et al. recently elucidated the interactome of the protein phosphtase 1 catalytic subunit (PP1c), identifying 78 interacting partners in human heart. The proteomics results found increased binding to PDE5A in paroxysmal atrial fibrillation patients to impair proteins involved in electrical and calcium remodeling, a result that has implications in the understanding and treatment of atrial fibrillation [80]. Waldron et al. identified the TBX5 interactome in the developing heart to discover its interactions with the repressor complex NuRD, elucidating the mechanisms by which TBX5 mutations can influence cardiac development and confer congenital heart diseases. The accretion of public-domain protein–protein interactome data are also serving as a permanent resource that benefits other investigators outside the proteomics field, and in one but many recent examples provided important context to systems genetics experimental data to evidence the involvement of an interacting cilia protein network in congenital heart diseases [81]. More recently, the CoPIT method extends the scope of comparison to degrees of interactions among samples across cell states with more rigorous statistics, and is particularly notable in its suitability for quantifying differential interactomes of membrane proteins in human diseases [47]. Potential protein–protein interactions can now also be predicted in silico and de novo using machine learning algorithms that take in experimental data and auxiliary information [82].
At the same time, there is renewed interest to perform crosslinker studies on a large scale, which in addition to identifying protein–protein interaction partners, can provide information on the topology and protein domains involved in the interactions. Again we note that the development of new proteomics methodologies now necessitates hand-in-hand advances of novel data science solutions almost without exception. An example is the application of chemical cross-linkers in proteomics, which allows the linking of proximal proteins to quantify the degrees and likelihood of protein–protein interactions in their native cellular environment. Cross-linking proteomics experiments are however infeasible without specialized search engines that can consider the combinatorics of crosslinked peptide sequences, and identify interacting proteins whilst controlling for the FDR that result from the quadratic increase of search space [8384].

Outlooks: emergence of proteomics big data

Quantitative shotgun proteomics has developed into a remarkably powerful technology that enables sophisticated questions on cellular physiology to be asked. The total volume of proteomics data generated per year now ranges in the petabytes. This is paralleled by an increasing number of available proteomics datasets in the public domain that can be reused and reanalyzed, with as many as 100 new datasets being made available per month on the proteomics data repository PRIDE [85]. Hence joining next-generation genomics, proteomics has become a veritable source of biomedical “big data”. As our capacity for data generation surges, opportunities for breakthroughs will increasingly come from not how much more data we can generate, but how well we can make sense of the results. As a corollary, the need for proteomics big data solutions is poised to skyrocket in the coming few years, where new resources, tools, and ways of doing science are needed to rethink how best to harness datasets and discern deeper meanings. The production of biological knowledge will involve tools and solutions devised in the field of data science, including those concerning data management, multivariate analysis, statistical learning, predictive modeling, software engineering, and crowdsourcing. Several current limitations and possible future frontiers, out of many, are discussed below:
Despite impressive gains, improvement of protein identification will likely continue to be an area of active research. It is estimated that up to 75–85 % of mass spectra generated in a proteomics experiment can remain unidentified by current data analysis workflows [6286], thus leaving room for continuous growth through better bioinformatics in the near future. Currently the unidentified “junk” spectra are mostly siloed or discarded, thus they constitute a major untapped source of biomedical big data. More inclusive search criteria (e.g., considering non-tryptic cleavage) can enhance identification, but there also exists a substantial portion of spectra that represent bona fide peptides not amenable to existing methods. These include peptides too short to score well in searching algorithms (≤5 amino acids), and peptides that are absent from protein sequence databases, e.g., variant peptides from polymorphisms or unknown splice isoforms. The advent of massive publicly available datasets has opened new avenues to tackle this problem [6263]. For instance, the millions of unidentified spectra that are uploaded to the PRIDE proteomics data repository may be systematically sorted and clustered, then analyzed via more exhaustive search protocols to identify what peptides are commonly present but unidentified across datasets and experiments.
Secondly, advances in protein quantification techniques, via both experimental and computational developments, will likely continue unabated. Many quantification techniques do not take into account the peptides that may become post-translationally modified or otherwise lost in a biological state. Currently, decreases in label-free measured quantity may be confounded by differences in protein modifications, digestion, or ionization, or matrix effects from different samples. For instance, the acquired spectral counts may be inflated by the existence of shared peptides among multiple (documented or undocumented) protein forms [87] as well as the sampling saturation for high-abundance peptides [88]. Statistical approaches pioneered in transcriptomics may be useful which can take into account the many-to-many mappings between proteins and peptides and to reconstruct proteoforms from individual peptide signals.
Lastly, the identification of unknown or unspecified PTMs will likely see continued progress. Because the multiplicity of the possible modification types on a peptide can shift peptide fragment masses combinatorially, they can greatly inflate the number of possible matches. Efforts are underway to develop custom sequence databases and devise new algorithms to extract information from existing cardiovascular proteomics datasets that is currently “hiding in plain sight”. For instance, algorithms can be used to predict peptide fragment intensity to improve peptide identification [89]. To make PTM search computationally tractable, multi-pass or cascade search approaches have been implemented that restrict the possibility of modified peptides to only within proteins that were preliminarily identified in the initial search. To improve peptide identification, “spectral libraries” including library modules for the cardiovascular system have been constructed that contain previously identified spectra, against which new experimental spectra can be directly matched for identification [579091]. Because spectral libraries contain only a small subset of all theoretical sequences, and contain precise peptide fragment intensity in addition to ion masses, library search can lead to faster and more accurate identification.
To summarize, we recall an apt analogy provided by Loscalzo to compare the understanding of cardiac proteome with that of building a house and the genome with that of its floor plan [92]. Genomics kick-started the era of high-throughput omics investigations, but building a house requires more than just the blueprint; the complexity of protein regulation and pathway functions is better approached with proteomics. Technological advances in the last decade have gained tremendous power to discover finer minutiae of the house of the cardiac proteome. Success in the next 5 years will likely come from interfacing big proteomics data and computational approaches to distill regulatory principles and to support diagnostic/prognostic process from seemingly overwhelming information. The notion that existing data contain additional latent information that may be extracted to answer future questions is a fundamental tenet of big data science. Extrapolating from current developments, one can envision sophisticated discovery-driven studies in cardiac biomedicine, where original research projects may be initiated by data scientists using publicly-accessible proteomics datasets to ask new and unanticipated questions. A vibrant data science culture that promotes interactions between data generators and informaticians will facilitate the design and validation of computational methods and promote continued development in proteomics.