Efficiently and accurately analyzing large protein tandem mass spectrometry data sets requires robust software that incorporates state-of-the-art computational, machine learning, and statistical methods. The Crux mass spectrometry analysis software toolkit (http://cruxtoolkit.sourceforge.net) is an open-source project that aims to provide users with a cross-platform suite of analysis tools for interpreting protein mass spectrometry data.
Spectrum annotation is a challenging task due to the presence of unexpected peptide fragmentation ions as well as the inaccuracy of the spectrometers' detectors. We present a deep convolutional neural network, called Slider, which learns feature extraction in its kernels for scoring tandem mass spectrometry (MS/MS) spectra in order to increase the number of spectrum annotations made with high confidence. Experimental results using publicly available data sets show that Slider annotates slightly more spectra than the state-of-the-art methods (BoltzMatch, Res-EV, Prosit) while running 2–10 times faster. More interestingly, Slider provides only 2–4% fewer spectrum annotations with low-resolution fragmentation information than other methods provide with high-resolution information. This means that Slider can exploit nearly as much information from the context of low-resolution spectrum peaks as high-resolution fragmentation information provides to other scoring methods. Thus, Slider can be an optimal choice for practitioners using old spectrometers with low-resolution detectors.
Interpreting the potentially vast number of hypotheses generated by a shotgun proteomics experiment requires a valid and accurate procedure for assigning statistical confidence estimates to the identified tandem mass spectra. Despite the crucial role such procedures play in most high-throughput proteomics experiments, the scientific literature has not reached a consensus about the best confidence estimation methodology. In this work, we evaluate, using theoretical and empirical analysis, four previously proposed protocols for estimating the false discovery rate (FDR) associated with a set of identified tandem mass spectra: two variants of the target-decoy competition (TDC) protocol of Elias and Gygi and two variants of the separate target-decoy search protocol of Käll et al. Our analysis reveals significant biases in the two separate target-decoy search protocols. Moreover, the one TDC protocol that provides an unbiased estimate of the FDR among the target peptide-spectrum matches (PSMs) does so at the cost of forfeiting a random subset of high-scoring spectrum identifications. We therefore propose the mix-max procedure to provide unbiased, accurate FDR estimates in the presence of well-calibrated scores. The method avoids the biases associated with the two separate target-decoy search protocols and also avoids the propensity of target-decoy competition to discard a random subset of high-scoring target identifications.
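
To make the target-decoy competition idea concrete, the following is a minimal Python sketch of a simple TDC-style FDR estimate. It is not the mix-max procedure and is not taken from any of the works above. The function name `tdc_fdr`, the input layout (one best target score and one best decoy score per spectrum), the tie handling, and the plain decoys-over-targets ratio are all assumptions made for illustration.

```python
from typing import Sequence


def tdc_fdr(
    target_scores: Sequence[float],
    decoy_scores: Sequence[float],
    threshold: float,
) -> float:
    """Estimate the FDR among target PSMs at a score threshold (illustrative).

    target_scores[i] and decoy_scores[i] are assumed to be the best target
    and best decoy match scores for spectrum i, with higher meaning better.
    """
    if len(target_scores) != len(decoy_scores):
        raise ValueError("expected one target and one decoy score per spectrum")

    n_targets = 0  # accepted PSMs where the target won the competition
    n_decoys = 0   # accepted PSMs where the decoy won the competition

    for target, decoy in zip(target_scores, decoy_scores):
        # Competition step: only the higher-scoring match per spectrum survives.
        # Assumption: ties count as decoy wins, a conservative simplification.
        if target > decoy:
            if target >= threshold:
                n_targets += 1
        elif decoy >= threshold:
            n_decoys += 1

    if n_targets == 0:
        return 0.0
    # Surviving decoys stand in for the number of incorrect target PSMs.
    return min(1.0, n_decoys / n_targets)


# Hypothetical scores for four spectra.
targets = [5.0, 4.0, 3.0, 1.0]
decoys = [1.0, 2.0, 3.5, 2.0]
estimate = tdc_fdr(targets, decoys, threshold=2.5)
```

In this toy input, the third spectrum has a target score of 3.0, which is above the threshold, yet that target is dropped because its decoy scored 3.5. This is the kind of forfeited high-scoring target identification that the abstract attributes to target-decoy competition.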