The first time you receive the Excel file(s) summarizing your search results, you might feel confused. Don’t worry: in my experience, you are not alone! This post will guide you through the report.
My file naming system is as follows: date-sample name-number in the queue, with the date written as YY-MM-DD. For example, a positive control for your sample analyzed on January 29, 2014 would be named “14-01-29-CTRL-04”.
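If you want to sort or filter your files programmatically, the naming scheme above can be split into its parts. This is only a minimal sketch: the function and field names are my own illustration, and it assumes a two-digit year means 20YY and that the queue number is always the last hyphen-separated field.

```python
from datetime import date
from typing import NamedTuple


class RunName(NamedTuple):
    run_date: date
    sample: str
    queue_position: int


def parse_run_name(name: str) -> RunName:
    """Split a 'YY-MM-DD-sample-NN' file name into date, sample, and queue number."""
    parts = name.split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected file name: {name!r}")
    # The first three fields form the date (assumption: years are 20YY).
    yy, mm, dd = (int(p) for p in parts[:3])
    # The last field is the queue number; everything in between is the
    # sample name, which may itself contain hyphens.
    sample = "-".join(parts[3:-1])
    return RunName(date(2000 + yy, mm, dd), sample, int(parts[-1]))


run = parse_run_name("14-01-29-CTRL-04")
print(run.run_date.isoformat(), run.sample, run.queue_position)
```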
You should receive twice as many files as the number of samples you submitted, because each sample is always injected after a blank run. This is done to account for potential carryover from previous sample injections, which is unavoidable in a service facility environment. Thus, “14-01-29-blank-01” precedes your first sample, “14-01-29-TNL1-02”, and so on. I may also include my standard data so that you can see what type of data is obtained using a pure standard. Blanks, samples, your controls, and my standards are always run using the same instrument parameters.
When you open your Excel file, you should see a list of proteins, each of which has the following parameters:
UniProtKB protein accession number, the unique identifier assigned to the protein by the FASTA database used to generate the report. To find the NCBInr equivalent, copy the accession number and paste it into the NCBInr search, selecting ‘protein’ from the drop-down list.
UniProtKB protein description. Provides the name of the protein exclusive of the identifier that appears in the Accession column.
The protein score, which is the sum of the scores of the individual peptides. I use the SEQUEST search algorithm, for which the score is the sum of all peptide Xcorr values above the specified score threshold. The score threshold is calculated as follows: 0.8 + peptide_charge × peptide_relevance_factor, where peptide_relevance_factor is a parameter with a default value of 0.4. For each spectrum and sequence, the Proteome Discoverer application uses only the highest-scoring peptide. When it performs a search using dynamic modifications, one spectrum might have multiple matches because of permutations of the modification site. (The higher the better)
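To make the threshold rule concrete, here is a small sketch of the arithmetic described above. The function names and the (Xcorr, charge) input format are my own illustration, not part of Proteome Discoverer, and the sketch assumes that the list already contains only the highest-scoring peptide for each spectrum and sequence.

```python
def score_threshold(charge: int, relevance_factor: float = 0.4) -> float:
    """Threshold from the text: 0.8 + peptide_charge * peptide_relevance_factor."""
    return 0.8 + charge * relevance_factor


def protein_score(peptides, relevance_factor: float = 0.4) -> float:
    """Sum the Xcorr values that exceed the charge-dependent threshold.

    `peptides` is an iterable of (xcorr, charge) pairs (illustrative format).
    """
    return sum(
        xcorr
        for xcorr, charge in peptides
        if xcorr > score_threshold(charge, relevance_factor)
    )


# Hypothetical peptides for one protein: (Xcorr, charge)
example = [(2.5, 2), (1.2, 2), (3.1, 3)]
print(protein_score(example))
```

With the default factor, the threshold is 1.6 for a doubly charged peptide and 2.0 for a triply charged one, so the peptide with Xcorr 1.2 in this made-up example does not contribute to the protein score.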
The percent coverage, calculated by dividing the number of amino acids in all found peptides by the total number of amino acids in the entire protein sequence. (The higher the better)
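The coverage calculation can be sketched as follows. This is an illustration rather than the exact Proteome Discoverer implementation: I assume that each residue is counted only once even if overlapping peptides cover it, and that peptides are located by plain substring matching. The sequence and peptides are made up.

```python
from typing import Sequence


def percent_coverage(protein: str, peptides: Sequence[str]) -> float:
    """Percentage of protein residues covered by at least one found peptide."""
    if not protein:
        raise ValueError("protein sequence is empty")
    covered = [False] * len(protein)
    for peptide in peptides:
        # Mark every occurrence of the peptide in the protein sequence.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Two overlapping peptides cover residues 3-10 of a 10-residue sequence.
print(percent_coverage("MKTAYIAKQR", ["TAYIAK", "IAKQR"]))  # 80.0
```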
The number of identified proteins in the protein group of a master protein. Proteins are grouped based on sequence homology and/or isoforms as explained below.
# Unique peptides
The number of peptide sequences unique to a protein group.
The number of distinct peptide sequences in the protein group.
The total number of identified peptide sequences (peptide spectrum matches) for the protein, including those redundantly identified. (The higher the better)
# AAs, MW [kDa], calc. pI
The calculated parameters of the protein based on the amino acid sequence in the FASTA database used to generate the report. The Proteome Discoverer application calculates the molecular weight without considering post-translational modifications. If you have separated proteins by molecular weight by PAGE, you can use the protein’s molecular weight as a rough constraint to estimate whether it is reasonable to identify a particular protein in a certain fraction that was analyzed.
Next, expand the sheet by clicking on [+], which opens the column parameters for the associated peptides.
A2 or other Letter/number, first column
The top-level confidence achieved for the peptide sequence: high, medium, or low confidence. I send you only the high-confidence data unless instructed otherwise.
The sequence of amino acids that compose the peptide.
The total number of identified peptide sequences (PSMs) for the protein, including those redundantly identified. (The higher the better)
The number of proteins in which this peptide is found.
# Protein Groups
The number of protein groups in which this peptide is found.
MS/MS-based proteomics studies are based on peptides. However, deducing protein identities from a set of identified peptides can be difficult because of sequence redundancy, such as the presence of proteins that share peptides. These redundant proteins are automatically grouped and are not initially displayed in the search results report.
The proteins within a group are ranked according to the number of peptide sequences, the number of PSMs, their protein scores, and the sequence coverage. The top-ranking protein of a group becomes the master protein of that group. By default, only the master proteins are displayed on the Proteins page.
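The ranking described above can be sketched as a multi-key sort. I am assuming here that the four criteria are applied in the order listed, each one breaking ties left by the previous one, and that higher values rank first. The class, field names, and accessions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinHit:
    accession: str      # hypothetical accession, for illustration only
    n_peptides: int     # number of peptide sequences
    n_psms: int         # number of peptide spectrum matches
    score: float        # protein score
    coverage: float     # sequence coverage in percent


def rank_group(group: List[ProteinHit]) -> List[ProteinHit]:
    """Order a protein group so that the assumed master protein comes first."""
    return sorted(
        group,
        key=lambda p: (p.n_peptides, p.n_psms, p.score, p.coverage),
        reverse=True,
    )


group = [
    ProteinHit("A", 5, 12, 40.0, 30.0),
    ProteinHit("B", 5, 15, 38.0, 28.0),
    ProteinHit("C", 3, 20, 60.0, 50.0),
]
master = rank_group(group)[0]
print(master.accession)  # B
```

In this made-up group, A and B tie on the number of peptide sequences, so the number of PSMs decides in favor of B. C has the highest score and coverage but fewer peptide sequences, so it ranks last.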
A protein group consists of the following:
- One master protein that is identified by a set of peptides that are not included (all together) in any other protein group.
- All proteins that are identified by the same set or a subset of those peptides.
The # Proteins column on the Proteins and Peptides pages of the results report displays the number of identified proteins in the protein group of a master protein.
Protein Group Accessions
The unique identifiers (accessions) of all master proteins from all protein groups that include this peptide sequence. Since I normally group proteins by selecting the “Consider Leucine and Isoleucine as Equal” option, this column also lists identifiers from master proteins that may include this specific peptide sequence. The identifiers displayed in the Protein Group Accessions column are the same as those displayed in the Accession column on the Proteins page.
The static and dynamic modifications identified in the peptide. I always use iodoacetamide for Cys alkylation, and this static modification will be in your search results as ‘Carbamidomethyl’ unless you modified your Cys residues with a different reagent. Met oxidation and Asn and Gln deamidation are common dynamic modifications.
The normalized score difference between the currently selected PSM and the highest-scoring PSM for that spectrum. (The lower the better)
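As a rough illustration of this normalized difference, the sketch below assumes the commonly used definition: the gap between the best Xcorr for a spectrum and the Xcorr of a given PSM, divided by the best Xcorr. The exact normalization used by the software is an assumption on my part, and the Xcorr values are made up.

```python
from typing import List, Sequence


def delta_cn(xcorrs: Sequence[float]) -> List[float]:
    """Normalized score difference of each PSM to the best PSM of one spectrum."""
    best = max(xcorrs)
    if best <= 0:
        raise ValueError("the best Xcorr must be positive")
    # Assumed definition: (best Xcorr - this Xcorr) / best Xcorr
    return [(best - x) / best for x in xcorrs]


# Hypothetical candidate matches for a single spectrum
print(delta_cn([3.0, 2.4, 1.5]))
```

Under this definition, the top-scoring PSM always gets 0, which is why a lower value is better.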
A search-dependent score. It scores the number of fragment ions that are common to two different peptides with the same precursor mass and calculates the cross-correlation score for all candidate peptides queried from the database by SEQUEST searches. (Jimmy K. Eng, Ashley L. McCormack, and John R. Yates, III; An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989) (The higher the better)
The probability score for the peptide. This score is an assessment of the probability that the reported match is a random occurrence. A lower probability score indicates a better match. (The lower the better)
The charge state of the peptide, z (z is always greater than 1 as set during the MS analysis).
The calculated m/z of the singly protonated peptide (z = 1). Strictly speaking, the column label should read “MH+ [m/z]”, not “MH+ [Da]”.
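If you want to compare this singly charged value with the m/z actually observed at a higher charge state, the conversion is simple arithmetic. The sketch below uses the standard proton mass; the function names are my own, and the numbers are made up.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def mh_plus(mz: float, charge: int) -> float:
    """Convert an m/z observed at a given charge to the singly charged value."""
    return mz * charge - (charge - 1) * PROTON_MASS


def mz_from_mh_plus(mh: float, charge: int) -> float:
    """Convert a singly charged value (MH+) to the m/z expected at a given charge."""
    return (mh + (charge - 1) * PROTON_MASS) / charge


# A hypothetical doubly charged precursor observed at m/z 500.0
print(round(mh_plus(500.0, 2), 6))  # 998.992724
```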
delta M [ppm]
The mass measurement error in parts per million (ppm). (The closer to zero the better)
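For reference, a ppm error is the difference between the measured and the calculated mass, relative to the calculated mass and scaled by one million. A minimal sketch with made-up numbers (the function name is my own):

```python
def ppm_error(measured: float, calculated: float) -> float:
    """Mass measurement error in parts per million; the sign shows the direction."""
    return (measured - calculated) / calculated * 1e6


# A measured value 0.002 above a calculated value of 1000.0 is about 2 ppm off.
print(round(ppm_error(1000.002, 1000.0), 3))  # 2.0
```

The value can be negative when the measured mass is below the calculated one, so it is the absolute value that should be small.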
The peptide’s retention time during chromatographic separation.
# Missed Cleavages
The number of cleavage sites in a peptide sequence that a cleavage reagent (enzyme) did not cleave. This number excludes cases where an amino acid (e.g. Pro) inhibits the cleaving enzyme (e.g. trypsin).
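Counting missed cleavages for a tryptic peptide can be sketched as below. I assume the simple trypsin rule: cleavage after Lys (K) or Arg (R) unless the next residue is Pro (P), with the C-terminal residue never counted because nothing follows it. Other enzymes would need different rules, and the sequences are made up.

```python
def count_missed_cleavages(peptide: str) -> int:
    """Count internal K/R residues that trypsin could have cleaved but did not."""
    missed = 0
    # Stop before the last residue: a C-terminal K or R is a used cleavage site.
    for i in range(len(peptide) - 1):
        if peptide[i] in "KR" and peptide[i + 1] != "P":
            missed += 1
    return missed


print(count_missed_cleavages("AAKGGR"))   # 1
print(count_missed_cleavages("AAKPGGR"))  # 0, because K is followed by P
```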
And this is just the beginning. Stay tuned for more of the proteomics data fun!