Authors: Diogo B. Lima, Tatiani B. de Lima, Tiago S. Balbuena, Ana Gisele C. Neves-Ferreira, Valmir C. Barbosa, Fabio C. Gozzo & Paulo C. Carvalho
State-of-the-art structural mass spectrometry-based proteomics is often accomplished by using cross-linkers to covalently bind two or more amino-acid groups. This strategy is complementary to classical structural biology and therefore broadens the toolset for analyzing protein and protein-protein complex structures. One of the greatest challenges in identifying cross-linked peptides in complex protein mixtures is computationally dealing with the large search space, which grows quadratically with each peptide included in the sequence database. The Spectrum Identification Machine for Cross-Linked Peptides (SIM-XL) software uses an algorithm that overcomes this limitation by capitalizing on experimental features that allow it to effectively address the massive combinatorial problem at hand, and presents the results in a user-friendly manner. Thus, SIM-XL is recommended for studies dealing with protein structure and protein-protein interaction in either simple or complex protein mixtures. SIM-XL also allows the sharing of results through PRIDE by exporting them in the mzIdentML format.
One of the goals of systems biology is to determine how a system works, beginning at the molecular-level characterization of proteins and their interactions up to the level of cellular physiological pathways (1). In a broad sense, the determination of protein structures and protein-protein interactions has an immediate impact on many biological and biotechnological fields, including medicinal chemistry, immunology, and molecular medicine, to name a few. For example, the determination of a protein’s three-dimensional structure allows us to better understand its biological function and consequently opens the doors to the design of new drugs or even provides new insights into how to completely engineer new proteins to fulfill specific biological functions (2). This qualifies the understanding of cell biology, at the atomic level, as a key to answering a number of important biological questions and thus providing a roadmap for numerous biotechnological applications such as the design of new drugs. The current gold-standard methods for determining a protein’s structure are X-ray diffraction (3) and nuclear magnetic resonance (NMR) (4). However, the majority of proteins and their complexes are not amenable to these strategies because they do not crystallize, cannot be obtained in the large amounts of high-purity protein required, or are simply too large to be analyzed (5). These facts make evident the need for developing novel structural approaches that are applicable to a larger number of systems.
Chemical cross-linking coupled to high-resolution mass spectrometry (XL-MS) has become a key method to broaden the toolset for protein structural characterization and for determining protein-protein interactions. During sample preparation, synthetic cross-linkers are added; these covalently link the side chains of amino acids in proteins and/or their complexes. The sample is then digested, and the cross-linked peptides are identified by analyzing the mixture using tandem mass spectrometry. The covalently linked peptides carry a unique piece of information, i.e., distance constraints, which allows for further elucidation of tertiary structures and protein-protein interactions (6,7).
In this protocol we describe the key steps for using the Spectrum Identification Machine for Cross-linked Peptides (SIM-XL), a fast and sensitive XL search engine that is part of the PatternLab for proteomics environment (8), to analyze tandem mass spectrometry data derived from cross-linked peptides. A video demonstrating SIM-XL v1.0 in action is available at http://patternlabforproteomics.org/sim-xl/video.
Download SIM-XL by clicking on the Download button at http://patternlabforproteomics.org/sim-xl.
The following workflow demonstrates how to perform a search using the SIM-XL search engine.
2.1. Execute the Spectrum Identification Machine for Cross-linked Peptides (Figure 1)
Figure 1: Graphical User Interface for the main window of SIM-XL.
2.2. Specify the protein sequence database file by clicking on the Browse button. The file extension has to be .FASTA, .T-R, or .T. We refer the reader to Basic Protocol 1: Preparing a sequence database to be searched by Prolucid or the academic Sequest (8) for more on how to generate target-decoy (.T-R) databases compatible with PatternLab.
2.3. Specify a directory containing XL-MS data in any of the following formats: mzML 1.1.0, MS2, MGF, or Thermo RAW. If Recursive Search is checked, all subdirectories will be searched.
2.4. Select one of the pre-registered cross-linkers in the drop-down list. By default, there are five cross-linkers registered:
Optionally including a new cross-linker in the XL library
2.4.1. To register a new cross-linker, click on the Edit button beside the Cross-linker field and the XL library window will pop up (Figure 2).
Figure 2: XL library. A cross-linker can be inserted or removed in this window.
2.4.2. Fill out the fields XL Name, XL Mass Shift (reaction XL mass) in Daltons, Reaction Sites, and optionally, the Modification Mass Shift and Reporter Ions masses.
2.4.2.1. XL Name is a user-defined identifier for the cross-linker.
2.4.2.2. XL Mass Shift is the net mass of the cross-linker that will be added to the peptide masses.
2.4.2.3. Reaction Sites are all the combinations of amino acids that react with the cross-linker. If the cross-linker reacts with the N-terminus, the keyword N-TERM should be included. For example, the entry for the DSS cross-linker should be KK KS SS KN-TERM SN-TERM. Similarly, C-terminal reactivity can be specified using the keyword C-TERM. Note that reaction sites must be separated by a single space. Also, when N-TERM is specified SIM-XL performs an unspecific digestion on the first 20 amino acids of all database entries, starting with Methionine; this is done to address the possibility of signal peptides, frequently present in public databases.
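The Reaction Sites notation can be illustrated with a short parsing sketch. This is not SIM-XL's own parser; the function name parse_reaction_sites and the error handling are assumptions made for illustration. It only shows how an entry such as the DSS example splits into pairs of reactive sites, with N-TERM and C-TERM treated as whole keywords.

```python
TERMINAL_KEYWORDS = ("N-TERM", "C-TERM")


def parse_reaction_sites(entry):
    """Split a Reaction Sites entry into (site, site) pairs.

    Hypothetical helper: each space-separated token must describe
    exactly two sites, where a site is either a one-letter amino acid
    or one of the keywords N-TERM / C-TERM.
    """
    pairs = []
    for token in entry.split(" "):
        if not token:
            # An empty token means two consecutive spaces were used.
            raise ValueError("reaction sites must be separated by a single space")
        sites = []
        rest = token
        while rest:
            # Check keywords first so that "N-TERM" is not read as asparagine.
            keyword = next((k for k in TERMINAL_KEYWORDS if rest.startswith(k)), None)
            if keyword:
                sites.append(keyword)
                rest = rest[len(keyword):]
            else:
                sites.append(rest[0])
                rest = rest[1:]
        if len(sites) != 2:
            raise ValueError(f"expected two sites in {token!r}, got {sites}")
        pairs.append(tuple(sites))
    return pairs


dss_pairs = parse_reaction_sites("KK KS SS KN-TERM SN-TERM")
```

For the DSS entry, dss_pairs holds five pairs, the last two pairing lysine and serine with the N-terminus.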
The optional Modification Mass Shift field defines an artificial modification caused by the cross-linker. For example, DSS/BS3 can react with a single lysine residue, generating the so-called dead-end modification. If this field is filled out and the Dynamic DB Reduction box (see item 2.10.5) is checked, SIM-XL will use Comet (12) to identify the modified peptides and use this information to only consider theoretical cross-linked peptides that had at least one chain identified with the modification. Consequently, this reduces the search space and is thus indicated when searching complex samples. If the Modification Mass Shift field is left blank, SIM-XL will always consider all possible combinations of cross-linked peptides.
The optional Reporter Ions field defines the m/z of fragments that are specific to cross-linked peptides (13). If these m/z values are given, SIM-XL will only search MS/MS spectra containing at least one of the corresponding fragments, thus speeding up the search; otherwise, SIM-XL will search all spectra.
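The reporter-ion pre-filter described above can be sketched as follows. The function names, the dictionary layout of a spectrum, the 20 ppm matching tolerance, and the example m/z values are all assumptions made for illustration; the protocol does not state how SIM-XL matches reporter ions internally.

```python
def has_reporter_ion(peaks_mz, reporter_mzs, tolerance_ppm):
    """Return True if any peak lies within tolerance_ppm of any reporter m/z."""
    for reporter in reporter_mzs:
        window = reporter * tolerance_ppm / 1e6  # absolute tolerance in m/z units
        if any(abs(mz - reporter) <= window for mz in peaks_mz):
            return True
    return False


def select_spectra(spectra, reporter_mzs, tolerance_ppm=20.0):
    """Keep only spectra with at least one reporter ion.

    With no reporter ions defined, every spectrum is kept, mirroring
    the behaviour described for a blank Reporter Ions field.
    """
    if not reporter_mzs:
        return list(spectra)
    return [s for s in spectra if has_reporter_ion(s["mz"], reporter_mzs, tolerance_ppm)]


# Illustrative peak lists and reporter m/z (placeholder values).
spectra = [
    {"id": 1, "mz": [150.0, 222.1492, 500.3]},
    {"id": 2, "mz": [150.0, 500.3]},
]
kept = select_spectra(spectra, reporter_mzs=[222.149])
```

Only spectrum 1 survives the filter here, which is where the speed-up comes from: spectra without a reporter ion are never scored.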
2.4.3. To finish the new cross-linker inclusion, click on the Update button and then on the OK button. This will make the new cross-linker available within the library and usable for searching.
2.4.4. In order to delete a cross-linker entry from the library, the whole line must be selected (by clicking on the row’s header cell) and then the DEL key must be pressed, followed by a click on the Update button and another on the OK button.
2.5. Select an enzyme from the drop-down enzyme list. By default, Trypsin and Lys-C are registered.
Optionally including a new enzyme in the Enzyme library
2.5.1. To include a new enzyme, click on the Edit button beside the Enzyme field; the enzyme library window will pop up (Figure 3). In an empty line, complete the corresponding fields with the enzyme’s name and a regular expression encoding the enzymatic cleavage.
Figure 3: Enzyme library. An enzyme can be inserted or removed in this window. A regular expression is required to specify the cleavage sites of a new enzyme. For example, the regular expressions for Trypsin and Lys-C are ‘[KR]’ and ‘[K]’, respectively. For more on building regular expressions we refer the reader to http://www.regular-expressions.info/.
2.5.2. In case one wishes to remove an enzyme from the library, the whole line must be selected (by clicking on the row’s header cell) and then the DEL key must be pressed, followed by a click on the Update button and another on the OK button.
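To make the role of the cleavage regular expression concrete, here is a minimal in-silico digestion sketch. It assumes that the expression is a character class such as [KR] marking residues after which the enzyme cuts; how SIM-XL interprets the expression internally is not documented here, and the function name digest is hypothetical. Refinements such as suppressing cleavage before proline are not modeled.

```python
import re


def digest(protein, cleavage_regex, max_missed_cleavages=0):
    """Cut a protein after every match of cleavage_regex.

    Returns fully specific peptides, optionally including peptides
    that span up to max_missed_cleavages uncut sites.
    """
    cut_points = [m.end() for m in re.finditer(cleavage_regex, protein)]
    # Peptide boundaries: start of protein, each internal cut, end of protein.
    boundaries = [0] + [c for c in cut_points if c < len(protein)] + [len(protein)]
    peptides = []
    for i in range(len(boundaries) - 1):
        last = min(i + 2 + max_missed_cleavages, len(boundaries))
        for j in range(i + 1, last):
            peptides.append(protein[boundaries[i]:boundaries[j]])
    return peptides


tryptic = digest("AAKGGRCCKDD", "[KR]")
lys_c = digest("AAKGGRCCKDD", "[K]")
```

With these inputs, the trypsin-like expression cuts after both K and R, whereas the Lys-C expression leaves the arginine site intact, so GGRCCK stays in one piece.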
2.6. Choose Enzyme Specificity from the drop-down list; the options are: Semi-Specific or Fully Specific. The latter refers to peptides originating from a complete digestion (i.e., with enzyme cleavage sites at both the C- and the N-terminus). Semi-specific means that the constraint of having a cleavage site at one end is lifted. For example, in the sequence R.APBCK.A, where “.” denotes the occurrence of cleavage, selecting Semi-Specific will make SIM-XL consider A, AP, APB, APBC, K, CK, BCK, PBCK, and APBCK. Otherwise (i.e., if Fully Specific is selected), the search space is limited to APBCK.
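The semi-specific example can be reproduced with a few lines of code. This is only a sketch of the enumeration implied by the text (the function name is hypothetical): starting from a fully specific peptide, it keeps every prefix and every suffix, so that at least one end still coincides with an enzymatic cleavage site.

```python
def semi_specific_peptides(peptide):
    """List the peptides considered under semi-specific digestion.

    Every returned peptide shares its N-terminal or its C-terminal
    end with the fully specific peptide given as input.
    """
    n = len(peptide)
    # Prefixes keep the original N-terminal cleavage site.
    prefixes = [peptide[:i] for i in range(1, n)]
    # Suffixes keep the original C-terminal cleavage site.
    suffixes = [peptide[i:] for i in range(n - 1, 0, -1)]
    return prefixes + suffixes + [peptide]


candidates = semi_specific_peptides("APBCK")
```

For APBCK this yields the nine peptides listed in the step above, compared with a single peptide under Fully Specific; the search space therefore grows roughly linearly with peptide length for every fully specific peptide.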
2.7. Specify the Precursor and Fragment ppm tolerances.
2.8. Check the Deconvoluted MS/MS option if the spectra in the data files are deconvoluted (i.e., decharged and de-isotoped). We refer the reader to YADA as a tool for deconvoluting mass spectra (14).
2.9. To include a modification from the Modification library, select it from the drop-down list and then click on the OK button found in the Modifications group box (Figure 4).
Figure 4: Select a pre-defined modification or add a new one. To edit a pre-defined modification, click on the Edit button.
Optionally including new modifications in the Modification library
2.9.1. To include a new modification or edit an existing one, click on the Edit button.
2.9.1.1. A new window will open. To include a new modification, fill out the fields Name, monoisotopic Mass Shift, and Amino acid(s) that can carry the modification.
2.9.1.2. To delete a modification from the library, select the whole line by clicking on the row’s header cell, then press the DEL key, click on the Update button, and then on the OK button.
2.9.2. The user should indicate whether the modification is a variable one and whether it applies to the C-term and/or the N-term by checking the corresponding boxes. For example, if not all Methionines in the sample are expected to be oxidized, the modification should be checked as variable; however, for modifications that are expected in all occurrences of the amino acid, such as, say, carbamidomethylation of cysteine, the variable option should remain unchecked.
2.9.3. To remove a modification, select the desired one, then click on the Remove Modification button and confirm the exclusion.
2.10. The XL Advanced Parameters tab allows access to various parameters that are not usually required to be changed for XL-MS analyses. In this tab, the parameters are divided into three groups: SIM-XL Parameters, Dynamic DB Reduction (which uses Comet (12) for performing a preliminary search), and Common Parameters (indicating that the parameters belong to both the SIM-XL and the Comet search engines).
2.10.1. Number of Isotopic Possibilities: The precursor mass stored in raw data files may not correspond to the monoisotopic peak. This option lets the software test several candidate monoisotopic masses, which is required to identify the molecule but comes at the cost of enlarging the search space. For example, for a peptide with a monoisotopic mass of 4000 Da, the most intense peak in the isotopic envelope is M+2 (that is, approximately 4002 Da), which will most likely be selected as the precursor mass. If the number of possibilities is set to three, SIM-XL will search this MS/MS spectrum considering the precursor masses 4002, 4001, and 4000, plus or minus the given ppm tolerance. In this example, the correct monoisotopic precursor mass is 4000, so the spectrum can be correctly identified by SIM-XL. Setting a high number of isotopic possibilities increases the search space accordingly and negatively impacts SIM-XL’s sensitivity.
2.10.2. Minimum AA Residues per chain: Minimum number of amino acids a peptide should have to be considered a candidate for cross-linking.
2.10.3. Maximum results to report: Number of top-scoring XL candidates to be reported for each queried spectrum.
2.10.4. Intra-link charge: Maximum charge of precursor ions to be searched in an intra-molecular cross-link candidate. All candidates are also considered for the inter-molecular searches.
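The effect of the Number of Isotopic Possibilities parameter (item 2.10.1) can be sketched numerically. The function below is an illustration, not SIM-XL code: it assumes that each additional possibility subtracts one isotope spacing from the observed precursor mass and that the ppm tolerance is applied around each candidate. The default spacing of about 1.00335 Da is the mass difference between 13C and 12C; the text's example rounds it to 1 Da.

```python
ISOTOPE_SPACING = 1.00335  # approx. 13C - 12C mass difference, in Da


def candidate_precursor_windows(observed_mass, n_possibilities, tolerance_ppm,
                                spacing=ISOTOPE_SPACING):
    """Return (low, candidate, high) mass windows to be searched.

    The first candidate is the observed mass itself; each further
    candidate assumes the instrument picked a heavier isotopic peak.
    """
    windows = []
    for k in range(n_possibilities):
        candidate = observed_mass - k * spacing
        delta = candidate * tolerance_ppm / 1e6  # ppm tolerance in Da
        windows.append((candidate - delta, candidate, candidate + delta))
    return windows


# Reproduce the example from item 2.10.1 with the spacing rounded to 1 Da.
windows = candidate_precursor_windows(4002.0, 3, tolerance_ppm=10.0, spacing=1.0)
```

Each extra possibility adds one more mass window per spectrum, which is why large values enlarge the search space and can hurt sensitivity.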
Dynamic DB Reduction parameters
2.10.5. Enable: If enabled, SIM-XL will run Comet (12) to perform a preliminary search to identify peptides containing the modification specified in the cross-linker definition as modification mass shift (see item 2.4.2). These identifications are used to generate a theoretical cross-linked peptides database in which all entries contain at least one chain previously identified with the modification.
2.10.6. XCorr Threshold: Minimum Comet XCorr value for identifying peptides containing the modification mass shift specified in item 2.4.2.
2.10.7. Minimum number of peptides: Minimum number of peptides required to include a protein to be later used during the generation of the theoretical cross-linked peptide sequence database.
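The idea behind Dynamic DB Reduction can be illustrated with a toy candidate generator. This is a sketch under simplifying assumptions, not the SIM-XL or Comet implementation: peptide pairs are formed without checking reaction sites, the preliminary search results are given as hypothetical (peptide, XCorr) tuples, and the function names are invented for the example.

```python
from itertools import combinations_with_replacement


def peptides_with_modification(preliminary_hits, xcorr_threshold):
    """Keep peptides whose preliminary identification reaches the XCorr threshold."""
    return {peptide for peptide, xcorr in preliminary_hits if xcorr >= xcorr_threshold}


def candidate_pairs(peptides, modified_peptides=None):
    """Enumerate cross-linked peptide pairs.

    If modified_peptides is given, only pairs in which at least one
    chain was identified with the modification are kept.
    """
    pairs = combinations_with_replacement(sorted(set(peptides)), 2)
    if modified_peptides is None:
        return list(pairs)
    anchors = set(modified_peptides)
    return [(a, b) for a, b in pairs if a in anchors or b in anchors]


peptides = ["PEPA", "PEPB", "PEPC", "PEPD"]
hits = [("PEPA", 3.1), ("PEPC", 1.2)]  # hypothetical preliminary search output
anchors = peptides_with_modification(hits, xcorr_threshold=2.5)
all_pairs = candidate_pairs(peptides)
reduced_pairs = candidate_pairs(peptides, anchors)
```

In this toy case the unrestricted search considers 10 pairs, whereas the reduced database keeps only the 4 pairs containing PEPA. The gap widens quickly as the number of peptides grows, because the unrestricted count grows quadratically.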
2.10.8. Maximum missed cleavages: Maximum number of missed cleavages allowed during the theoretical digestion of the sequence database.
2.10.9. Minimum and Maximum MH: Minimum and maximum masses of singly-charged peptide ions to be searched.
2.10.10. Peaks Matched cutoff: Minimum number of matching fragments between the theoretical and experimental spectra. Identifications not satisfying this constraint will be discarded.
2.10.11. Merge High Resolution Spectra: Enabling this option will let the search engine merge two or more high-resolution spectra that likely belong to the same precursor. The motivation is that several MS/MS spectra may be acquired from the same precursor during its elution peak; the merged spectrum will have a better signal-to-noise ratio than the individual spectra.
2.10.11.1. Chromatogram Tolerance (seconds): Maximum time difference, in seconds, between tandem mass spectra having the same precursor mass to be merged.
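The grouping step implied by the Chromatogram Tolerance can be sketched as below. How SIM-XL decides that two precursor masses are "the same" and how it combines the peaks of grouped spectra are not described in this protocol; the ppm criterion, the dictionary keys, and the function name are assumptions, and the sketch stops at grouping.

```python
def group_spectra_for_merging(spectra, precursor_ppm, chromatogram_tolerance_s):
    """Group spectra that are close in precursor mass and retention time.

    Each spectrum is a dict with 'rt' (seconds) and 'precursor_mass' (Da).
    A spectrum joins a group if it is within both tolerances of the
    most recently added member of that group.
    """
    groups = []
    for spectrum in sorted(spectra, key=lambda s: s["rt"]):
        for group in groups:
            last = group[-1]
            close_in_time = spectrum["rt"] - last["rt"] <= chromatogram_tolerance_s
            ppm_diff = (abs(spectrum["precursor_mass"] - last["precursor_mass"])
                        / last["precursor_mass"] * 1e6)
            if close_in_time and ppm_diff <= precursor_ppm:
                group.append(spectrum)
                break
        else:
            # No compatible group was found: start a new one.
            groups.append([spectrum])
    return groups


spectra = [
    {"id": "A", "rt": 100.0, "precursor_mass": 2000.000},
    {"id": "B", "rt": 110.0, "precursor_mass": 2000.004},
    {"id": "C", "rt": 300.0, "precursor_mass": 2000.000},
    {"id": "D", "rt": 105.0, "precursor_mass": 1500.000},
]
groups = group_spectra_for_merging(spectra, precursor_ppm=10.0, chromatogram_tolerance_s=30.0)
```

Spectra A and B fall in one group, while C is kept apart because it elutes much later, even though its precursor mass matches.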
Note: The Log field reports notes on the progress of the search.
Figure 5: XL Advanced Parameters tab.
2.11. Once all parameters have been set, we strongly recommend saving them for future searches. This is accomplished by selecting Save SIM-XL Params from the File menu or pressing CTRL + S (Figure 6).
2.12. To load a previously saved search parameter file, select Load SIM-XL Params from the File menu or press ALT + L (Figure 6).
Figure 6: Save or Load SIM-XL params.
2.13. To begin searching, click on the GO button in the XL Search tab.
3. Exploring the results
Note: At this point we recommend saving the results by selecting Save results from the File menu or pressing CTRL + S (Figure 7).
Figure 7: Results Browser’s File menu. Here the user can access many features, such as loading or saving the search results, exporting the 2D-Map to an image or PDF file, exporting the Protein Heat map to an image or Excel© (XLS) file, and printing or exporting the results to a spreadsheet with all protein interactions and their hit details.
3.1.1. The 2D-Map is an interactive map showing all the cross-links identified with a score above the cutoff value given in the Score field. In this map (Figure 8), each protein is represented as a rectangle of size directly proportional to its sequence length, with residue numbers marking the ticks at the bottom. The protein’s ID is shown outside the rectangle, on the left. Each intra-protein cross-link is represented as a red arc. Likewise, inter-protein cross-links are given in blue straight lines. The position of each linker corresponds to the amino-acid number. The user can customize the view by left-clicking on the rectangles and dragging them around, as well as zoom in or out using the zoom bar. By letting the mouse pointer hover over the cross-linker representation, a window will pop up showing linker details such as the linking amino acids and their positions.
3.1.2. By right-clicking on a cross-linker representation, a pop-up window will be displayed showing all the identified cross-links, with corresponding scan numbers, scores, and charge states. The user can then left-click on a desired identification to access the spectrum with the Spectrum Viewer (see item 4).
3.1.3. The 2D-Map can be exported as a PNG, TIFF, or JPG image, or as a PDF file: On the File menu, select 2D-Map, then Save Image (ALT + I), or select Export 2D-Map (ALT + R) (Figure 7).
Figure 8: Protein-protein interaction map (2D-Map).
3.2. Loading results
3.2.1. SIM-XL can load results in the SIM-XL (.simxlr) or mzIdentML 1.2 draft file formats. This can be accomplished in several ways, the easiest one being to double-click on a SIM-XL results file. If the Result Browser window is open, select Load Results from the File menu or press ALT + S, as seen in Figure 7. Otherwise, if the main window is open, select Load SIM-XL Results from the File menu (or press ALT + R), as seen in Figure 6.
We note that mzIdentML results can only be loaded within SIM-XL by accessing the Load Results option, which will open another window where the user can specify both required files: the Result file (.mzIdentML) and the Data file (e.g., .mzML, .MGF, .MS2, or .RAW).
Figure 9: Input file window. SIM-XL accepts the mzIdentML format, in addition to its own format (simxlr).
3.3. Dynamic Result Report
3.3.1. A dynamic report is made available by clicking on the Dynamic Result Report tab (Figure 10). The user can sort/search the results according to user-specified criteria. By double-clicking on an entry, the Spectrum Viewer will open, enabling access to the spectrum for the identification in that entry (see item 4).
Figure 10: Dynamic Result Report.
3.3.1.1. Filtering results
3.3.1.1.1. ScanNumber: In case this field is not empty, only spectra whose scan numbers match that of this field will be displayed.
3.3.1.1.2. Score: Only results containing identification scores greater than or equal to this value will be displayed.
3.3.1.1.3. ppm: Only results containing a ppm less than or equal to this value will be displayed.
3.3.1.1.4. Sequence: Only results from peptides containing the sequence input to this field will be displayed. The user can further specify whether only results from intra-link or inter-link peptides/proteins are to be displayed.
3.3.1.1.5. Assessment: Only results containing a Personal Assessment equal to this value will be displayed.
All criteria specified in these fields will be reflected in all tabs (2D-Map, Dynamic Result Report, and Protein Heat map).
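The combined effect of the filter fields can be expressed as a single predicate. The sketch below is illustrative only: the Identification record and its field names are hypothetical, and comparing the absolute value of the ppm error is an assumption, since the text does not say whether the sign is taken into account.

```python
from dataclasses import dataclass


@dataclass
class Identification:
    """Hypothetical record for one cross-link identification."""
    scan_number: int
    score: float
    ppm: float
    sequences: tuple
    assessment: str = ""


def passes_filters(hit, scan_number=None, min_score=None, max_ppm=None,
                   sequence=None, assessment=None):
    """Return True if the hit satisfies every filter that is set (not None)."""
    if scan_number is not None and hit.scan_number != scan_number:
        return False
    if min_score is not None and hit.score < min_score:
        return False
    if max_ppm is not None and abs(hit.ppm) > max_ppm:  # assumption: absolute error
        return False
    if sequence is not None and not any(sequence in s for s in hit.sequences):
        return False
    if assessment is not None and hit.assessment != assessment:
        return False
    return True


hits = [
    Identification(101, 3.2, 1.5, ("AKLM", "GKR"), "Excellent"),
    Identification(102, 1.1, -4.0, ("PEKT", "GKR")),
    Identification(103, 2.5, 12.0, ("AKLM", "TTKS")),
]
shown = [h.scan_number for h in hits if passes_filters(h, min_score=2.0, max_ppm=5.0)]
```

Filters left empty are simply skipped, so an empty filter set displays every identification.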
Fields in the Dynamic Result Report
3.3.2. The Peptide Sequence field shows the search result’s identified peptides. The amino acids interacting with the cross-linker are shown in a bold typeface. Double-clicking on this field makes the Spectrum Viewer (see item 4) pop up, enabling the user to assess the spectrum that resulted in the respective identification.
3.3.3. The Protein 1 and Protein 2 columns display the protein(s) that contain the identified peptide sequence(s). The remaining proteins having conserved regions containing the peptide(s) can be assessed by double-clicking on one of these columns (Figure 11).
Figure 11: Proteins containing identified peptides window. The header displays the identified peptides: TDEQALLSSILAKTASNIIDVSAADSQGMEQHEYMDR and LAVLSSSLTHWKK. Below, the protein that contains these sequence(s) (PROTEIN1) is listed.
3.3.4. The Upload Spectra column is part of a global effort for improving cross-linker identification scoring functions. By checking beside the desired spectra and then selecting Send Spectra to Server from the Utils menu (or pressing ALT + S), the user will donate the identifications and spectra for further research on the topic.
3.3.5. The Personal Assessment column allows the user to input a personal assessment on the quality of each identification. This is accomplished by selecting from the drop-down list one of the five choices ranging from Excellent to Poor.
3.3.6. At this point, we recommend saving the results once again so that the personal assessments can be included. This is done by selecting Save Results from the File menu or pressing CTRL + S.
3.4. Protein Heat map
3.4.1. The Protein Heat map (Figure 12) displays regions where inter-protein cross-linkers were identified. To generate such a map, two proteins must be selected by using the horizontal-axis and the vertical-axis drop-down lists. The heat map can be limited to desired amino acids by selecting them in the Reaction Site field.
Figure 12: A Protein Heat map showing the interaction regions defined by cross-linkers. The red scale is associated with the number of identified XL spectra. By clicking on a cell, all identifications supporting that interaction will be displayed.
3.4.2. The Protein Heat map can be exported as an image or as a spreadsheet file containing the information about the interactions. To export the map as an image, select Save Image, then Protein Heat map, from the File menu (or press CTRL + I). To export the interaction data, select either Spreadsheet (ALT + T) or Hit Details (i.e., information on the identifications contained in each cell; ALT + H) from the Export Data menu (Figure 7).
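The counts behind a heat map of this kind can be sketched as a tally of identified cross-links per pair of positions. The data layout, the function name, and the optional bin_size parameter are assumptions for illustration; the protocol does not describe how SIM-XL bins residues into cells.

```python
from collections import Counter


def heat_map_counts(cross_links, protein_x, protein_y, bin_size=1):
    """Count identifications per (x-bin, y-bin) cell for two chosen proteins.

    Each cross-link is ((protein, position), (protein, position)) with
    1-based residue positions; links are matched in either orientation.
    """
    counts = Counter()
    for (prot_a, pos_a), (prot_b, pos_b) in cross_links:
        if (prot_a, prot_b) == (protein_x, protein_y):
            x, y = pos_a, pos_b
        elif (prot_b, prot_a) == (protein_x, protein_y):
            x, y = pos_b, pos_a
        else:
            continue  # link does not involve the selected protein pair
        counts[((x - 1) // bin_size, (y - 1) // bin_size)] += 1
    return counts


links = [
    (("A", 10), ("B", 55)),
    (("B", 55), ("A", 10)),
    (("A", 12), ("B", 58)),
    (("A", 10), ("C", 5)),
]
cells = heat_map_counts(links, "A", "B")
binned = heat_map_counts(links, "A", "B", bin_size=10)
```

The value stored in each cell plays the role of the red scale in Figure 12: the more identified XL spectra support a pair of regions, the higher the count.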
3.5. Utils menu
3.5.1. Report Fusion: This option allows merging several SIM-XL results into a single report. To accomplish this, select Report Fusion from the Utils menu (or press ALT + F) and select all files to be joined.
3.5.2. Custom Report Results: This option allows the addition or removal of columns in the Dynamic Result Report. For this, select Custom Report Results from the Utils menu (or press ALT + C). Following that, a new window containing all columns that can be included or removed will be displayed. After checking beside the features of interest, click on the OK button.
4. Spectrum Viewer
4.1. The Spectrum Viewer (Figure 13) displays an annotated XL mass spectrum. The Spectrum View tab allows the user to browse the spectrum, zoom in and out, and easily view which peak was attributed to the corresponding fragment. To zoom in, click and drag the mouse over the desired m/z range (Figure 14). To zoom out, double left-click.
Figure 13: XL Spectrum Viewer. The Spectrum View tab allows the user to browse the spectrum, zoom in and out, as well as easily view which peaks were attributed to which series. A ppm deviation plot is available above the annotated mass spectrum.
Figure 14: Zoom-in on a specific m/z range of the XL mass spectrum.
4.2. A ppm deviation plot is available above the annotated mass spectrum (Figure 13). This plot displays the deviation between the theoretical and experimental peaks, in ppm, along the m/z range. The continuous blue line represents a linear regression of the matched peaks fitted with Random Sample Consensus (RANSAC). The blue dotted lines represent three standard deviations from the regression line. To save this plot, right-click on the image and choose between Copy to clipboard or Save.
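For readers unfamiliar with this kind of plot, the sketch below computes ppm deviations and fits a line with a bare-bones RANSAC loop. It is a generic textbook version under stated assumptions (two-point samples, a fixed inlier threshold in ppm, a fixed random seed) and does not reproduce SIM-XL's actual settings or implementation.

```python
import random


def ppm_error(theoretical_mz, observed_mz):
    """Deviation of an observed peak from its theoretical m/z, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def fit_line(points):
    """Ordinary least-squares line through (x, y) points: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, threshold, iterations=200, seed=0):
    """Fit a line that ignores outliers (needs at least two points).

    Repeatedly fits a line to two random points, keeps the largest set
    of points within `threshold` of that line, then refits on that set.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        sample = rng.sample(points, 2)
        if sample[0][0] == sample[1][0]:
            continue  # vertical pair: no usable line
        slope, intercept = fit_line(sample)
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        best_inliers = list(points)
    return fit_line(best_inliers), best_inliers


# (m/z, ppm deviation) pairs; the last one mimics a wrongly matched peak.
points = [(200, 1.0), (400, 1.2), (600, 1.4), (800, 1.6), (1000, 1.8), (500, 9.0)]
(slope, intercept), inliers = ransac_line(points, threshold=0.5)
```

The mismatched point at 9 ppm is excluded from the final fit, which is the reason a consensus-based regression is preferable to a plain least-squares line when some peak assignments are spurious.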
4.3. The Peptide Annotation tab (Figure 15) shows a fragmentation diagram of the cross-linked peptide. The plot can be saved by right-clicking on the image and choosing between Copy to clipboard or Save.
Figure 15: The Peptide Annotation tab displays the fragmentation diagram of the cross-linked peptides.
4.4. The Spectrum Prediction Parameters tab (Figure 16) provides a table showing all theoretical fragments and their assignments when matched. Matching criteria are shown in the panel on the left. Changes in these parameters, followed by pressing the Plot button, will update the assignments. These features allow the user to verify, for example, the effects of changing the cross-linker position or even, say, to evaluate the impact on the score of oxidizing a methionine or even changing the sequence of the matched peptide(s).
4.4.1. Peptide Sequence 1 and 2: These are the sequences of α and β chains. For intra-links, fill out the Peptide Sequence 1 field only. Any modification mass shift must be enclosed in parentheses after the modified amino acid. For example: oxidation of methionine should be input as ‘M(15.9949)’.
4.4.2. Position XL 1 and 2: These are the positions of the cross-linked residues in both chains. For intra-links, both fields correspond to the cross-linking positions in the α chain.
4.4.3. Deconvoluted MS/MS: Check this option if the spectrum is deconvoluted (see item 2.8).
4.4.4. XL Mass: The cross-linker mass shift (see item 2.4.2.2).
4.4.5. ppm: The tolerance of each spectrum peak match.
4.4.6. Precursor Charge and Precursor Mass: The charge and mass of the precursor ion.
4.4.7. Ion Series: Check the series to be considered by the Spectrum Viewer. For inter-links, both α and β series should be checked.
4.4.8. Click on the Plot button to update the Spectrum Viewer.
4.4.9. The Load Example button loads an example spectrum.
Figure 16: Spectrum Prediction Parameters tab. The user can adjust parameters to check assignments.
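The sequence notation used in the Peptide Sequence fields (item 4.4.1), in which a mass shift follows the modified residue in parentheses, can be parsed with a short sketch such as the one below. The function name and the strictness of the accepted syntax are assumptions; SIM-XL may accept variations not covered here.

```python
import re

# One uppercase residue, optionally followed by a mass shift in parentheses.
_RESIDUE = re.compile(r"([A-Z])(?:\(([-+]?\d+(?:\.\d+)?)\))?")


def parse_modified_sequence(text):
    """Turn 'AM(15.9949)K' into [('A', 0.0), ('M', 15.9949), ('K', 0.0)]."""
    residues = []
    pos = 0
    while pos < len(text):
        match = _RESIDUE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos} in {text!r}")
        mass_shift = float(match.group(2)) if match.group(2) else 0.0
        residues.append((match.group(1), mass_shift))
        pos = match.end()
    return residues


parsed = parse_modified_sequence("AM(15.9949)K")
plain_sequence = "".join(aa for aa, _ in parsed)
total_shift = sum(shift for _, shift in parsed)
```

Separating the plain sequence from the per-residue mass shifts makes it straightforward to recompute fragment masses after, say, adding or removing a methionine oxidation.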
4.5. The Spectrum Plotting Parameters tab (Figure 17) allows the user to enter an individual experimental mass-spectral peak list. The user can also specify the XL reporter ions.
Figure 17: Spectrum Plotting Parameters tab. The user can add individual mass-spectral peak lists to visualize the spectral assignments.
4.6. To save the annotated XL mass spectrum, select Save Spectrum from the File menu or press CTRL + S. An image can be saved by selecting Export Image, then Spectrum Image, from the File menu or pressing CTRL + I. To load the annotated XL mass spectrum, select Load Spectrum from the File menu or press ALT + L.
4.7. The user can customize the XL mass spectrum annotation by selecting Custom Spectrum View from the Utils menu or pressing CTRL + L. A new window will open with several options, as shown in Figure 18.
Figure 18: Custom Spectrum View. Spectrum annotations can be customized by checking the option in this menu.
4.8. By selecting Send Spectrum to Server from the Utils menu (or pressing ALT + S), SIM-XL will upload the annotated spectrum to a server and thus contribute to a global effort aiming towards creating more sophisticated cross-linker scoring functions through machine learning.
The authors thank FAPESP, FAPERJ, CAPES, Universal CNPq, Microsoft Research – Microsoft Azure Research Award, Programa Estratégico de Apoio à Pesquisa em Saúde (PAPES), and Fundação Oswaldo Cruz for financial support. They also thank the PRIDE Team for working together with them to enable SIM-XL to support the next version of mzIdentML.
Diogo B. Lima & Paulo C. Carvalho, Laboratory for Proteomics and Protein Engineering, Carlos Chagas Institute, Fiocruz, Paraná, Brazil
Tatiani B. de Lima & Fabio C. Gozzo, Dalton Mass Spectrometry Laboratory, University of Campinas, São Paulo, Brazil
Tiago S. Balbuena, College of Agricultural and Veterinary Sciences, State University of São Paulo, Jaboticabal, São Paulo, Brazil
Ana Gisele C. Neves-Ferreira, Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Brazil
Valmir C. Barbosa, Systems Engineering and Computer Science Program, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Source: Protocol Exchange (2015) doi:10.1038/protex.2015.015. Originally published online 13 February 2015.