Authors: Burcu Bakir-Gungor & Osman Ugur Sezerman
The identification of the variants that explain familial risk of a specific disease is important since it enables the development of genetic risk prediction tests, diagnosis tools and therapeutical applications. One possible reason of multifactorial diseases is the alterations in the activity of biological pathways, where a series of mutations occur in distinct genes. While each of these variations extends slightly the likelihood of having the disease, they work together to give birth to the perturbations in normal biological processes. We provide a protocol (termed PANOGA, Pathway and Network Oriented GWAS (Genome-wide association study) Analysis) to devise functionally important pathways through the identification of genes within these pathways, where these genes are targeted by single nucleotide polymorphisms (SNPs) obtained from the GWAS analysis. Additionally, PANOGA helps to identify other disease related genes, not targeted by the SNPs, which are also located within these affected pathways. The program accepts tab delimited or excel file containing SNP rsIDs vs. genotypic p-values and is available at: http://akademik.bahcesehir.edu.tr/~bbgungor/panoga_protocol.zip
In this protocol, starting with a list of SNPs found to be associated with a disease in a GWAS, we propose a novel methodology to determine disease related (SNP targeted) pathways through the identification of SNP targeted genes within these pathways. Multiple factors act on complex diseases. Since each factor would have modest effect on the disease development mechanism, it is challenging to identify significant individual factors. One possible reason of multifactorial diseases is the alterations in the activity of biological pathways. In this method, we hypothesize that these factors are crippling similar pathways in individuals. That’s why pathways have higher explanatory power towards understanding disease development mechanism. To this end, our methodology starts with functionalization of several significant SNPs to identify effected genes. We then map these genes to a protein-protein interaction (PPI) network and determine the connected subnetworks targeted by the SNPs. Next, we find the KEGG pathways in these subnetworks and determine the significance of the modifications on these pathways. The pathways are ranked according to the significance scores and are referred as the SNP targeted pathways. The protocol is outlined in Figure 1.
A computer with Windows or Linux OS and internet access.
Hardware requirements: We recommend a 1 GHz CPU or higher, a high-end graphics card, 500MB of available hard disk space, at least 1 GB of free physical RAM and a minimum screen resolution of 1,024×768.
Java 2 platform: Standard Edition, version 5.0 or higher (Java SE 5 or higher). (http://java.sun.com/javase/downloads/index.jsp).
Data files: This protocol begins with a GWAS dataset containing SNP rsIDs vs. genotypic p-values in a tab-delimited text file or excel file, as detailed in Figure 2 . As a result of the preprocessing step of PANOGA, four additional files in SPOT, F-SNP, SNPnexus and SNPinfo input file formats are created. A sample PANOGA input file and SPOT, F-SNP, SNPnexus and SNPinfo input files are made available in PANOGAprotocol/data/ as samplepanogainput.txt, samplespotinput.txt, samplefsnpinput.txt, samplesnpnexusinput.txt, samplesnpinfo_input.txt). In addition to the sample GWAS data, several additional data files are available for readers wishing to follow this protocol as a tutorial:
humanPPI.sif: contains a protein–protein interaction (PPI) network in a sif file format, as detailed in Figure 2 . This file illustrates the SIF, which offers a straightforward means to import networks into Cytoscape as text.
samplespotfsnp_snpnexus.pvals: contains SPOT and F-SNP weighted p-values (Pw-values) for each SNP associated gene. Each of these two Pw values combine functional information of a SNP and the genotypic p-value of a SNP, that is found to be significant in GWAS.
Cytoscape: Cytoscape is an open source network data integration, analysis, and visualization platform. Subnetwork identification and functional enrichment steps of PANOGA protocol are realized by Cytoscape plugins. Hence, to follow PANOGA protocol, users need to install Cytoscape version 2.6.3 by on a local computer by following the steps in Box 2 of the Cytoscape paper, published in Nature protocols. Although Cytoscape has newer versions, jActiveModules and ClueGO plugins are verified to work in Cytoscape version 2.6.3.
External tools: PANOGA utilizes four external web-servers to functionalize SNPs, i.e., SPOT, F-SNP, SNPnexus, SNPinfo; jActiveModules plugin of Cytoscape to identify sub-networks; ClueGO plugin 60 of Cytoscape 68 for functional enrichment of the identified sub-networks. All of these web-servers, programs and plugins are freely available for academic use.
PANOGA: PANOGA is composed of nine consecutive steps, plus a preprocessing step. The preprocessing step of PANOGA is realized by a java script (createpanogainput.jar). SNP functionalization steps of PANOGA are realized via sending the input files into four different web-servers (SPOT, F-SNP, SNPnexus, SNPinfo). The subnetwork identification step of PANOGA is realized by the jActiveModules plugin of Cytoscape. The remaining steps are performed via running java executable programs. The above-mentioned jar files of PANOGA can be downloaded at: http://akademik.bahcesehir.edu.tr/~bbgungor/panoga_protocol.zip
1.Set up necessary environment to run PANOGA (as detailed in EQUIPMENT SETUP).
2.Download the PANOGA files at http://akademik.bahcesehir.edu.tr/~bbgungor/panoga_protocol.zip. Unzip the downloaded PANOGAprotocol.zip file and extract it. The executable jar files of PANOGA are found at: PANOGAprotocol/.
Preprocess GWAS data
3.Pick a disease name for your project, which can be any disease name (e.g., diabetes), not necessarily a standard OMIM disease name. In the following steps of PANOGA procedure, we will refer to this disease name as $DISEASENAME. CRITICAL STEP Do not use space in the $DISEASENAME since it will corrupt the further steps of PANOGA procedure.
4.Create a folder with your disease name under PANOGAprotocol/data/ and under PANOGAprotocol/out/ via typing the following commands:
Replace $DISEASE_NAME above with the disease name that you specified in Step 3.
5.Format GWAS results input file following the instructions in Figure 2 , and save this file under PANOGAprotocol/data/$DISEASENAME/ using any input file name. e.g., PANOGAprotocol/data/diabetes/diabetespanogainput.txt or bipolargwasresult.xls. samplepanogainput.txt file is also provided under: PANOGAprotocol/data/sample/.
6.Run the java script “createpanogainput.jar” to create four separate input files that will be used in SNP to gene assignment and SNP functionalization steps of PANOGA:
Replace $INPUTFILENAME with your input file name, e.g. (diabetespanogainput.txt), $DISEASENAME with your disease name and $PVALUETHRESHOLD with genotypic p-value threshold that you would like to use to restrict your SNPs based on their significance for disease. The default $PVALUE_THRESHOLD is 0.05.
java -jar createpanogainput.jar $INPUTFILENAME $DISEASENAME $PVALUETHRESHOLD
e.g. java -jar createpanogainput.jar samplepanogainput.txt sample 0.05 This run generates $DISEASENAMEspotinput.txt, $DISEASENAMEfsnpinput.txt, $DISEASENAMEsnpnexusinput.txt, $DISEASENAMEsnpinfoinput.txt files under PANOGAprotocol/data/$DISEASENAME.
CRITICAL STEP Using an input filename with an extension other than .txt or .xls interferes this step.
Assign SNPs to Genes
7.PANOGA procedure uses SPOT webserver 49 to assign SNPs to genes. Go to the SPOT webserver at: https://spot.cgsmd.isi.edu/submit.php.
8.Click into “Upload SNP File” button; select SPOT input file, i.e. $DISEASENAMEspot_input.txt.
9.Change “Maximum SNPs to output:” parameter to 50,000 in SPOT webserver.
10.If your $PVALUE_THRESHOLD (from Step 6) is different than 0.05, change it in the “p-value threshold: “ parameter of SPOT webserver.
11.Under “Linkage Disequilibrium (LD) options” select the appropriate HAPMAP sample among the available options in SPOT webserver.
12.Click into “Run” button and download the result under “Primary Results” section. Save the SPOT output as Tab-delimited file under PANOGAprotocol/data/ $DISEASENAME/$DISEASENAMEspot_output.txt.
13.At this step, the users need to choose one of the following two options: option A to proceed with the full PANOGA procedure, including network oriented stages and functional information of SNPs; option B to proceed with only pathway oriented steps of PANOGA procedure. We highly recommend the users to follow the full PANOGA procedure (option A).
java -jar parsespotoutput.jar $DISEASE_NAME
This run will create the gene symbol file ($DISEASENAMEpartialpanogagene_ symbols.txt) under PANOGAprocedure/ClueGO/data/ and $DISEASENAME partialpanogagene2snp.txt file under PANOGAprocedure/data/ $DISEASE_NAME/.
Replace $DISEASE_NAME below with the disease name that you specified in Step 3.
java -jar ClueGOv1.4.command-line.jar -props clueGO.props -file1 data\$DISEASENAMEpartialpanogagenesymbols.txt -analysis-name $DISEASENAMEpartial_panoga -out out
At the end of this step, enrichment results of the gene symbols are saved under PANOGA_procedure/ClueGO/out/
(iii) Run the java script “analyzecluegooutput.jar” to create SNP targeted pathway lists and gene list for identified SNP targeted pathways.
java –jar analyzecluegooutput.jar $DISEASE_NAME
At the end of this step pathway based lists and gene list are created as explained in the “Anticipated Results” section and “PANOGA’s Application to Human Complex Diseases” subsection of Introduction section.
Install Cytoscape and its plugins
14.Install Cytoscape version 2.6.3 by following its installation guide 69. Follow Cytoscape installation instructions to get the executable file. CRITICAL STEP Although Cytoscape has newer versions, jActiveModules and ClueGO plugins are verified to work in Cytoscape version 2.6.3.
15.Install jActiveModules and ClueGO version 1.4 plugins of Cytoscape. These plugins should be installed into Cytoscape_v2.6.3/plugins/ using the following options: option A to install jActiveModules plugin; option B to install ClueGO version 1.4 plugin:
Obtain Functional Information of SNPs
16.PANOGA procedure utilizes SPOT 49, F-SNP 51, SNPnexus 73 and SNPinfo 50 webservers to functionalize SNPs. SNP functional information through SPOT web-server 49 is already obtained in the previous step while assigning SNPs to genes. Run “runfsnp.jar” to obtain SNP functional information from F-SNP webserver 51:
Replace $DISEASE_NAME with the disease name that you specified in Step 3.
java -jar runfsnp.jar $DISEASENAME This step will save the F-SNP output into PANOGAprocedure/data/$DISEASENAME/ $DISEASENAMEfsnpoutput.txt.
17.Go to the SNPnexus webserver at: http://www.snp-nexus.org/. Under “Batch Query” option, Browse SNPnexus input file, i.e. $DISEASENAMEsnpnexus_input.txt.
18.Under “Annotation Categories”-> “Regulatory Elements”, select ”Conserved Transcription Factor Binding Sites (TFBS)” option and click “Run” button.
19.Download the result under “Regulatory Elements” section via clicking into TXT icon. Save the SNPnexus output as text file under PANOGAprocedure/data/$DISEASENAME/$DISEASENAMEsnpnexus_output.txt.
20.Go to the SNPinfo webserver at: http://snpinfo.niehs.nih.gov/snpfunc.htm. Browse and upload SNPinfo input file, i.e. $DISEASENAMEsnpinfo_input.txt.
21.Click “Submit” button and download the results via clicking into “Export To Excel” button under “SNP Function Prediction Results”. Save the SNPInfo output as csv file under PANOGAprocedure/data/$DISEASENAME/$DISEASENAMEsnpinfo_output.csv.
Prepare the Gene Attributes data
22.PANOGA needs the attributes file (in .pvals format) to identify the sub-networks (using jActive Modules plugin 59). This file has two weighted P-values (SPOT Pw and F-SNP Pw values) for each gene, where the weighted P-value combines the genotypic p-value of a SNP with the functional information of a SNP that is associated with the gene. The following steps of the PANOGA procedure will create an attributes file similar to the provided samplepanogaspotfsnp.pvals file at PANOGAprocedure/. Run “combinespotfsnp.jar” to combine SPOT and F-SNP output files: Replace $DISEASE_NAME with the disease name that you specified in Step 3.
java -jar combinespotfsnp.jar $DISEASE_NAME
23.Run “incorporatesnpnexus.jar” to incorporate functional information from SNPnexus. Replace $DISEASE_NAME with the disease name that you specified in Step 3.
java -jar incorporatesnpnexus.jar $DISEASENAME This run will create the gene attributes file ($DISEASENAMEspotfsnpsnpnexus.pvals) under PANOGAprocedure/data/$DISEASE_NAME/.
Obtain network data
24.Decide which human protein-protein interaction (PPI) dataset you would like to use as your initial network—follow option A to use the default human PPI network or option B to use a customized PPI network.
Load network data
25.Start Cytoscape via following option A for Windows users, option B for Linux users.
26.Decide how you would like to load network data into Cytoscape. Cytoscape allows to import networks from a local or remote computer, or from Web Services—follow option A to import a network file from a local computer, option B from a remote computer or option C to use Web Services. We recommend PANOGA users to follow option A to load network data.
Import gene attributes
27.Assign values (two attributes for each identified gene) to nodes (genes) using File->Import->Attribute/Expression Matrix commands of Cytoscape and selecting the gene attributes file ($DISEASENAMEspotfsnpsnpnexus.pvals) that is created in Step 23. A sample gene attributes file (samplespotfsnpsnpnexus.pvals) is also provided at PANOGAprocedure/data/sample/.
28.Start jActiveModules plugin from Cytoscape->Plugins->jActiveModules.
29.Select SPOTPvaluesig and FSScorePvaluesig from Expression Attributes panel.
30.In the General Parameters panel, set “Number of Modules” parameter as 1000. “Overlap Threshold” parameter defines max percent of overlap between any two identified subnetworks. The default value used in PANOGA_protocol is 0.5.
31.Click “Find Modules” to identify active sub-networks.
32.Save the result as text file into PANOGAprocedure/data/$DISEASENAME/$DISEASE NAMEjactivemodulesoutput.txt via clicking into “Save All Results” button on “Results Panel”. Replace $DISEASENAME with the disease name that you specified in Step 3.
Parse jActiveModules output
33.Create a folder with your disease name under PANOGAprotocol/ClueGO/data/ and under PANOGAprotocol/ClueGO/out/ via typing the following commands:
Replace $DISEASE_NAME above with the disease name that you specified in Step 3.
34.Run “parsejactivemodulesoutput.jar” to create individual files containing gene symbols for each identified sub-network: Replace $DISEASE_NAME with the disease name that you specified in Step 3.
java -jar parsejactivemodulesoutput.jar $DISEASENAME At the end of this step, for the sub-networks with scores higher than 3, individual files containing gene symbols are saved under PANOGAprocedure/ClueGO/data/ $DISEASE_NAME/ and the number of subnetworks created is printed on the screen.
Functional enrichment of subnetworks
35.Decide which pathway resource you would like to use for the functional enrichment of the identified subnetworks. ClueGO 60 assigns a set of genes into KEGG 61 or BioCarta pathways—follow option A to assign genes into KEGG pathways, option B to assign genes into Biocarta pathways.
36.At this step, the users need to choose one of the following two options, depending on their operating systems: Windows Users, follow option A; Linux Users, follow option B. For both options, replace $DISEASENAME with the disease name that you specified in Step 3, $NUMBEROF_SUBNETWORKS with the number of subnetworks, as created in Step 34. Type the following command to perform functional enrichment for each of the identified sub-networks using the clueGO.props file created in Step 35:
If java is installed as root user, skip $OPTIONALJAVAPATH and run as: e.g. ./functionalenrichment.sh diabetes 508 If java is already installed on a different path, specify $OPTIONALJAVAPATH and run as: e.g. ./functionalenrichment.sh diabetes 508 ../../jre1.7.004/bin At the end of this step, enrichment results of each of the identified sub-networks are saved under PANOGAprocedure/ClueGO/out/$DISEASE_NAME/ for both options.
Combine functional enrichment results
37.Run the java script “combinesubnetworkpathways.jar” to create SNP targeted pathway lists and gene list for identified SNP targeted pathways. Replace $DISEASENAME with the disease name that you specified in Step 3, $NUMBEROF_SUBNETWORKS with the number of subnetworks, as created in Step 34.
java –jar combinesubnetworkpathways.jar $DISEASENAME $NUMBEROF_SUBNETWORKS At the end of this step pathway based lists and gene list are created as explained in the “Anticipated Results” section and “PANOGA’s Application to Human Complex Diseases” subsection of Introduction section.
Visualize SNP targeted genes in a KEGG pathway map
38.Create a directory under PANOGA_protocol/out/ to store gene attribute files for each pathway, via typing the following command:
Replace $DISEASE_NAME above with the disease name that you specified in Step 3.
39.Run the java script “createattributesforpathwaymap.jar” to create a gene attributes file for each identified pathway, which will be used in the next step to customize KEGG pathway maps. Each pathway specific file contain identified gene symbols and color specifications depending on the number of SNP targeted genes per base pair. Replace $DISEASE_NAME with the disease name that you specified in Step 3.
java –jar createattributesforpathwaymap.jar $DISEASENAME At the end of this step, gene attribute file for each of the identified sub-networks are saved under PANOGAprotocol/out/KeggPathwayMapGeneAttributeFiles/$DISEASE_NAME
40.Color SNP targeted genes for the pathway of interest using the KEGG Mapper – Color Pathway tool available at: http://www.genome.jp/kegg/tool/map_pathway3.html.
41.Type “hsa” followed by the KEGG Term ID for the pathway of interest to the “Select KEGG pathway map:” field. KEGG Term IDs of the pathways can be obtained from the first column of the $DISEASENAMEpathwayssubnetworkgenes.csv file under PANOGAprocedure/out/$DISEASENAME/.
42.Browse gene attribute file created in Step 39 for the pathway of interest.
43.Hit “Execute” button. KEGG Mapper – Color Pathway tool 61 generates a customized pathway map, where the SNP targeted genes are colored based on the number of SNPs per base pair.
The time required to execute this protocol is most strongly related with the time required to run the jActiveModules plugin 59 of Cytoscape 68 to identify subnetworks and to obtain functional information of SNPs from F-SNP webserver 51. With a 1.5 Mbps ADSL connection under favorable operating conditions, it is reported to take approximately 9 min to download Cytoscape 69. For ~30,000 SNPs, it takes approximately 40 secs to get functional information from SPOT webserver 49; 20 secs from SNPinfo webserver 50; 10 mins from SNPnexus webserver 73; and 3 hours from F-SNP webserver 51. On a PC with 500 GB of memory, loading a human PPI network of ~10,000 nodes and ~80,000 edges requires approximately 10 secs; importing a gene attribute file with ~4,000 genes and 2 separate attributes for each gene requires 4secs. On such a network, the subnetwork identification step of PANOGA takes approximately 4 hours once jActiveModules plugin 59 is used with two attributes for each node. On a PC with the same configurations, ClueGO plugin requires approximately 45 mins for functional enrichment of ~500 subnetworks. The running times of each of the unspecified jar programs take less than 5 mins. Overall, an experienced user can execute the full protocol described within 9 hours.
Further instructions on how to increase memory space for Cytoscape are available at: http://www.cytoscape.org/cgi-bin/moin.cgi/How_to_increase_memory_for_Cytoscape.
PANOGA will create pathway and gene tables with several features in .csv format (comma separated values) as shown in Table 1 Table 2 Table 3 and Table 4 , respectively. The files can be opened by Microsoft Excel or Open Office and displayed as spreadsheets. Each row of the pathway spreadsheet will correspond to the features of the identified pathway, i.e., KEGG term, KEGG term ID, p-value, rank, number of times found significant for different subnetworks, number of SNP targeted genes, number of typed SNPs in GWAS that are associated with the genes as part of the pathway under study, number of regulatory SNPs which are also found significant in GWAS, SNP targeted genes and their SNP counts.
Gene table file, as shown in Table 4 , will include different features of the genes that are found as part of the identified pathways. While some of these genes are SNP targeted genes, some others are identified as the neighbours of SNP targeted genes within the generated sub-networks. Each row of the gene spreadsheet will correspond to a gene symbol, entrez gene ID, number of times found in subnetwork, number of associated pathways, list of associated pathways, number of typed SNPs in GWAS, functional information of the typed SNPs in GWAS, SNP regulatory potential, number of regulatory SNPs. If the number of typed SNPs in GWAS is zero, that means this gene is identified through neighbour effect.
PANOGA will also create customized KEGG pathway map for pathway under study, as shown in Figure 3 . In this map, the shade of red color in genes map to the number of targeted SNPs (typed in the GWAS of RA), per base pair of the gene. Hence, this pathway map will help the users to visualize affected genes along different routes within the pathway map.
Figure 1: Outline of PANOGA protocol.
In Step 1, a gene-wise Pw-value for association with disease was computed by integrating functional information. In Step 2, significant Pw-values were loaded as two separate attributes of the genes in a PPI network and visualized using Cytoscape. At this step, active sub-networks of interacting gene products that were also associated with the disease, are identified using jActive Modules plugin. In Step 3, genes in an identified active sub-network were tested whether they are part of functionally important KEGG pathways. Step 4 integrates the functional enrichments of the generated sub-networks.
Figure 1: Outline of PANOGA protocol
Figure 2: PANOGA input files’ formats
GWAS result and Protein-protein interaction network file formats, required by PANOGA protocol.
Figure 3: Customized KEGG pathway map of an identified SNP targeted pathway.
Customized KEGG pathway map for JAK-STAT signaling pathway. The shade of red color in genes map to the number of targeted SNPs (typed in the GWAS), per base pair of the gene. Red refers to the highest targeted gene, whereas white refers to a gene product, not targeted by the SNPs.
Table 1: Pathway based representation of PANOGA results, focusing on SNP targeted genes.
The top 5 SNP targeted KEGG pathways are shown along with their KEGG term IDs, ranks and p-values in the 1st, 3rd and 4th columns, respectively. SNP targeted genes that are identified in PANOGA protocol are shown in the 5th column; along with the number of typed SNPs, shown in paranthesis. For each identified SNP targeted pathway, number of SNP targeted genes, number of associated SNPs in GWAS, number of regulatory GWAS SNPs and how many times this pathway is identified are shown in columns 6 to 9, respectively.
Table 2: Pathway based representation of PANOGA results, focusing on subnetwork genes.
The top 5 SNP targeted KEGG pathways are shown along with their KEGG term IDs, ranks and p-values in the 1st, 3rd and 4th columns, respectively. Pathway associated genes that are found in the subnetworks are shown in the 5th column. While the genes without ‘’ symbol are SNP targeted genes (e.g. JAK2 gene in the Jak-STAT signaling pathway), the genes with ‘’ symbol are identified in the subnetwork due to the neighbour effect (e.g. JAK1 gene in the Jak-STAT signaling pathway). The genes with neighbour effect (not targeted by SNPs) are incorporated using PPI network in the subnetwork identification step of PANOGA and they help to identify SNP targeted pathways, which can not be picked up using SNP targeted genes only. Column 6 displays other members (genes) of the identified snp targeted pathway, that are not found in PANOGA subnetworks.
Table 3: Pathway based representation of PANOGA results, focusing on associated SNPs from GWAS and their associated genes (SNP targeted genes).
The top 5 SNP targeted KEGG pathways are shown along with their KEGG term IDs, ranks and p-values in the 1st, 3rd and 4th columns, respectively. For each identified SNP targeted pathway, column 5 displays pathway associated genes found in subnetworks, along with the rs IDs of the typed SNPs and their functional consequences on the gene, shown in brackets. For example, STAT1 gene is a SNP targeted gene in Jak-STAT signaling pathway. Column 5 indicates that while rs3024912 and rs16833177 are located near the 3’ end of STAT1; rs1914408 and rs6718902 are located near the 5’ end of STAT1; and rs11687659 is located in the intronic region of STAT1.
Table 4: Gene list representation of PANOGA for the identified SNP targeted pathways.
For each SNP targeted gene, number of associated SNPs in GWAS, number of regulatory GWAS SNPs, how many times this pathway is identified, number of associated snp targeted pathways, list of these pathways, associated SNPs from GWAS, functional information regarding associated SNPs from GWAS, SNP regulatory potential and number of Regulatory SNPs are shown in columns 2 to 9, respectively.
A New Methodology to Associate SNPs with Human Diseases According to Their Pathway Related Context. Burcu Bakir-Gungor, Osman Ugur Sezerman, and Joaquín Dopazo. PLoS ONE 6 (10) 25/10/2011 doi:10.1371/journal.pone.0026277
Burcu Bakir-Gungor, Sezerman Lab, Sabanci University
Osman Ugur Sezerman, Sabancı University
Correspondence to: Burcu Bakir-Gungor ([email protected])
Source: Protocol Exchange (2012) doi:10.1038/protex.2012.019. Originally published online 29 May 2012.