Authors: Haseeb Ahmad Khan
Gene expression profiling could help physicians to better understand cellular morphology, resistance to chemotherapy and the overall clinical outcome of disease [1,2]. Such individualized treatment may significantly increase survival by optimizing the treatment procedure according to clinical pathogenesis. Ein-Dor et al. [3] have pointed out that the gene sets selected for the same clinical types of patients by different groups differed widely and had only a few genes in common. This lack of overlap between predictive signatures from different studies with the same goal may be explained by the presence of more predictive genes than are required to design an accurate predictor [4]. However, the microarray technique itself has been shown to be highly reproducible within and across two high-volume laboratories [5]. Numerous statistical procedures, including the t-test [6,7], analysis of variance [8], Pearson correlation [9], the Wilcoxon signed-rank test [10,11] and the Mann-Whitney U test [12,13], have been used to compare microarray data. However, the validity of conventional statistical methods for two-group comparison of gene signatures has never been evaluated using carefully selected data sets. A novel algorithm with software support is presented here for a more realistic and comprehensive interpretation of gene signatures.
Computational method and theory of CalcHEPI
The formula used for computation of the HEPI score is $\mathrm{HEPI} = \sum \left[ \left( N_{i\,(0 \to t)} \times S_{j\,(0 \to 1)} \right) / N_t \right] \times 100$, where $N_i$ is the number of genes with score $S_j$. The subscript ‘i’ may vary between 0 and the total number of genes in the signature, and ‘j’ may vary between 0 (minimum score) and 1 (maximum score). $N_t$ is the total number of genes in the signature. First, all the ratios of expression data are categorized according to a logical scale to obtain the respective $N_i$ and $S_j$ values. The percent contribution of each set of genes (genes with the same expression score) is computed, and these contributions are then summed to give the HEPI score. The fold-change strategy used in HEPI scores is robust, accurate and reproducible. Although the concept of fold-change has been described in microarray experiments, it has never been used for collective interpretation of gene signatures. Technically, the ratio of the color intensity of each spot (probe) measures the relative expression of the corresponding gene under two different experimental conditions. In general, a gene is said to be differentially expressed if the ratio of its expression levels between the control and treated groups exceeds certain thresholds. The most widely accepted expression ratios for up- and down-regulated genes have been suggested as >1.5 and <0.5, respectively [11,17,18]. While adopting the same cut-off values, this protocol proposes additional sub-grading. HEPI scores are simple to interpret, easy to compare and well suited to visual cross-checking.
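The computation described above can be sketched in Python. This is only an illustration, not the CalcHEPI macro itself: the protocol gives the outer cut-offs (>1.5 and <0.5) and states that scores lie between 0 and 1, but the inner grade boundaries and the score assigned to each grade in `UP_GRADES` and `DOWN_GRADES` are hypothetical values chosen for this example, as are the function names `score_ratio` and `hepi`.

```python
from collections import Counter

# Hypothetical sub-grading: only the outer cut-offs (1.5 and 0.5) come from
# the protocol; the other boundaries and all score values are assumptions.
UP_GRADES = [(5.0, 1.0), (4.0, 0.8), (3.0, 0.6), (2.0, 0.4), (1.5, 0.2)]    # ratio > bound
DOWN_GRADES = [(0.1, 1.0), (0.2, 0.8), (0.3, 0.6), (0.4, 0.4), (0.5, 0.2)]  # ratio < bound


def score_ratio(ratio):
    """Map one expression ratio to a score S_j between 0 and 1."""
    for bound, score in UP_GRADES:      # checked from the strongest grade down
        if ratio > bound:
            return score
    for bound, score in DOWN_GRADES:    # checked from the strongest grade down
        if ratio < bound:
            return score
    return 0.0                          # within 0.5..1.5: not differentially expressed


def hepi(control, treated):
    """HEPI = sum(N_i * S_j / N_t) * 100 over all score categories.

    Expression values are assumed to be strictly positive.
    """
    if len(control) != len(treated) or not control:
        raise ValueError("control and treated must be paired and non-empty")
    ratios = [t / c for c, t in zip(control, treated)]
    counts = Counter(score_ratio(r) for r in ratios)   # S_j -> N_i
    n_total = len(ratios)                              # N_t
    return 100.0 * sum(n_i * s_j for s_j, n_i in counts.items()) / n_total


# Two of four genes change strongly (ratios 6 and 0.05), two do not change.
print(hepi([10, 10, 10, 10], [10, 60, 0.5, 10]))  # 50.0
```

Because each category contributes its share of genes weighted by its score, identical groups give a score of 0 and a signature in which every gene falls in the top grade gives 100, whatever the number of genes.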
Software design
CalcHEPI software has been developed on the Microsoft Excel platform because of Excel’s flexibility, universal availability and macro-based automation. Moreover, the spreadsheet layout of Excel is well suited to storing and analyzing microarray data as well as to developing microarray analysis software. Data selection is controlled by an input box that allows users to select the paired expression values from anywhere in the worksheet (Fig. 1). The software then uses Excel’s worksheet formula functions together with a macro subroutine to compute HEPI scores (Fig. 2). The percent contribution of norm-regulated (green), down-regulated (blue) and up-regulated (red) genes is also shown as a color-coded bar. The output provides a comprehensive understanding of the results in terms of both qualitative (up- or down-regulation) and quantitative (gradation in fold-change) analysis of the gene signature, together with the quick-review indicator bar. The clarity and integrity of the report format are helpful for cross-evaluation. HEPI scores are valid for any size of array signature because they are calculated according to the percentage (not the number) of differentially expressed genes on a 10-point scale (5 points for up-regulation and 5 for down-regulation).
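The breakdown behind the color-coded bar can be approximated as follows. This is a rough Python sketch, not the Excel macro: the cut-offs 1.5 and 0.5 are those quoted in this protocol, while the function names `regulation_breakdown` and `text_bar` and the letters N, D and U (standing in for green, blue and red) are assumptions made for illustration.

```python
def regulation_breakdown(ratios, up_cutoff=1.5, down_cutoff=0.5):
    """Return the percentage of norm-, down- and up-regulated genes."""
    n = len(ratios)
    if n == 0:
        raise ValueError("at least one expression ratio is required")
    up = sum(1 for r in ratios if r > up_cutoff)
    down = sum(1 for r in ratios if r < down_cutoff)
    norm = n - up - down
    return {
        "norm": 100.0 * norm / n,
        "down": 100.0 * down / n,
        "up": 100.0 * up / n,
    }


def text_bar(breakdown, width=40):
    """Render the breakdown as a fixed-width bar of N, D and U characters."""
    # Round the cumulative edges so the three segments always fill the bar.
    edge_norm = round(width * breakdown["norm"] / 100.0)
    edge_down = round(width * (breakdown["norm"] + breakdown["down"]) / 100.0)
    return ("N" * edge_norm
            + "D" * (edge_down - edge_norm)
            + "U" * (width - edge_down))


shares = regulation_breakdown([1.0, 2.0, 0.2, 3.0])
print(shares)                     # {'norm': 25.0, 'down': 25.0, 'up': 50.0}
print(text_bar(shares, width=8))  # NNDDUUUU
```

Working in percentages rather than gene counts is what allows signatures of different sizes to be compared on the same bar.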
Software validation
The functional accuracy and reliability of the software have been validated using simulated and real gene signature data for two-group comparisons. Six pairs of expression data were specifically designed to represent various degrees of similarity or difference (details not shown). Among them, the two groups in pair 4 are not significantly different, whereas the groups in pair 6 differ maximally. All six pairs were subjected to nonparametric comparisons with the Mann-Whitney U test, Kolmogorov-Smirnov test, Kruskal-Wallis test, Wilcoxon signed-rank test, Sign test, Friedman test and Kendall W test using SPSS (Version 10). The real expression data of published signatures, including ovarian carcinoma [14], ulcerative colitis [15], leukemia [16] and adenocarcinoma [6], were also analyzed with the above nonparametric tests as well as with CalcHEPI. The characteristics of these real signatures have been summarized in our earlier report [10].
A personal computer with the Microsoft Excel program installed.
Installation of CalcHEPI Add-in
Running the program
Once the Add-in has been installed, the HEPI score is computed immediately after the desired gene expression data are selected.
Installation of Add-in for Excel 2007
The instructions for Add-in installation given above are valid for Excel 2003. To install the Add-in in Excel 2007, follow these steps:
Troubleshooting
So far, I have not faced any problem with installing or running this Add-in. However, users are requested to contact the author if they encounter any problem associated with this software.
The anticipated results format is shown in Fig. 2. For users’ information, the results of software validation using simulated and real signatures clearly demonstrated the incompatibility of conventional statistical methods with the comparison of gene expression data. Paradoxical outcomes were observed when the 6 simulated gene signatures were compared using 7 nonparametric tests (Table 1). Five tests (the Mann-Whitney U test, Kruskal-Wallis test, Sign test, Friedman test and Kendall W test) yielded identical but logically unrealistic P values for all these signature pairs. Surprisingly, these tests showed P = 1 for a gene signature with maximum difference (Pair 6, HEPI = 100) and P = 0.001 for a signature with a slight difference (Pair 5, HEPI = 4) (Table 1). The remaining two tests, the Kolmogorov-Smirnov test and the Wilcoxon signed-rank test, also failed to handle these comparisons effectively. For instance, the Kolmogorov-Smirnov test gave P = 0.001 both for similar groups (Pair 4, HEPI = 0) and for the groups with maximum difference (Pair 6, HEPI = 100). The Wilcoxon signed-rank test showed ambiguous results for Pairs 3 and 5 (Table 1). Statistical inconsistency also prevailed when real signatures were compared (Table 1), casting doubt on the reliability of nonparametric tests for two-group comparison of gene signatures.
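One way in which a paired test can behave like this is sketched below with an exact two-sided sign test written in plain Python. The simulated pairs used in the validation are not shown in this protocol, so the expression values here are hypothetical and chosen only to illustrate the effect; the function name `sign_test_p` is likewise an assumption. The sign test looks only at the direction of each change, not at its size, so a signature with large changes split evenly between up- and down-regulation can give P = 1, whereas a signature in which every gene shifts slightly in the same direction gives a small P value.

```python
from math import comb


def sign_test_p(control, treated):
    """Exact two-sided sign test on paired values (ties are dropped)."""
    diffs = [t - c for c, t in zip(control, treated) if t != c]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(1 for d in diffs if d > 0)  # number of increases
    # Binomial(n, 0.5) tail probabilities for k or fewer / k or more increases.
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * min(lower, upper))


control = [100] * 10                  # hypothetical expression values
large_change = [1000] * 5 + [10] * 5  # 10-fold up in 5 genes, 10-fold down in 5
slight_change = [105] * 10            # every gene up by only 5%

print(sign_test_p(control, large_change))   # 1.0
print(sign_test_p(control, slight_change))  # 0.001953125
```

A fold-change score does not depend on the direction of change in this way, because up- and down-regulated genes both add to the total rather than cancelling each other.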
Thus, conventional statistical methods may not be able to handle the peculiarities of microarray expression data, particularly for two-group comparison of gene signatures. More accurate and unified statistical methods and/or coding systems are therefore needed to ensure routine and uniform clinical application of gene signatures. CalcHEPI is one such effort that may serve as a convenient and robust tool for two-group comparison of gene signatures.
The author is highly thankful to the research groups of Dr. Kai Wang (Chiroscience R&D, Inc., Bothell, WA, USA); Dr. Thomas P. Dooley (IntegriDerm Inc., Birmingham, Alabama, USA); Dr. Todd R. Golub (Massachusetts Institute of Technology, Cambridge, MA, USA) and Dr. Daniel A. Notterman (Princeton University, Princeton, NJ, USA), whose published data were used to validate the CalcHEPI protocol.
Figure 1: Data input box in the Excel worksheet.
Excel worksheet displaying a portion of gene expression data of signature-D as well as the functioning of CalcHEPI software. The figure also shows the selection of paired gene expression values for all 66 genes (Cell L2 to Cell M67). Clicking the ‘OK’ button executes the program and the results are displayed (as shown in Fig. 2).
Figure 2: Results window of CalcHEPI software.
A representative output of the results for ‘Signature-D’. Color bar represents percent contribution of norm-regulated (green, absent in this case), down-regulated (blue) and up-regulated (red) genes.
Table 1: Software validation for two-group comparisons of (A) simulated gene expression data and (B) real gene signatures using different statistical methods and CalcHEPI.
Owing to the peculiarity of expression data, all the conventional statistical tests appear to be invalid for two-group comparisons. A huge disparity in P values can be seen while using different statistical tests. However, HEPI provides a realistic quantitative evaluation of differential gene expression with in-depth information about expression pattern (Fig. 2).
Haseeb Ahmad Khan, King Saud University, Riyadh, Saudi Arabia
Source: Protocol Exchange (2009) doi:10.1038/nprot.2009.106. Originally published online 22 May 2009.