Genetics and Genomics Computational Biology

scientificprotocols authored over 7 years ago

Authors: Anguraj Sadanandam, Eric Collisson, William Gibb, Douglas Hanahan & Joe Gray


In this protocol, we describe methodologies used in the Nature Medicine manuscript “Subtypes of Pancreatic Ductal Adenocarcinoma and their Differing Response to Therapy”. Herein, we describe the computational pipeline used to identify subtypes of pancreatic ductal adenocarcinoma and their associated gene signature using integrated analysis of various gene expression profiles with clinical outcome.


Pancreatic ductal adenocarcinoma (PDA) is an invariably lethal disease. There is variable and often disappointing responses seen when deploying the standard treatment, gemicitabine alone or in combination with other chemotherapeutic agents in unselected PDA populations. Seeking to extend the new paradigm of personalized medicine to PDA, we performed an integrative analysis of transcriptional profiles of human primary PDA tumors (n = 143) and cell lines (n = 35). We have defined 62 gene (PDAssigner) signature that classifies the human PDA tumors into three subtypes (classical, QM-PDA and exocrine-like tumors) and present evidence that two (classical and QM-PDA) of these three are present in human and mouse PDA cell lines. Interestingly, we found that PDA subtype classification is an independent predictor of overall survival, and patients with classical PDA subtype tumors have significantly better survival than those with QM-PDA tumors. Intriguingly, treatment of PDA cell lines with gemcitabine or erlotinib shows significantly differential and opposing responses between classical and QM-PDA subtypes. Herein, we describe computational protocols for a broadly integrative approach to subtype discovery in cancer and analysis that identifies gene expression signature with prognostic and possible predictive significance in primary patient samples.

Non-Matrix Factorization Algorithm.

Instead of characterizing gene expression profiles in terms of thousands of genes, non-negative matrix factorization (NMF) attempts to find a reduced-order model that describes the principal data structures in terms of a relatively small number of coordinates. In this regard, NMF is similar to more familiar dimensionality reduction methods, such as principal component analysis, except that NMF’s non-negativity constraint produces a “parts-based” composite picture of the overall gene expression profile. It has been suggested that this composite picture lends itself to biological interpretation more readily than other dimensionality reduction methods (1).

Because the focus of NMF is to capture the principal gene patterns, it is not generally possible to recapitulate all details of the original gene expression profile based solely on the reduced-order, NMF model. That is to say there will be differences between the original profile and the recapitulated profile. Moreover, for any given data set, the reduced-order model derived from the NMF algorithm will not be unique—there will be slight differences in the model depending on the initial conditions used by the NMF algorithm to iteratively search for the best (i.e. most representative) model. However, when the gene expression clusters are well-differentiated in the data, the model will be relatively insensitive to the initial conditions used in the model search; hence differences between one model and another will be small.

As with most dimensionality reduction methods, one can make use of the reduced-order, NMF model to build a pattern classifier. The pattern classifier can then be used to classify each of the samples into subtypes. If the gene expression clusters are well-separated in the data set, then it follows from the above discussion that there will be only slight differences in the respective pattern classifiers derived from the models, hence there will be little variation on the subtype assignment. Consensus matrices visually capture this variation by color-coding the frequency with which any two samples appear in the same cluster. The frequency can vary from 0% of the pattern classifiers (blue) in which case the two samples never appear in the same cluster, to 100% of the pattern classifiers (red) in which case the two samples always appear in the same cluster.

While the consensus plot provides a visually understandable graphical representation of clustering robustness, for the purpose of selecting the order of the reduced-order model, it is desirable to have a scalar quantity that can be used to compare overall robustness for differing numbers of clusters. The cophenetic coefficient (1) accomplishes this by quantifying the diffuseness of the boundaries between blue and red regions in the consensus plot. Predominantly sharp, crisp boundaries yield a cophenetic coefficient near 1 and correspond to well-differentiated sample clusters, whereas diffuse boundaries yield a cophenetic coefficient closer to 0 and correspond to poorly-differentiated sample clusters.


Part I. Computational Analysis for Subtype Identification

In this section, we will describe the methods used to preprocess the microarray data, merge different gene expression microarray datasets, NMF analysis for the identification of subtypes and SAM analysis to identify subtype-specific gene signature.

A. Preprocessing of Microarray Data

Preprocess the CEL files from different platforms of Affymetrix microarrays before further analysis. Use R (frame work for statistical computing) and Bioconductor (2) (open source software for bioinformatics) for the preprocessing.

1. Preprocessing using Bioconductor package “affy”. Make sure that the required CEL files are in the same directory where R has been started. The R code for this step is as follows:

Code for preprocessing CEL files by robust multiarray (RMA) statistics


data <- ReadAffy()

data_rma <- rma(data)

dataexp <- exprs(datarma)

Note: The expression values are in log2 scale.

Code for mapping probesets to gene symbol


x = unlist(as.list(hgu133plus2SYMBOL)

Note that here “hgu133plu2.db” has been used as an example for those microarrays done using Affymetrix GeneChip Human U133Plus 2.0 array. Install and include the relevant library depending on the type of Affymetrix array used. x will contain the gene symbol corresponding each probe.

dataexp1 <- cbind(x, data_exp)



This will provide a table of RMA processed gene expression profile data with gene symbols. The samples are in columns and genes are in rows.

2.Correcting batch effect. We have microarrays from microdissected (UCSF) PDA performed in different batches (years – 2006, 2007 and 2008). Correct batch effects using R based software ComBat (3). Prepare two files – gene expression index file and sample information file. Gene expression index file is a delimited text file with first column containing gene names or probe IDs and rest of the columns with samples and gene expression indices. Sample information file contains first column with Array name, second column with sample name and third column with batch numbers (eg., 1, 2, ...).

Download and install ComBat as described in the website. Use ComBat as follows.

Note: Make sure to execute ComBat in its installed directory or include the directory in the Source comment below.


ComBat(“your expression file name”, “your sample information”)

The results with batch adjusted data will be written in the same directory.

3.Assessing Array Quality. We performed normalized unscaled standard error (NUSE) (4) to assess the quality of the arrays. NUSE is available as a part of affyPLM (a model based quality control assessment of Affymetrix GeneChips) Bioconductor package. Assess microarray quality as follows.




Note: Make sure that the required CEL files are in the same directory where R has been started.


Note – fitPLM fits a specified robust linear model to the probe level data. It calculates the standard error (SE) estimates for each gene on each array.

NUSE(Pset, main = “NUSE”)

NUSE accounts for differences in variability between genes across arrays. It standardizes the SE for each gene on each array calculated from fitPLM such that the median SE for that gene is 1 across all arrays. The above comment will produce a graph with y-axis representing the median SE for each array and helps to compare across arrays. Arrays with median SE or NUSE score of lesser than 1 + 0.25 and greater than 1 – 0.25 were further considered as better quality microarrays for our study.

In addition, obtain the NUSE statistics using the following R code.

NUSE(Pset, type = “stats”)

Those arrays with specified NUSE criteria can be used for further analysis.

4.Merging of different microarray datasets. We used distance weighted discrimination (DWD) (5 6) algorithm to merge different microarray datasets from different microarray platforms or laboratory. For this purpose, download Java based DWD and install it as described on the DWD description page. Before DWD-merging, screen each dataset for probesets/genes with standard deviation (SD) greater than 0.8. Also, select unique genes in the case where many probesets represent a single gene. This can be done by selecting only the gene-specific probe with the highest SD among all probes for that given gene. Additionally, remove those probes that do not map to any gene. The R code for this screening is available in the Attachments or Figures section of this article. You can download and execute accordingly. Column (samples) normalize (N(0,1)) and row (probe or gene) normalize (by median centering) each dataset. Use the R code below for normalizing and merging datasets using DWD.


dataUnique<-read.delim(“file name”, stringsAsFactors=FALSE)

Note – dataUnique should read the file generated from above SD filtering with unique genes as described above. The columns of the tab-delimited file should be samples (except first column being gene symbols) and rows should be genes.


Column normalization

colMax <- apply(dataUnique,2,max, na.rm=TRUE);

colMin <- apply(dataUnique,2,min, na.rm=TRUE);

scaleFactor <- 1.0/(colMax-colMin);

_for (i in 1:dim(dataUnique) [ 2]) {_

dataUnique[,i] <- (dataUnique[,i]-colMin[i])scaleFactor[i]*;


Row median centering

rowMed <- apply(dataUnique,1,median, na.rm=TRUE);


outFile1<- paste(strsplit(exprFile,”.txt”),”sd”,sdCutoff,”colNormrow_Med”,”.txt”,sep=”“);


colnames(dataUnique1) [ 1]<-”Genes”

write.table(file=”outFile”, dataUnique1, quote=FALSE, row.names=FALSE, sep=”\t”);


Note: Ensure that the final output file is numeric (e.g., avoid 1e-2, instead 0.01).

Merge the two individual SD screened and preprocessed datasets using installed Java based DWD. Remove the second row with dataset identifier added by DWD in the output file. Perform row median centering of the merged dataset before proceeding to the next step.

5.NMF analysis. The subtypes of PDA were identified in merged and row median centered microarray datasets using the non-negative matrix factorization (NMF) algorithm (1). R code for the NMF algorithm was downloaded from GenePattern website at the Broad Institute. Perform the NMF analysis as follows:


Note: make sure to include the directory where NMF R code is located.

nmfconsensus<-(input.ds=”merged file name in tab-delimited format”,k.init=2,,num.clusterings=20,maxniter=500, error.function=”euclidean”)

The output graphics and other required files will be saved in the directory where the code is executed. The “consensus.all.k.plot.pdf” file provides a sample x sample consensus matrix for different k’s and cophentic coefficient plot. Files with names ending with ”.gct” provide the classes for each sample.

SAM Analysis. Significant analysis of microarrays (7) was performed using a Bioconductor package, siggenes. The analysis was performed to identify gene signatures specific to each PDA subtypes identified by NMF analysis. Perform the analysis as follows:


dataMerged<-read.delim(“DWD merged file name”, stringsAsFactors=FALSE)

Numeric matrix was generated from above file.

dataMergedm<-matrix(as.numeric(unlist(dataMerged ( ,2:dim(dataMerged) ( 2)))),nrow=dim(dataMerged) ( 1), ncol=(dim(dataMerged) ( 2))-1)_

Note – The file used for this purpose is DWD merged files in tab-delimited format.

colnames(datam)<-colnames(data)(2:(dim(dataMerged) rownames(datam)<-data(,1)

Reading the classes for each sample from NMF output file

con<-read.delim(“consensus.k.3.gct”, stringsAsFactors=FALSE)





Note – number “12.2” is an example of delta value from SAM analysis and it varies from analysis to analysis and can be chosen based on false discovery rate (FDR) and false positive scores. We chose FDR < 0.001 and false positive being almost zero.

Final dataset with SAM selected genes

dataMergedgenes<-cbind(rownames(dataMergedm1),dataMerged_m1(genes, ))

write.table(dataMerged_genes,”SAM genes selected dataset.txt”, sep=”\t”, quote=FALSE,row.names=FALSE)

Finally, generate a Heatmap with above data file using Cluster software (8) and viewed using GenePattern’s Hierarchical Clustering Viewer (9)

Map the signatures from SAM onto other datasets as follows:

data_others<-read.delim(“other datasets name.txt”, stringsAsFactors=FALSE




write.table(data_others,”other datasets mapped with PDAssigner genes.txt”,sep=”\t”,quote=FALSE,row.names=FALSE)

Perform the Hierarchical clustering using the procedure described above.

Part II. Statistics for Clinical Outcome.

Use overall survival data from patients for univariate and multivariate analysis.

Use the log-rank test for univariate associations with survival. Perform survival (Kaplan-Meier) curve and log-rank test using web-based GenePattern software SurvivalCurve and SurvivalDifference, respectively. Provide a tab-delimited file with sample id, class for each sample (subtype information in our case), censor (0 – censored and 1 – not censored) and overall survival (in days) as an input for the program.

Use the Cox proportional hazards model for multivariate modeling of survival using the R package survival.


sur<-read.delim(“survival.txt”, stringsAsFactors=FALSE)

Note: survival.txt should have one column each for sample name, class or subtype information, censored data (0 – censor, 1 – not censored), time (in days, weeks or years), grade, stage in the same order.

coxph(formula = Surv(as.numeric(sur(,4)), as.numeric(sur(,3)))~as.factor(sur(,2)), data = sur)

where sur(,4) should be time, sur(,3) should be censored data (0 or 1), sur(,2) should be class or subtype (if not in factors, as.factor will convert to factors). Similarly, grade or stage can be analyzed.

Apply Fisher’s exact test (available in R package, stats) to investigate the relationships among subtype, stage and grade.


  1. Brunet, J.P., Tamayo, P., Golub, T.R. & Mesirov, J.P. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 101, 4164-4169 (2004).
  2. Gentleman, R.C., et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004).
  3. Johnson, W.E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-127 (2007).
  4. Bolstad, B.M., et al. Quality control of Affymetrix GeneChip data. in Bioinformatics and Computational Biology Solutions using R and Bioconductor (eds. Gentleman, R., Carey, V., Dudoit, S., Irizarry, R. & Huber, W.) (Springer, New York, 2005).
  5. Herschkowitz, J.I., et al. Identification of conserved gene expression features between murine mammary carcinoma models and human breast tumors. Genome Biol 8, R76 (2007).
  6. Benito, M., et al. Adjustment of systematic microarray data biases. Bioinformatics 20, 105-114 (2004).
  7. Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98, 5116-5121 (2001).
  8. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-14868 (1998).
  9. Kuehn, H., Liberzon, A., Reich, M. & Mesirov, J.P. Using GenePattern for gene expression analysis. Curr Protoc Bioinformatics Chapter 7, Unit 7 12 (2008).


ScreenExpression.r: To screen variable and unique genes

Download ScreenExpression.r

This R code can be used to screen for those genes with standard deviation greater than certain cutoff. In addition, in cases where multiple probes are available for a single gene, it selects only one probe with highest standard deviation for that gene.

Associated Publications

Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy. Eric A Collisson, Anguraj Sadanandam, Peter Olson, William J Gibb, Morgan Truitt, Shenda Gu, Janine Cooc, Jennifer Weinkle, Grace E Kim, Lakshmi Jakkula, Heidi S Feiler, Andrew H Ko, Adam B Olshen, Kathleen L Danenberg, Margaret A Tempero, Paul T Spellman, Douglas Hanahan, and Joe W Gray. Nature Medicine 03/04/2011 doi:10.1038/nm.2344

Author information

Anguraj Sadanandam & Douglas Hanahan, Ecole Polytechnique Federale de Lausanne

Eric Collisson, University of California at San Francisco

William Gibb, Genomic Health

Joe Gray, Oregon Health and Science University

Competing financial interests: William Gibb is an employee of Genomic Health.

Correspondence to: Joe Gray ([email protected])

Source: Protocol Exchange (2011) doi:10.1038/protex.2011.231. Originally published online 4 April 2011.

Average rating 0 ratings