Genetics and Genomics Computational Biology Microbiology

scientificprotocols authored over 3 years ago

Authors: James Robert White & W. Florian Fricke


Next-generation sequencing has been successfully used to characterize microbial communities based on the amplification and sequencing of phylogenetic marker genes, e.g. the 16S rRNA gene. In comparison to 16S rRNA amplicon sequencing-based procedures for bacterial and archaeal microbiota analysis, few comparable protocols have been made available to study fungal organisms. Here we describe the CloVR-ITS protocol for fungal microbiota analysis using internal transcript spacer (ITS) amplicon sequencing. CloVR-ITS includes well known bioinformatic tools for alpha and beta diversity analyses, suitable to process even large sequence datasets:

  • A) QIIME (1) for sequence processing and beta phylogenetic analysis using different methods including UCLUST (2);
  • B) UCHIME (3) for rapid identification of chimeric sequences;
  • C) Mothur [4] for alpha diversity and ecological parameter calculations;
  • D) BLASTN (5) for taxonomic sequence assignments using custom databases;
  • E) Metastats [6] and custom R scripts for statistical and graphical evaluations.

CloVR-ITS accepts as input either a single raw multiplex 454-pyrosequencer output file (i.e. pooled barcoded sequences from multiple samples), or alternatively, pre-processed sequences from multiple samples in separate files. CloVR-ITS is available as part of the CloVR package (


Table 1

Table 2


A. Requirements for pipeline Input

To run the full CloVR-ITS analysis pipeline, at least two different inputs have to be provided by the user: one or multiple sequence file(s) and a samplemetadata mapping file. Sequence data may consist of a single fasta file that contains pooled and barcoded sequences from multiple samples, or multiplepre-processed fasta files with each sample being represented by a single fasta file. No two fasta headers within any submitted file may be identical. Themapping file provides sample-associated metadata information used for beta diversity analysis.

A.1 Mapping file requirements for single sequence file input

Example for tab-delimited table to be used as mapping file in combination with a single sequence input file (pooled, barcoded, unprocessed 454 sequences),in QIIME format:

#SampleID BarcodeSequence LinkerPrimerSequence Treatment_p Description





The following rules apply:

1.All entries are tab-delimited.

2.All entries in every column are defined (no empty fields).

3.The header line begins with the following fields:


4.The header line must end with the field Description, i.e. the total number of columns is four or more.

5.The BarcodeSequence and LinkerPrimerSequence fields have valid IUPAC DNA characters.

6.There are no duplicate header fields and no duplicate entries in the #SampleID column.

7.No header fields or corresponding entries contain invalid characters (only alphanumeric and underscore characters allowed).

8.There are no duplicates when the primer and barcodes are appended.

A.2 Mapping file requirements for multiple sequence file input

Example for tab-delimited table to be used as mapping file in combination with multiple sequence input files (one FASTA file per sample):

#File SampleName php Treatment Temperaturep Description

A.fasta sampleA high control mild patientA

B.fasta sampleB high sick medium patientB

C.fasta sampleC low treated high patientC

The following rules apply:

1.All entries are tab-delimited.

2.All entries in every column are defined (no empty fields).

3.The header line begins with: #FileSampleName.

4.The #File column contains the names of all input fasta files and does not contain duplicate entries.

5.There are no duplicate header fields.

6.No header fields or corresponding entries contain invalid characters (only alphanumeric and underscores characters allowed).

A.3 Pairwise comparisons with Metastats

To utilize the Metastats statistical methodology, which detects differential abundances of taxa between two sample groups, the associated header field mustend with ”p”, (e.g. “Treatmentp”, or “php”). If a header with the ”p” ending exists, pairwise Metastats calculations will be carried out between allgroups specified in the corresponding column (provided that a group contains at least three samples).

A.4 Providing quality scores with sequence data

To include quality scores as input, for each input fasta file .fasta there must exist a separate quality score file .qual. Forexample, if the input fasta files are A.fasta, B.fasta and C.fasta, then there must also exist A.qual, B.qual, and C.qual for quality filtering to beperformed[1]. The quality score files are tagged similarly to the input fasta files before starting a pipeline.

B. Sequence preprocessing

Input data are initially assessed for quality and chimeric sequences. Problematic sequences are removed before subsequent processing.

B.1 File consistency check

All input fasta files are first checked for consistency with the input mapping file. If a fasta file listed within the mapping file does not exist, or ifan input fasta file is not listed in the mapping file, the pipeline will halt with an error. Likewise consistency is checked for any input quality scorefiles.

B.2 Splitting by samples and quality filtering

To check each read from the sequence pool for quality and to sort sequences based on the sample-specific barcodes, the QIIME script isused with the following parameters:—min-seq-length 100(sets the minimum sequence length to 100 bp)—max-seq-length 2000(sets the minimum sequence length to 2000 bp)—barcode-type variablelength(disables barcode corrections and allows for unique barcodes with varying lengths)—max-homopolymer 8(sets the maximum homopolymer length to 8 bp)—min-qual-score 25(sets the minimum average quality score to 25, applies only when quality scores are provided to the pipeline)—max-ambig 0(sets the maximum number of ambiguous bases allowed to 0)The output of this component (seqs.fna) is a single set of filtered reads identified by sample and meeting the above quality criteria.

B.3 Selection of high identity clusters

To assist in de novo chimera detection and downstream taxonomic analysis, sequences are clustered into high identity OTUs using a 99% identitythreshold and the QIIME command We allow for reverse complement searching by UCLUST here. The longest sequences in each stringent cluster areselected as OTU representatives using The relative abundance of each OTU is denoted with each representative sequence for UCHIME.

B.4 Chimera identification and removal

To detect putative chimeric sequences in the filtered data, representative sequences are input to UCHIME (using de novo mode with defaultparameters). Representatives assigned as chimeras propagate the assignment across their clusters, and a single list of all putative chimeras is output. Allchimeric sequences are then removed from consideration before the next step in the pipeline.

C. Sequence processing

C.1 Sequence clustering

The QIIME script is used to cluster all non-chimeric reads from all samples into genus-level operational taxonomic units (OTUs) based on anucleotide sequence identity threshold. The clustering program for this step is UCLUST [2] and the nucleotide sequence identity threshold for all reads within an OTU is 85%. UCLUST is set to examine both the forward and reverse complementsequences during clustering.

C.2 Alpha-diversity analysis.

Genus-level OTUs created by the QIIME commands above are reorganized and input to Mothur which uses the scripts read.otu, rarefaction.single, andsummary.single to generate rarefaction curves and estimators of species diversity for each sample. Finally a custom program called Leech is used to plotall rarefaction curves together defined by varying color schemes related to the input mapping file.

C3. Taxonomic annotation of high identity clusters

All non-chimeric representative sequences from the 99% clusters generated in step B.3 are searched against a custom database of ITSreference sequences from known species (clovritsdb v1.0) using BLASTN with the following options: ”-e 1.0e-5” (e-value threshold), ”-b 10” (numberof hits to show) and ”-m 8” (tabular output). Each sequence is assigned to the taxonomic lineage of the best BLAST alignment covering at least 90% of thequery sequence length and matching with a minimum identity of 90%, 85%, 75%, 70%, and 60% identity for species, genus, family, order, and class-levelassignments, respectively. Representatives without alignments of sufficient coverage or identity at a specific taxonomic-level are denoted as”Unclassified.” Hits are propagated across the corresponding clusters.

D. Additional analysis using Metastats and the R statistical package

The output from the taxonomic classification of each sequence from all samples by the BLAST-based classification step is further analyzed and graphicallyrepresented using the Metastats program [6] and customized scripts in the R programming language.

D.1 Detection of differentially abundant features

Metastats uses count data from annotated sequences to compare two populations in order to detect differentially abundant features [6]. BLASTN results are processed to detect different taxonomic groups at multiple levels (class, order, family, genus, species). Metastats produces atab-delimited table displaying the mean relative abundance of a feature, variance and standard error together with a p value and q value to describesignificance of the detected variations (see project website: Note Metastats can run analyses of 1 sample vs. 1 sample, orN samples vs. M samples, where N and M are greater than 1. It cannot perform a comparison of 1 sample vs. 2 samples.

D.2 Stacked histogram generation

Custom R scripts are used to normalize taxonomic group counts to relative abundances. Stacked histograms of the relative abundances are generated in the.pdf format, if there are at most 50 samples and at most 25 taxon groups. Beyond these limits a visualized histogram is not generated.

D.3 Unsupervised sample clustering

A custom R script called skiff is used to normalize taxon counts and to calculate distance matrices for samples and taxonomic groups, using a Euclideandistance metric. Complete-linkage (furthest neighbor) clustering is employed to create dendrograms of samples and taxa in the .pdf format. The R packagesRColorBrewer and gplots are included in this task.

D.4 Pie chart visualization

Custom R scripts are used to form pie charts displaying proportions of sequences assigned to specific functional and taxonomic levels for up to 12 samples.Outputs are in .pdf format. For more than 12 samples this function is not performed, as the visual comparison for the user would be cumbersome.

[1]Note: fasta and quality files can be retrieved from an sff file using the Roche/454 proprietary program sffinfo.

Anticipated Results

Table 3

Table 4


  1. Caporaso J.G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 7:335-336 (2010).
  2. Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461 (2010).
  3. Edgar R.C. et al. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27:2194-2200 (2011).
  4. Schloss P.D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 75:7537-7541 (2009).
  5. Altschul S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402 (1997).
  6. White J.R. et al. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 5:e1000352 (2009).
  7. Koetschan C. et al. ITS2 database IV: Interactive taxon sampling for internal transcribed spacer 2 based phylogenies. Mol Phyl Evol. 63(3):585-588 (2012).


This material is based upon work supported by the National Science Foundation under Grant No. 0949201 and by the National Human Genome Research Institute under Grant No. 5RC2HG005597-02.


Table1SingleSequenceInputMappingFile: Table 1. Mapping file, example 1.

Download Table1SingleSequenceInputMappingFile

Example for tab-delimited table to be used as mapping file in combination with a single sequence input file.

Table2MultipleSequencesInputMappingFile: Table 2. Mapping file, example 2.

Download Table2MultipleSequencesInputMappingFile

Example for tab-delimited table to be used as mapping file in combination with multiple sequence input files (one FASTA file per sample).

Full text document: CloVR-ITS Protocol

Download Full text document

Associated Publications

CloVR-ITS: Automated internal transcribed spacer amplicon sequence analysis pipeline for the characterization of fungal microbiota. James Robert White, Cynthia Maddox, Owen White, Samuel V Angiuoli, and W Florian Fricke. Microbiome 1 (1) doi:10.1186/2049-2618-1-6

Author information

James Robert White & W. Florian Fricke, Institute for Genome Sciences, University of Maryland School of Medicine

Correspondence to: James Robert White ([email protected]) W. Florian Fricke ([email protected])

Source: Protocol Exchange (2013) doi:10.1038/protex.2013.017. Originally published online 14 February 2013.

Average rating 0 ratings