Authors: Zhiyong Huang, Guangmei Yan, Jun Wang, Xiaoning Wang, Jian Wang, Guojie Zhang, Xiaodong Fang, Cai Li & Fei Ling


Cynomolgus and Chinese rhesus macaque sequencing, assembly and analysis


1. SOAPdenovo assembly

SOAPdenovo employs the de Bruijn graph algorithm both to simplify the task of assembly and to reduce computational complexity. Low-quality reads were filtered out, and potential sequencing errors were removed by k-mer frequency-based error correction. We filtered out the following types of reads:

  1. Reads in which 'N' bases account for more than 10% of the read length.
  2. Reads from short insert-size libraries with more than 65% of bases having a quality ≤7, and reads from large insert-size libraries with more than 80% of bases having a quality ≤7.
  3. Reads with more than 10 bp from the adapter sequence (allowing ≤2 bp mismatches).
  4. Small insert size paired-end reads that overlapped ≥10 bp between two ends.
  5. Read1 and read2 of two paired-end reads that were completely identical (and were hence considered to be the products of PCR duplication).
  6. Reads having a k-mer frequency <4 (to minimize the influence of sequencing errors).

SOAPdenovo first constructs the de Bruijn graph by splitting the reads from short insert-size libraries (200-500 bp) into 31-mers and then merging the 31-mers (30 bp overlaps with 1 bp overhangs); contigs that exhibited unambiguous connections in the de Bruijn graph were then collected. Reads from mate-paired libraries (insert size >2 kb) were then aligned onto the contigs for scaffold building. The paired-end information was used to link contigs into scaffolds step by step, from short insert sizes to long insert sizes.
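
The read filters and the k-mer splitting described above can be illustrated with a minimal Python sketch. This is not the code used in the protocol: every function name is hypothetical, reads are assumed to be plain strings with a list of integer quality scores, and the duplicate filter is assumed to keep the first copy of each identical pair (the protocol does not say whether one copy is retained). The adapter, overlap and k-mer frequency filters are not implemented here.

```python
from collections import Counter


def n_fraction(seq):
    """Fraction of bases in the read that are 'N' (empty reads count as all-N)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def low_quality_fraction(quals, cutoff=7):
    """Fraction of bases whose quality score is <= cutoff."""
    return sum(q <= cutoff for q in quals) / len(quals) if quals else 1.0


def keep_read(seq, quals, large_insert=False):
    """Apply filters 1 and 2 of the list above to a single read."""
    # Thresholds from the protocol: 65% (short insert) or 80% (large insert).
    max_low_q = 0.80 if large_insert else 0.65
    if n_fraction(seq) > 0.10:
        return False
    return low_quality_fraction(quals) <= max_low_q


def drop_duplicate_pairs(pairs):
    """Filter 5: drop pairs whose read1 and read2 both match an earlier pair.

    Assumption: the first copy of each pair is kept.
    """
    seen = set()
    unique = []
    for read1, read2 in pairs:
        if (read1, read2) not in seen:
            seen.add((read1, read2))
            unique.append((read1, read2))
    return unique


def kmers(seq, k=31):
    """Split a read into its k-mers; consecutive k-mers overlap by k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(reads, k=31):
    """Count how often each k-mer occurs across all reads."""
    return Counter(kmer for read in reads for kmer in kmers(read, k))
```

With k = 31, consecutive k-mers returned by `kmers` share 30 bases and differ by a 1 bp overhang, which is the overlap used to connect nodes in the de Bruijn graph. The table from `kmer_counts` is the kind of input a k-mer frequency-based filter or error-correction step would work from.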

2. RNA-seq sequencing

  1. Homogenise frozen tissues in Trizol reagent in a bead mill with 5 mm stainless steel beads.
  2. Follow the Trizol procedure, including two alcohol precipitations and suspension of the final RNA pellet in RNase-free water.
  3. Construct RNA sequencing libraries using an Illumina standard mRNA-Seq Prep Kit. Briefly: use oligo(dT) magnetic beads to purify the poly-A-containing mRNA molecules. Fragment the mRNA into short pieces under controlled temperature, then prime the fragments randomly during first-strand synthesis by reverse transcription. Follow this with second-strand synthesis using DNA polymerase I to create double-stranded cDNA fragments. Subject the double-stranded cDNA to end repair with Klenow and T4 DNA polymerases, then A-tail it with Klenow lacking exonuclease activity.
  4. Ligate Illumina Paired-End Sequencing adapters, size-select by gel electrophoresis and amplify by PCR to complete the library preparation. Sequence the paired-end libraries on an Illumina Genome Analyzer for 100 bp at each end.

3. Gene prediction

Use BLAT to map genes of IR (MMUL01) and human (Ensembl release-56) onto the two macaque genomes. Determine orthologous regions by best-BLAT hit and synteny-based analysis, then apply Exonerate and GENEWISE to refine the gene model at each locus.
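
As an illustration of the best-BLAT-hit step, the sketch below picks one best alignment per query gene from BLAT output in PSL format. It is a simplified example, not the pipeline used here: the function names are hypothetical, the score (matches minus mismatches) is an assumed stand-in for whatever ranking was actually applied, and the synteny-based analysis is not covered.

```python
def parse_psl_line(line):
    """Extract the fields used below from one tab-separated PSL alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "matches": int(fields[0]),      # PSL column 1: matches
        "mismatches": int(fields[1]),   # PSL column 2: misMatches
        "query": fields[9],             # PSL column 10: qName
        "target": fields[13],           # PSL column 14: tName
        "t_start": int(fields[15]),     # PSL column 16: tStart
        "t_end": int(fields[16]),       # PSL column 17: tEnd
    }


def best_hits(psl_lines):
    """Return the highest-scoring hit for each query name.

    Assumption: score = matches - mismatches; ties keep the first hit seen.
    """
    best = {}
    for line in psl_lines:
        # Skip blank lines and the PSL header (alignment rows start with a digit).
        if not line[:1].isdigit():
            continue
        hit = parse_psl_line(line)
        score = hit["matches"] - hit["mismatches"]
        current = best.get(hit["query"])
        if current is None or score > current[0]:
            best[hit["query"]] = (score, hit)
    return {query: hit for query, (score, hit) in best.items()}
```

Given the lines of a PSL file, `best_hits` returns a dictionary mapping each gene name to its best target locus, which could then be handed to Exonerate or GENEWISE for refinement.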

4. Assembly quality validation with the neutral indel model

The neutral indel model (ref. 1) can be used to validate the quality of our genome assemblies. When two closely related genome sequences are aligned, the lengths of successive alignment blocks (split by gaps during the alignment), termed inter-gap segments (IGS), are expected to follow a geometric frequency distribution under a standard neutral model. Within neutrally evolving regions, incorrect indels introduced during the assembly process cause the observed IGS length distribution to depart from the geometric distribution: the introduced indels generate an excess of short IGS over the number predicted by the neutral indel model. By quantifying this excess, several parameters can be estimated, namely the proportion (ɛ), average density (D) and number (Ng) of the clustered erroneous gaps in the genome alignments.
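
The idea of measuring an excess of short IGS against a geometric expectation can be sketched as follows. This is a deliberately simplified illustration and not the estimator of Meader et al.: it assumes that IGS of length `min_len` or more are unaffected by assembly errors, fits the geometric parameter to that tail only, and extrapolates the expected number of shorter segments. The function names and the `min_len` cutoff are hypothetical, and the sketch does not compute ɛ, D or Ng.

```python
def fit_geometric_tail(lengths, min_len):
    """Estimate the geometric parameter p from segments of length >= min_len.

    By the memoryless property, (length - (min_len - 1)) for tail segments is
    itself geometric on {1, 2, ...}, so the maximum-likelihood estimate of p is
    the reciprocal of its mean.
    """
    tail = [x for x in lengths if x >= min_len]
    if not tail:
        raise ValueError("no segments at or above min_len")
    mean_offset = sum(x - (min_len - 1) for x in tail) / len(tail)
    return 1.0 / mean_offset


def short_igs_excess(lengths, min_len):
    """Observed minus expected number of segments shorter than min_len."""
    p = fit_geometric_tail(lengths, min_len)
    n_tail = sum(1 for x in lengths if x >= min_len)
    n_short_observed = len(lengths) - n_tail
    # Under the geometric model, P(length >= min_len) = (1 - p) ** (min_len - 1).
    tail_prob = (1.0 - p) ** (min_len - 1)
    if tail_prob == 0.0:
        raise ValueError("tail too uniform to extrapolate short segments")
    n_short_expected = n_tail / tail_prob - n_tail
    return n_short_observed - n_short_expected
```

A positive return value means more short segments were observed than the fitted geometric distribution predicts, which is the signature of erroneous gaps described above.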


  1. Meader, S., Hillier, L. W., Locke, D., Ponting, C. P. & Lunter, G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res 20, 675-684.

Associated Publications

Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Guangmei Yan, Guojie Zhang, Xiaodong Fang, Yanfeng Zhang, Cai Li, Fei Ling, David N Cooper, Qiye Li, Yan Li, Alain J van Gool, Hongli Du, Jiesi Chen, Ronghua Chen, Pei Zhang, Zhiyong Huang, John R Thompson, Yuhuan Meng, Yinqi Bai, Jufang Wang, Min Zhuo, Tao Wang, Ying Huang, Liqiong Wei, Jianwen Li, Zhiwen Wang, Haofu Hu, Pengcheng Yang, Liang Le, Peter D Stenson, Bo Li, Xiaoming Liu, Edward V Ball, Na An, Quanfei Huang, Yong Zhang, Wei Fan, Xiuqing Zhang, Yingrui Li, Wen Wang, Michael G Katze, Bing Su, Rasmus Nielsen, Huanming Yang, Jun Wang, Xiaoning Wang, and Jian Wang. Nature Biotechnology doi:10.1038/nbt.1992

Author information

Zhiyong Huang, Jun Wang, Jian Wang & Guojie Zhang, Beijing Genomics Institute, Shenzhen

Guangmei Yan & Xiaoning Wang, The South China Center for Innovative Pharmaceuticals, Guangzhou 510663, China

Xiaodong Fang, Cai Li & Fei Ling, Unaffiliated

Correspondence to: Guangmei Yan, Jun Wang, Xiaoning Wang, Jian Wang

Source: Protocol Exchange (2011) doi:10.1038/protex.2011.264. Originally published online 4 November 2011.
