Authors: Phillips Y.H. Huang, Yuyuan Han, Lusy Handoko, Stoyan Velkov, Eleanor Wong, Edwin Cheung, Xiaoan Ruan, Chia-Lin Wei, Melissa Jane Fullwood & Yijun Ruan
The three-dimensional organization of chromatin in the nuclear space is involved in regulation of gene expression. Circular Chromosome Conformation Capture (4C) is an established method for genome-wide screening of chromatin interactions associated with a given locus of interest, without prior knowledge of the identity of interacting partners. Briefly, 4C involves the cross-linking of chromatin material, followed by restriction enzyme digestion of the chromatin, proximity-based ligation of interacting DNA fragments within the same DNA-protein complex, and amplification of interacting sequences by inverse PCR. The use of restriction enzyme digestion together with PCR could lead to biases in detection as well as false positives through repeated identification of clonal products. Here, we present a modification of the original 4C method, in which sonication is used to randomly fragment chromatin fibers instead of restriction enzymes at specific sites, to eliminate these biases, thus enabling high-throughput analysis by next-generation sequencing to detect interacting sequences.
P.Y.H.H. and Y.H. made equal and critical contributions to this manuscript.
The three-dimensional organization of chromatin in the nuclear space is considered to be a critical contributor to the regulation of gene expression. The formation of chromatin loops, for instance, allows distal cis-enhancer elements to communicate with gene promoters and thereby up-regulate or down-regulate transcription (1,2).
High-resolution analysis of chromatin organization in vivo was made possible by the invention of the Chromosome Conformation Capture (3C) method (3), which has been widely used to detect and quantify interactions between genomic loci of interest. In this assay, intact cells are treated with formaldehyde to cross-link chromatin segments that are in close proximity. Cross-linked chromatin is then digested by a restriction enzyme and ligated under dilute conditions that favor the formation of junctions between interacting DNA fragments. Finally, the cross-links are reversed and site-specific PCR primers are used to quantitatively detect ligation events between selected pairs of restriction fragments. However, as the 3C technique assumes prior knowledge of interacting sequences, it cannot be used to screen for unknown interactions. To address this limitation, the Circular Chromosome Conformation Capture (4C) approach was developed (4,5,6,7), enabling genome-wide interrogation of DNA segments associated with a given locus of interest, also known as the “bait” region. A typical 4C protocol operates on the same principles pioneered by 3C: formaldehyde cross-linking is carried out to capture DNA-protein interactions, and restriction digest generates DNA fragments which are then subjected to proximity ligation. Unlike in 3C, however, the generation of circular DNA during ligation is central to the 4C strategy. These circularized molecules serve as the template for inverse PCR using bait-specific primers that are strategically positioned to amplify interacting DNA sequences flanking the target region. The amplified sequences are subsequently identified either by large-scale sequencing or by microarray analysis.
In 2009, Chromatin Interaction Analysis with Paired-End Tags (ChIA-PET) was developed for high-throughput, global, de novo detection of chromatin interactions without the requirement for a specific bait region (8,9). Briefly, chromatin immunoprecipitation is performed on sonicated, cross-linked chromatin to reduce the complexity of the library and allow investigation of specific transcription factors. The chromatin is subjected to proximity ligation, reverse cross-linked, and sequenced using next-generation sequencing methods. In addition, Hi-C was developed for global detection of chromatin interactions (10). Briefly, cross-linked chromatin is digested by restriction enzymes, subjected to proximity ligation, reverse cross-linked, and sequenced using next-generation sequencing methods. Blocks of DNA are then examined to determine the proximity of these regions of the DNA. As these methods are high-throughput, 4C methods can complement them through serving as validation studies, and in-depth analyses of specific regions.
While the 4C protocol in the general form described above has been used in several studies (5,6,7), there are nonetheless certain technical limitations. A serious shortcoming of the original 4C technique lies in its inability to provide an accurate assessment of interaction frequencies and in the possibility of false positives. It is important to recognize that the use of multiple rounds of PCR, both to select for bait-associated interactions and to generate sufficient DNA for sequencing or microarray hybridization, inevitably leads to massive clonal amplification of interacting DNA sequences. This could lead to false positives or negatives that would complicate the interpretation of sequencing or microarray data, making it difficult to determine true interaction frequencies. Generally, the repeated appearance of the same PCR products across multiple independent rounds of 4C on different biological samples has been accepted as evidence that the chromatin interactions are bona fide; however, if a particular DNA sequence is always enriched because of preferential clonal amplification during PCR, the chromatin interaction could nonetheless be a false positive.
Another major issue is the potential for bias associated with the use of restriction digest to fragment chromatin material, as performed in many 3C, 4C, and Hi-C protocols (10,11,12). Certain chromatin regions – for example, transcriptionally active sites containing fewer histone proteins – may be relatively more accessible to endonucleases and hence be preferentially digested. Cross-linking stringency has been shown to be inversely related to restriction digest efficiency (13); furthermore, long fragments arising from incomplete digestion are selected against during PCR amplification (14). As a consequence, interacting sequences in regions with high cross-linking efficiency may be under-represented. The uneven distribution of restriction enzyme recognition sites may also contribute to bias, as different interacting segments may be over- or under-represented depending on the frequency of restriction sites in their respective genomic regions (14,15). Moreover, some restriction enzymes may perform poorly in the presence of sodium dodecyl sulphate (SDS) and Triton X-100, both of which are used in the 4C technique to prevent aggregation of nuclei and to open up chromatin for restriction enzyme digestion (14).
The clonal amplification issue, which could lead to a high rate of false positives, has to a certain extent been ameliorated in the microarray-based detection method. 4C microarray approaches work by normalizing PCR-amplified 4C data against PCR-amplified genomic background, thus normalizing clonal amplifications, and then applying a running mean algorithm to define clusters of increased hybridization signals relative to the surrounding genomic area (4,16), hence enabling the detection of clusters indicating chromatin interactions. However, microarrays have a limited dynamic range and poor coverage of repetitive regions. Sequencing offers many advantages over microarrays: within reasonable cost limits, the dynamic range can be expanded to suit the needs of each experiment simply by increasing the total number of sequencing reads. Also, sequencing allows for direct counting. However, previous 4C studies that made use of the sequencing approach failed to exclude the possibility of sampling error, as their analyses were focused on relatively small numbers of clones; moreover, such studies would not have been able to eliminate the clonal amplification biases (5,6,7).
To overcome these issues and develop an unbiased 4C sequencing-based assay, we have developed a modified 4C protocol which uses sonication instead of restriction digest to fragment chromatin DNA. Sonication has the benefit of eliminating any potential problems with restriction enzyme bias, as its acoustic-based physical shearing mechanism disrupts DNA in a random manner, generating fragments a few hundred base pairs in length with a random distribution of breakpoints. Hence, when two interacting DNA fragments are joined together, their breakpoints form a ligation junction that is defined by a unique set of genomic coordinates. Sequencing across ligation junctions is thus a potentially powerful means of identifying unique sequences from independent ligation events. The presence of multiple unique sequences clustered around the same locus would then indicate a likely interaction. Coupled with high-throughput next-generation sequencing, this strategy enables the identification of multiple unique sequences, indicating possible chromatin interactions.
Overview of the sonication-based 4C technique
The 4C protocol described here is a genome-wide and unbiased approach for the de novo detection of chromatin interaction targets with a particular bait. 4C makes use of the proximity ligation concept, pioneered by the 3C method, to capture interacting DNA segments within DNA-protein complexes. Unlike the original 4C strategy, our new 4C strategy uses sonication rather than restriction enzyme digestion to fragment chromatin fibers, allowing for the identification of partners in an unbiased manner by next-generation sequencing (Figure 1).
Sonication-based 4C experimental design
Site selection and inverse primer design. Sites for 4C analysis were selected based on ChIA-PET data, but any non-repetitive site may be used for analysis. Primers were designed using Primer3 software (http://frodo.wi.mit.edu/primer3/) (17). The RepeatMasker track in the UCSC Genome Browser was used to ensure that the primers did not lie in repeat regions (http://genome.ucsc.edu/) (18). To ensure specificity, primer sequences were analyzed by BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat) (19), and only unique primer sequences were used. Flanking inverse primers should be around 25 bp long for high specificity, which is critical for the first round of inverse PCR as it selects for the products that would be amplified by nested PCR. Nested inverse primers will need to be designed with the 454 adaptor sequences attached onto the 5’ ends, such that the 454 adaptors can be incorporated into the PCR products. Hence, nested primers can be approximately 20 bp in size to reduce the cost and length of the final oligonucleotides used for nested PCR. Also, flanking inverse primers should be no more than 100 bp apart, and nested primers should be designed to be as close to the inverse primers as possible, because the probability of a randomly sheared DNA fragment containing the sequences necessary for successful priming and PCR amplification is inversely related to the genomic distance spanned by the primers. Hence, to amplify a larger and more diverse population of DNA fragments, the primers should be designed relatively close together.
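The spacing and length guidelines above can be turned into a quick sanity check before ordering oligonucleotides. The following is a minimal sketch only: the `Primer` record, the function name `check_inverse_pair`, and the length tolerance of 3 bp are illustrative assumptions, not part of the published protocol, and the check does not replace the Primer3, RepeatMasker, and BLAT steps.

```python
from typing import List, NamedTuple


class Primer(NamedTuple):
    """Hypothetical record for one bait-specific primer (adaptor excluded)."""
    name: str
    sequence: str  # bait-specific sequence, 5' to 3'
    start: int     # genomic start of the binding site (0-based)
    end: int       # genomic end of the binding site (exclusive)


def check_inverse_pair(left: Primer, right: Primer, target_len: int,
                       len_tolerance: int = 3, max_gap: int = 100) -> List[str]:
    """Return warnings for a primer pair; an empty list means no issue found.

    target_len: about 25 for flanking primers, about 20 for nested primers.
    len_tolerance: assumed allowance around target_len (not from the protocol).
    max_gap: maximum distance in bp between the two binding sites.
    """
    warnings = []
    for primer in (left, right):
        if abs(len(primer.sequence) - target_len) > len_tolerance:
            warnings.append(
                f"{primer.name}: length {len(primer.sequence)} bp "
                f"is far from the target of {target_len} bp"
            )
    gap = right.start - left.end
    if gap < 0:
        warnings.append("binding sites overlap or are in the wrong order")
    elif gap > max_gap:
        warnings.append(f"binding sites are {gap} bp apart (max {max_gap} bp)")
    return warnings
```

A pair that returns an empty list still needs to pass the repeat and uniqueness checks described above; the function only covers geometry.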
Fragmentation of chromatin fibers. Sonication-based 4C uses sonication instead of restriction enzyme digestion to fragment chromatin fibers. The advantages are that more regions of the genome can be interrogated than with restriction enzyme digestion, and that unique end sequences are generated, as opposed to non-unique restriction enzyme digestion ends. Unique end sequences allow clonal amplifications to be identified and removed. The unique tags can then be clustered to identify bona fide chromatin interactions.
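The idea that unique ends expose clonal copies can be illustrated with a few lines of Python. This is a sketch under stated assumptions: each mapped target sequence is represented as a hypothetical `(chrom, start, end)` tuple, and two reads are treated as clonal copies only if all three values are identical. Real data may need some tolerance for sequencing or mapping errors at the ends.

```python
from collections import Counter


def collapse_clonal_reads(reads):
    """Collapse reads with identical mapped ends into unique ligation events.

    reads: iterable of (chrom, start, end) tuples, one per mapped read.
    Returns (unique_events, copy_counts), where unique_events is a sorted
    list of distinct tuples and copy_counts maps each tuple to the number
    of reads that carried it.
    """
    copy_counts = Counter(reads)
    unique_events = sorted(copy_counts)
    return unique_events, copy_counts
```

For example, three reads of which two share both breakpoints collapse to two unique events, and the copy count records that one of them was seen twice.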
Sequencing of 4C material. 4C material may be analyzed by 454 Titanium next-generation sequencing. 4C material consists of bait-target-bait structures, and long read lengths will enable read-through past the bait into the target sequences. While we used the 250 bp read length in the experiments described here, the use of 400 bp in the new Titanium system would allow for better read-throughs of the sequences. In its present form, sonication-based 4C cannot be used with Illumina or ABI SOLiD next-generation sequencing methods.
Data analysis. Data analysis is then required to identify putative interactions. First, the sequences are mapped to the genome. BLAT or BLAST may be used. As the chromatin is sonicated, the probability of generating exactly identical DNA fragments is low; hence any redundant sequences are considered to be copies amplified during the cloning and/or PCR amplification processes. Therefore, only nonredundant distinct sequences are used for further analysis. Next, the “multiple overlaps” concept is used to distinguish true signals from noise. The principle of this concept is that we expect PETs derived from nonspecific fragments to be randomly distributed in the genome as background sequences, whereas interacting sequences derived from the same bona fide interactions will overlap with each other to form a cluster of interacting sequences.
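The “multiple overlaps” step can be sketched as a simple interval-merging pass. The code below is an illustration, not the pipeline used for the results reported here: the function name, the half-open `(chrom, start, end)` representation, and the default threshold of two overlapping sequences per cluster are assumptions chosen to match the description above.

```python
def cluster_overlapping(intervals, min_members=2):
    """Group overlapping nonredundant sequences into candidate clusters.

    intervals: iterable of (chrom, start, end) tuples with half-open
        coordinates; exact duplicates are dropped before clustering.
    min_members: minimum number of distinct sequences required for a
        cluster to be reported (singletons are treated as background).
    Returns a list of (chrom, cluster_start, cluster_end, n_sequences).
    """
    clusters = []
    for chrom, start, end in sorted(set(intervals)):
        last = clusters[-1] if clusters else None
        if last is not None and last[0] == chrom and start < last[2]:
            # Overlaps the current cluster: extend it and count the member.
            last[2] = max(last[2], end)
            last[3] += 1
        else:
            clusters.append([chrom, start, end, 1])
    return [tuple(c) for c in clusters if c[3] >= min_members]
```

Sorting by chromosome and start position means each sequence only needs to be compared with the most recent cluster, so the pass is linear after the sort.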
Validations. These can be performed using the 3C method, as well as FISH, to confirm whether there is an interaction. FISH is particularly useful as an orthogonal validation method that employs very different techniques from 4C to visualize the interaction. FISH probes may also then be used to study the interaction in clinical samples that generally involve very small amounts of cells. It should be noted that because FISH is limited by low resolution, it can only be performed on interactions that exceed 1 Mb.
Applications of the 4C method
This protocol may be used to interrogate chromatin interactions involving a genomic region of interest, and can serve as a validation method for ChIA-PET, 5C, and other genome-wide analyses. With 4C, targeted questions may be asked in specific genome regions, for example, in analyses of the keratin gene cluster. Our analyses of the keratin gene cluster suggest that keratin genes may be brought together by chromatin interactions for coordination of transcription. Moreover, 4C may be combined with ChIP in order to interrogate chromatin interactions bound by particular proteins.
General points to note:
The starting material is cells. The chromatin should be isolated from about 10e6 – 10e7 cells such that there will be sufficient chromatin material for library construction and the resulting library will be of high complexity. The cells may be treated with drugs as appropriate (Appendix I). The results shown came from a 4C library prepared from estrogen-treated MCF-7 cells.
A. Chromatin preparation
Timing: ~ 1-2 days
B. 4C library preparation
Timing: ~ 3-4 days
C. 4C library amplification
Timing: ~ 2-3 days
Possible Problem 1: Quality Control run of annealed linkers yields two or more discrete bands. The reason might be that unequal molar amounts of oligonucleotides were used, resulting in incomplete annealing. The solution is to test different ratios of oligos to determine the optimal ratio for stoichiometric annealing; if necessary, run an Agilent 1000 DNA chip to double check.
Possible Problem 2: Quality Control run of PCR products shows a very weak smear/no smear. One possible reason is sub-optimal PCR cycle conditions, in which case solutions include increasing the number of cycles (do not use more than 25 cycles), decreasing the annealing temperature, or increasing the elongation time. Another possible reason is insufficient template DNA, in which case the solution is to increase the starting amount of template DNA in the PCR reaction.
Possible Problem 3: Very bright smear/doughnut-shaped band/no band observed after PCR scale-up. A possible reason is that too much DNA was loaded onto the PAGE gel. Solutions include decreasing the number of PCR reactions, or splitting samples up into more wells.
Possible Problem 4: Library has many repeated reads. Possible reasons may be that the library has low complexity. To minimize this problem, use high amounts of starting and template DNA to maximize the amount of DNA used in the PCR; reduce the number of PCR cycles. We have found that despite trying these options, many libraries still have many repeated reads, suggesting that chromatin interactions are inherently rare events.
Possible Problem 5: Poor results observed upon sequencing and data analysis (few/low quality chromatin interactions). One reason may be that there were problems with upstream chromatin preparation procedures, e.g. poor cross-linking. Troubleshoot chromatin preparation procedures: ensure that the formaldehyde used is fresh and functional, and ensure that sonication worked by running a quality control agarose gel. Another reason may be that the region of interest is repetitive. To troubleshoot, check that the region used is not highly repetitive (and hence difficult to analyze by sequencing followed by unique mapping).
Possible Problem 6: Mapping errors are observed (mapping is wrong upon manual double-checking of a few examples using UCSC BLAT and other mapping methods). This could be because the mapping was incorrectly done. To troubleshoot, ensure that only unique mappings are used to identify chromatin interactions.
A successful library preparation would pass the following quality controls: (1) well-sonicated chromatin (Figure 2a); (2) a smear of about 200 to more than 500 bp in the quality control gel run after PCR (Figure 2b); (3) chromatin interactions following library sequencing (Figure 2c-e).
In an experiment performed on a keratin gene cluster region, 454 GSFLX (a prior version to Titanium) sequencing generated approximately half a million sequences. All the sequences were mapped to the reference genome (hg18 human genome assembly) to identify the target regions in relation to the bait region, and 0.44 million sequences (95%) showed at least one hit to the genome. Sequences that did not show at least two mapping regions were filtered away, as these could be incompletely sequenced ligation products or DNA sequences that did not ligate. 95,050 (21.6%) reads had at least 2 hits, with the first hit mapping correctly to the primer site. In further experiments, the longer sequencing read lengths offered by the new 454 Titanium system, as well as even longer ligation times, could increase this proportion.
Sequences were filtered to remove redundant clonal amplifications (repeated sequences). Of 95,050 reads, 3,660 (3.9%) sequences were found to be unique. Many redundant clonal amplification sequences were observed, indicating that clonal amplifications within 4C libraries are indeed an issue. Previous methods that used restriction enzymes to fragment the chromatin would not have been able to distinguish clonal amplifications from bona fide enriched chromatin interaction signals. In future experiments, further increasing the amount of starting template, and the amount used as PCR template, could help to reduce the redundancy. Also, while reducing the number of PCR cycles could reduce the amount of selection performed on the 4C library and hence increase non-specific 4C ligation noise, this modification might also reduce the amount of redundancy seen in the library.
The majority of the non-redundant sequences either mapped randomly along the genome as potential non-specific 4C products, or mapped to the “bait” region within 1 kb, suggesting self-ligation products in the 4C experiment. 3,429 (93.7%) sequences were intrachromosomal ligation products, and they comprised 3,388 (98.8%) self-ligation products, 30 (0.9%) inter-ligation products representing expected interactions, and 11 (0.3%) other long-range inter-ligations typical of random noise. Manual inspection of the 30 sequences that mapped to expected interaction loci revealed 7 unique ligation events, which were then collapsed into 4 distinct interactions. 2 interaction clusters (2 or more overlapping unique PETs) were found, namely chr12:50828808-50880947 (genomic span=52 kb; 2 ligation events) and chr12:50828801-50883887 (genomic span=54 kb; 3 ligation events). The remaining 2 interactions – chr12:50828832-51024898, with a genomic span of 196 kb, and chr12:50828839-51575465, with a genomic span of 746.6 kb – were represented by single unique ligation events (Figure 2c-e).
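The filtering steps reported above form a funnel in which each percentage is taken relative to the preceding stage. The short sketch below recomputes those percentages from the counts given in the text. It assumes that “0.44 million” corresponds to 440,000 mapped sequences, which is an approximation introduced here for illustration; the function and key names are likewise hypothetical.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def funnel_percentages():
    """Recompute stage-to-stage percentages from the reported read counts."""
    mapped = 440_000            # assumption: "0.44 million" taken as 440,000
    two_hits = 95_050           # reads with at least two hits, first at primer
    unique = 3_660              # nonredundant sequences
    intrachromosomal = 3_429    # unique sequences on the bait chromosome
    self_ligation = 3_388       # within 1 kb of the bait
    expected = 30               # mapped to expected interaction loci
    other_long_range = 11       # other long-range inter-ligations
    return {
        "two_hits_of_mapped": pct(two_hits, mapped),
        "unique_of_two_hits": pct(unique, two_hits),
        "intrachromosomal_of_unique": pct(intrachromosomal, unique),
        "self_ligation_of_intra": pct(self_ligation, intrachromosomal),
        "expected_of_intra": pct(expected, intrachromosomal),
        "other_of_intra": pct(other_long_range, intrachromosomal),
    }
```

Laying the counts out this way makes it easy to see where most reads are lost: the largest drop is at redundancy removal, which is consistent with the emphasis on clonal amplification above.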
We noted that the 4C data showed a very clean background (Figure 2d, e). From the bait region to the first and the second interaction sites (about 200 kb distance), there are no background sequences in the intervals. Given that we prepared the chromatin material used for our 4C analysis by sonication, which is different from the standard 3C and 4C protocols, this result suggests that the sonication method could be very efficient in “shaking off” non-specific chromatin fragments randomly attached to the specific chromatin interaction complexes. We expect to obtain discrete interaction peaks formed by clusters of inter-ligation sequences from sonicated material, because we expect real interactions to be captured by the proximity ligation process. While the detached non-specific chromatin fragments would still be present in the DNA pool, they would not be amplified by the 4C PCR detection method.
We specifically looked at the KRT gene cluster site (previously identified by ChIA-PET (9)), where the “bait” region lay, from which we designed our inverse 4C PCR primers. Moving right from the “bait” region, we identified overlapping sequence clusters that correlated very well with the locations of the interaction sites identified by ChIA-PET data, cross-validating both the ChIA-PET data as well as the 4C protocol (Figure 2c, e).
Interestingly, analysis of chromatin interactions by 4C and ChIA-PET in the keratin region suggests that chromatin interactions are correlated with gene expression coordination. Both ChIA-PET and 4C data show that KRT7, KRT8, and KRT18 are all pulled into the “hub” of the same interaction complex. KRT7, 8, and 18 are known to be expressed in breast carcinomas. In particular, KRT8 and KRT18 are tightly coexpressed genes, and the gene products bind tightly to each other, pairing up by the formation of a heterodimer between KRT8 (a “type II” keratin) and KRT18 (a “type I” keratin). Without the formation of a heterodimer, type I and type II keratins are rapidly degraded (20). These two genes are connected by many inter-ligations. By contrast, KRT5, 6, 1, 2, and other keratins involved in other processes such as hair development (for example, KRT72 and KRT75) are not expressed, and they are present in the “loop” of the interaction complex. Hence, chromatin interactions in the keratin region may bring together relevant genes into transcriptional foci, and loop out irrelevant genes, in order to achieve tightly coordinated gene expression regulation.
In conclusion, our novel sonication-based 4C protocol has enabled the identification of bona fide chromatin interactions for ChIA-PET validation of chromatin interactions in the keratin gene cluster, demonstrating that chromatin interactions in the keratin cluster may function to coordinate gene transcription. With sonication, non-specific noise could be “shaken off”, thus reducing the very high non-specific noise seen in the original 3C and 4C protocols. Moreover, with sonication as opposed to restriction enzyme digestion, a previously unrecognized problem with regard to high amounts of sequenced clonal amplifications may be reduced. With further optimizations, sonication-based 4C could become a robust method for use in conjunction with next-generation sequencing to identify and study chromatin interactions. Moreover, sonication-based 4C could be coupled with faster and cheaper third-generation sequencing methods such as Pacific Biosciences (21) as they become available. With the improved throughput and reduced sample requirements of the new third-generation sequencing methods, it may become possible to omit part C of the sonication-based 4C protocol presented here, and simply sequence the entire library of proximity-ligated sonicated chromatin, which would allow for ultra-high-throughput analysis of global chromatin interactions at high resolution.
The authors acknowledge the Genome Technology and Biology Group and the Cancer Biology and Pharmacology group at the Genome Institute of Singapore for technical support in developing the protocol. The authors acknowledge the bioinformatics group supervised by Dr Ken Sung, as well as Mr. Atif Shahab, Mr. Chan Chee Seng, and Mr. Fabianus H. Mulawadi for computing support; and Drs Shujun Luo and Gary Schroth for Illumina sequencing support. M.J.F., P.Y.H.H., and Y.H. are supported by ASTAR Scholarships. M.J.F. is supported by a 2009 L’Oreal For Women In Science National Fellowship and a 2010 Lee Kuan Yew Post-Doctoral Fellowship. Y.R. and C.L.W. are supported by ASTAR of Singapore and NIH ENCODE grants (R01 HG004456-01, R01HG003521-01, and part of 1U54HG004557-01).
Figure 1: Schematic comparison of 4C procedures.
a. Outline of the original 4C method. b. Outline of the sonication-based 4C method.
Figure 2: 4C validations.
a. Sonication quality control gel showing that the chromatin has been successfully fragmented into sizes of about 200 – 2000 bp. b. The 4C PCR products using the “bait” primer pair based at the KRT chromatin interaction region. The boxed range of DNA amplicon was gel-excised for sequencing analysis. c. Chromatin interactions at the KRT gene cluster identified by ERα ChIA-PET analysis. d. An enlarged view (10 Mb) of 4C sequence mapping centered at the KRT gene cluster shows that the 4C data is very clean. e. The 4C sequences mapped at the KRT gene cluster locus, aligned with the view in c. The highest 4C sequence mapping peak is at the 4C “bait” site (indicated by a blue dot). The interaction anchors of this interaction complex were mapped by 4C sequences.
Supplementary Document 1: Appendices
An oestrogen-receptor-α-bound human chromatin interactome. Melissa J. Fullwood, Mei Hui Liu, You Fu Pan, Jun Liu, Han Xu, Yusoff Bin Mohamed, Yuriy L. Orlov, Stoyan Velkov, Andrea Ho, Poh Huay Mei, Elaine G. Y. Chew, Phillips Yao Hui Huang, Willem-Jan Welboren, Yuyuan Han, Hong Sain Ooi, Pramila N. Ariyaratne, Vinsensius B. Vega, Yanquan Luo, Peck Yean Tan, Pei Ye Choy, K. D. Senali Abayratna Wansa, Bing Zhao, Kar Sian Lim, Shi Chi Leow, Jit Sin Yow, Roy Joseph, Haixia Li, Kartiki V. Desai, Jane S. Thomsen, Yew Kok Lee, R. Krishna Murthy Karuturi, Thoreau Herve, Guillaume Bourque, Hendrik G. Stunnenberg, Xiaoan Ruan, Valere Cacheux-Rataboul, Wing-Kin Sung, Edison T. Liu, Chia-Lin Wei, Edwin Cheung, and Yijun Ruan. Nature 462 (7269) 58 - 64 05/11/2009 doi:10.1038/nature08497
Phillips Y.H. Huang, Yuyuan Han, Lusy Handoko, Stoyan Velkov, Eleanor Wong, Xiaoan Ruan, Chia-Lin Wei, Melissa Jane Fullwood & Yijun Ruan, Genome Technology and Biology, Genome Institute of Singapore
Edwin Cheung, Cancer Biology and Pharmacology, Genome Institute of Singapore
Source: Protocol Exchange (2010) doi:10.1038/protex.2010.207. Originally published online 14 December 2010.