PLoS Comput Biol 2016 Dec 16;12(12):e1005249. doi: 10.1371/journal.pcbi.1005249.
Improved Prediction of Non-methylated Islands in Vertebrates Highlights Different Characteristic Sequence Patterns.
Huska M, Vingron M.
Abstract
Non-methylated islands (NMIs) of DNA are genomic regions that are important for gene regulation and development. A recent study of genome-wide non-methylation data in vertebrates by Long et al. (eLife 2013;2:e00348) has shown that many experimentally identified non-methylated regions do not overlap with classically defined CpG islands, which are computationally predicted using simple DNA sequence features. This is especially true in cold-blooded vertebrates such as Danio rerio (zebrafish). To investigate how predictive DNA sequence is of a region's methylation status, we applied a supervised learning approach using a spectrum kernel support vector machine (SVM), asking whether a more complex model can improve non-methylated island prediction and help to understand the sequence properties of these regions. We demonstrate that DNA sequence is highly predictive of methylation status and that, in contrast to existing CpG island prediction methods, our method provides more useful genome-wide predictions of NMIs in all vertebrate organisms that were studied. Our results also show that in cold-blooded vertebrates (Anolis carolinensis, Xenopus tropicalis and Danio rerio), where genome-wide classical CpG island predictions consist primarily of false positives, longer, primarily AT-rich DNA sequence features are able to identify these regions much more accurately.
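As a rough illustration of the spectrum kernel mentioned in the abstract, the following minimal sketch compares two DNA sequences by the inner product of their k-mer count vectors, which is the general idea of a spectrum kernel (Leslie, 2002). The function names `kmer_counts` and `spectrum_kernel` and the choice of k are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the k-mer count vectors of sequences a and b.

    Illustrative only: a real SVM pipeline would typically also
    normalise the kernel and handle ambiguous bases such as N.
    """
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Counter returns 0 for k-mers that are absent from b.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Two sequences that share no 2-mers have kernel value 0.
print(spectrum_kernel("ACGT", "TTTT", 2))
```

A kernel of this form can be passed to any SVM implementation that accepts a precomputed kernel matrix; larger values of k make the feature space more specific at the cost of sparser counts.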
Fig 1. Receiver Operating Characteristic curves show that the DNA sequence is highly predictive of non-methylated regions, and our SVM method achieves higher AUROC than other methods when predicting these regions. A receiver operating characteristic curve for four different classifiers: SVM (our spectrum kernel SVM), CpG ratio (the ratio of observed versus expected CpG dinucleotides), UCSC CpG island predictions (a variant of the observed versus expected method with additional constraints), and Wu HMM (an HMM-based CpG island prediction method), as well as an SVM trained on sequences with randomly shuffled labels, “SVM (random)”. The UCSC and Wu HMM methods are shown as points rather than curves, because they only provide a set of genomic windows rather than scores for the whole genome, essentially the same as choosing a single cutoff score for the other methods. The prediction was run five times with different random splits of training and test data, therefore five lines or points are shown for each method. The performance is very stable between runs, with the lines for each run almost perfectly overlapping. The average area under the curve across all 5 random splits is indicated in each panel.
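The "CpG ratio" baseline in Fig 1 is described only as the ratio of observed versus expected CpG dinucleotides. A minimal sketch of one common formulation (the observed/expected ratio of Gardiner-Garden and Frommer, 1987) is shown below; whether the authors used exactly this formula is an assumption, and the function name `cpg_obs_exp` is hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA window.

    Assumed formula (Gardiner-Garden and Frommer style):
        (number of CG dinucleotides * window length) / (number of C * number of G)
    Returns 0.0 when the window contains no C or no G.
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    n_cpg = seq.count("CG")
    return n_cpg * len(seq) / (n_c * n_g)


# One CG in CCGG, two Cs and two Gs: 1 * 4 / (2 * 2)
print(cpg_obs_exp("CCGG"))  # 1.0
```

Scoring every window with such a ratio and sweeping a cutoff over the scores yields a full ROC curve, whereas a fixed set of predicted islands corresponds to a single point.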
Fig 2. Precision-Recall curves show that the SVM is better able to provide genome-wide NMI predictions while controlling for false positives. These curves plot the fraction of predicted regions that are true NMIs (precision) against the fraction of all NMIs that are identified (recall). The SVM method performs better than all other methods, with a higher AUPRC in every organism. The CpG ratio method performs very poorly in cold-blooded vertebrates (lizard, frog and zebrafish), and in the case of frog never achieves a precision higher than 0.1 regardless of recall. The average area under the curve across all 5 random splits of the genome into parameter tuning, training and test sets is indicated in each panel. The performance of an SVM classifier trained on sequences with randomly shuffled labels, “SVM (random)”, is shown in grey.
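To make the two axes of a precision-recall curve concrete, the sketch below computes precision and recall for one score cutoff; sweeping the cutoff traces out a curve. The function name `precision_recall`, the example labels and scores, and the convention of reporting a precision of 1.0 when nothing is predicted positive are all assumptions made for illustration.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when windows with score >= threshold are called NMIs.

    labels: iterable of 1 (NMI) or 0 (non-NMI); scores: classifier scores.
    """
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1          # predicted NMI, truly an NMI
        elif predicted:
            fp += 1          # predicted NMI, actually not
        elif label:
            fn += 1          # missed NMI
    # Assumed convention: precision is 1.0 if no window is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up toy data, not taken from the study.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
print(precision_recall(labels, scores, 0.5))  # (0.5, 0.5)
```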
Fig 3. Longer DNA subsequences are required to accurately identify NMIs in cold-blooded vertebrates. While the frequency of di- and tri-nucleotides is already highly predictive of NMI status in warm-blooded vertebrates (AUROC >0.97), the frequencies of k-mers of length 6 or more are required for accurate prediction of NMIs in cold-blooded vertebrates. Box plots show the prediction performance on the parameter tuning set across 5 runs of 5-fold cross validation. The data sets for all organisms consist of 30,000 750 bp windows with a 5:1 ratio of non-NMI windows to NMI windows. This fixed number of windows and fixed class imbalance means that both the AUROC (a) and AUPRC (b) can be compared across organisms.
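The numbers in this caption can be unpacked with a little arithmetic. The sketch below derives the class sizes implied by 30,000 windows at a 5:1 ratio and the number of distinct k-mers over the four-letter DNA alphabet for a given k. The helper names `class_counts` and `n_possible_kmers` are hypothetical, and the k-mer count ignores ambiguous bases and any merging of reverse complements, which the caption does not specify.

```python
def class_counts(total_windows, negatives_per_positive):
    """Split a window total into (non-NMI, NMI) counts for a fixed class ratio."""
    positives = total_windows // (negatives_per_positive + 1)
    return total_windows - positives, positives


def n_possible_kmers(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


non_nmi, nmi = class_counts(30_000, 5)
print(non_nmi, nmi)                 # 25000 5000
print(n_possible_kmers(2))          # 16
print(n_possible_kmers(6))          # 4096

# Fraction of NMI windows; a classifier that guesses at random is
# expected to reach a precision close to this value.
prevalence = nmi / (non_nmi + nmi)
```

Because the prevalence is the same in every organism, the baseline for the precision-recall curves is also the same, which is what makes AUPRC values comparable across panels.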
Fig 4. Example regions in frog and lizard showing the improvement of using longer k-mers rather than CpG-based measures. Two example regions from (A) Anolis carolinensis and (B) Xenopus tropicalis. In both cases there are NMIs that contain stretches of relatively low CpG content, which are either poorly predicted (in lizard) or not predicted at all (in frog) using CpG ratios, UCSC CpG island predictions or the Wu HMM method. Nevertheless they are quite accurately predicted using the SVM-based method that uses longer k-mers as features.
References
Antequera. Number of CpG islands and genes in human and mouse. 1993.
Bird. DNA methylation and the frequency of CpG in animal DNA. 1980.
Bird. Methylation-induced repression--belts, braces, and chromatin. 1999.
Blackledge. Bio-CAP: a versatile and highly sensitive technique to purify and characterise regions of non-methylated DNA. 2012.
Bock. CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. 2006.
Cooper. Unmethylated domains in vertebrate DNA. 1983.
Coulondre. Molecular basis of base substitution hotspots in Escherichia coli. 1978.
Cross. Non-methylated islands in fish genomes are GC-poor. 1991.
Das. Computational prediction of methylation status in human genomic sequences. 2006.
Davuluri. Computational identification of promoters and first exons in the human genome. 2001.
Deaton. CpG islands and the regulation of transcription. 2011.
Derrien. Fast computation and applications of genome mappability. 2012.
Elango. DNA methylation and structural and functional bimodality of vertebrate promoters. 2008.
Fang. Predicting methylation status of CpG islands in the human brain. 2006.
Frommer. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. 1992.
Gardiner-Garden. CpG islands in vertebrate genomes. 1987.
Ghandi. Enhanced regulatory sequence prediction using gapped k-mer features. 2014.
Kent. The human genome browser at UCSC. 2002.
Lee. Discriminative prediction of mammalian enhancers from DNA sequence. 2011.
Leslie. The spectrum kernel: a string kernel for SVM protein classification. 2002.
Lewin. Every genome sequence needs a good map. 2009.
Long. Epigenetic conservation at gene regulatory elements revealed by non-methylated DNA profiling in seven vertebrates. 2013.
Long. Protection of CpG islands from DNA methylation is DNA-encoded and evolutionarily conserved. 2016.
Meissner. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. 2005.
Mendizabal. Whole-genome bisulfite sequencing maps from multiple human tissues reveal novel CpG islands associated with tissue-specific regulation. 2016.
Saito. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. 2015.
Sandelin. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. 2004.
Saxonov. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. 2006.
Schweikert. mGene: accurate SVM-based gene finding with an application to nematode genomes. 2009.
Sharif. Divergence of CpG island promoters: a consequence or cause of evolution? 2010.
Song. Association of tissue-specific differentially methylated regions (TDMs) with differential gene expression. 2005.
Wu. Redefining CpG islands using hidden Markov models. 2010.
Ziller. Charting a dynamic DNA methylation landscape of the human genome. 2013.
van Heeringen. Principles of nucleation of H3K27 methylation during embryonic development. 2014.