Database (Oxford)
2013 Jan 09;2013:bat053. doi: 10.1093/database/bat053.
MisPred: a resource for identification of erroneous protein sequences in public databases.
Nagy A, Patthy L.
Abstract
Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. DATABASE URL: http://www.mispred.com.
Figure 1. MisPred annotation of an erroneous protein sequence. The figure shows the entry for a protein sequence of X. tropicalis deposited in the NCBI/RefSeq database with the protein ID NP_001072931.1 and in the UniProtKB/TrEMBL database with the protein ID Q08CW3_XENTR. The protein was identified as erroneous by MisPred tool 4 (domain size deviation) because it contains only a fragment of a domain (Pfam-A domain PF01822, WSC).
Figure 2. MisPred analysis of a protein sequence for potential sequence errors. The sequence shown in Figure 1 was analysed with the various MisPred tools. This figure shows basic information about the input protein sequence (automatically generated sequence ID, species name, protein sequence, task status and date and time of the completion of the analysis).
Figure 3. MisPred analysis of a protein sequence for potential sequence errors. The sequence shown in Figure 1 was analysed with the various MisPred tools. The figure shows the primary conclusions based on the analyses for signal peptide, Pfam-A domains, transmembrane helix, GPI anchor, domain-size integrity and chromosomal localization of the exons encoding the protein. In the rows showing the Pfam-A domains present in this protein, the different characters represent the output of the HMMscan program. For example, in the first row, the characters (from left to right) indicate the model used (ls), the domain type identified (PF00051.10), the number of copies of this domain type in this protein (1), the first and last residues of the domain, defined by residue numbering of this protein (25 106), the first and last residues of the HMM of this domain type that align with PF00051 of this protein (1 85), the score of the match (84.6) and the E-value of the match (2.1e-24). Note that these analyses revealed that the protein is a secreted extracellular protein that contains a secretory signal peptide and two types of extracellular domains. Consistent with the extracellular localization of the protein, it does not contain intracellular signaling domains, nuclear domains or transmembrane helices. However, the protein is erroneous inasmuch as one of its extracellular protein domains, the Pfam-A domain PF01822 (WSC domain), is truncated, an error that is detected by MisPred tool 4 (domain-size deviation).
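The field order described in the Figure 3 caption can be illustrated with a short parsing sketch. It assumes each domain row is a single whitespace-separated line with exactly nine fields and that the E-value is written without an internal space. The `DomainHit` class and `parse_domain_row` function are hypothetical names introduced here for illustration and are not part of MisPred.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """One Pfam-A domain row, in the field order given in the Figure 3 caption."""
    model: str       # model used, e.g. "ls"
    accession: str   # domain type, e.g. "PF00051.10"
    copies: int      # number of copies of this domain type in the protein
    seq_start: int   # first residue of the domain (protein numbering)
    seq_end: int     # last residue of the domain (protein numbering)
    hmm_start: int   # first aligned position of the HMM
    hmm_end: int     # last aligned position of the HMM
    score: float     # score of the match
    evalue: float    # E-value of the match


def parse_domain_row(row: str) -> DomainHit:
    """Parse one row; the nine-field whitespace layout is an assumption."""
    fields = row.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {row!r}")
    model, acc, copies, s0, s1, h0, h1, score, evalue = fields
    return DomainHit(model, acc, int(copies), int(s0), int(s1),
                     int(h0), int(h1), float(score), float(evalue))


# Example row built from the values quoted in the caption.
hit = parse_domain_row("ls PF00051.10 1 25 106 1 85 84.6 2.1e-24")
domain_length = hit.seq_end - hit.seq_start + 1  # residues covered in the protein
```

Keeping the protein coordinates and the HMM coordinates as separate fields makes it straightforward to compare how much of the protein and how much of the model each match covers.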
Figure 4. MisPred analysis of a protein sequence for potential sequence errors. The sequence shown in Figure 1 was analysed with the various MisPred tools. This figure summarizes the conclusions: the sequence violates only one of the MisPred rules, namely that the size of one of its domains deviates significantly from the size typical of the given domain family. Note that conflict 11 is missing from the list of sequence error types because MisPred tool 11 is not yet available in searches on the MisPred website; it will be released in the next update of MisPred.
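The idea behind a domain-size rule can be sketched as a simple relative-deviation check. This is only an illustration of the concept stated in the caption, not the actual criterion used by MisPred tool 4: the function name, the `tolerance` threshold and the example lengths are all hypothetical.

```python
def is_domain_size_deviant(observed_len: int, typical_len: int,
                           tolerance: float = 0.4) -> bool:
    """Return True if the observed domain length differs from the typical
    length of its family by more than `tolerance` (as a fraction of the
    typical length). The default tolerance is an arbitrary, assumed value."""
    if typical_len <= 0:
        raise ValueError("typical_len must be positive")
    relative_deviation = abs(observed_len - typical_len) / typical_len
    return relative_deviation > tolerance


# Illustrative numbers only: a fragment covering 30 residues of a family
# whose members are typically about 80 residues long would be flagged,
# while a 78-residue instance would not.
truncated_flagged = is_domain_size_deviant(30, 80)
near_full_flagged = is_domain_size_deviant(78, 80)
```

A symmetric check like this would flag both truncated and abnormally long domains; a real tool would need a family-specific estimate of the typical length and a statistically justified threshold.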