Summary of the methods that are used to derive the DescribePROT database

DescribePROT uses 11 accurate and complementary tools that cover sequence conservation and a wide variety of putative structural and functional properties of proteins at the amino acid level. We discuss the quality of the results generated by these tools (listed alphabetically) based on their published results. For further details, follow the references and the PMID links to access the publications directly.

    ASAquick (Prediction of the solvent accessible surface area)[PMID:27787824]
      This method relies on features extracted directly from the protein sequence, avoiding the use of computationally expensive multiple sequence alignment, which considerably speeds up its prediction process. The predictive performance of ASAquick was measured using Pearson’s correlation coefficient (PCC) between native and predicted solvent accessibility values [1]. PCC ranges from -1 to 1, where 0 indicates no correlation between the variables and higher positive values indicate stronger predictive performance. When assessed on a low sequence similarity test dataset, ASAquick achieves PCC = 0.66, which indicates a strong positive correlation between the predicted and native values [1].
    DFLpred (Prediction of disordered flexible linker residues)[PMID:27307636]
      DFLpred is a very fast method that predicts disordered linker residues in protein sequences. Performance of this method was tested on a low sequence similarity benchmark dataset using the Area under the Receiver Operating Characteristic Curve (AUC), a popular metric that evaluates the real-valued propensity scores that are generated by this tool [2]. AUC of 0.5 suggests random levels of predictive performance, while values closer to 1 indicate stronger predictive quality. DFLpred secures AUC = 0.72, outperforming several alternative tools that include methods for the prediction of flexible linkers, flexible residues, intrinsically disordered residues and various combinations of these methods [2].
    DisoRDPbind (Prediction of disordered nucleic acids and protein binding residues)[PMID:26109352]
      This is a fast and accurate method that predicts disordered residues that bind DNA, RNA and proteins. This method was evaluated on several low sequence similarity test datasets using the AUC metric [3, 4]. Depending on the binding partners and test dataset, DisoRDPbind’s AUCs range between 0.66 and 0.72 for protein binding, 0.64 and 0.67 for DNA binding, and 0.66 and 0.67 for RNA binding [3, 4]. Moreover, DisoRDPbind secured the second-best result for the prediction of the disordered binding residues in the Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment, with AUC = 0.73 [5].
    DRNApred (Prediction of nucleic acids binding residues)[PMID:28132027]
      This is a fast tool that predicts DNA-binding and RNA-binding residues in protein sequences. The predictive performance of DRNApred was assessed on a low sequence similarity benchmark dataset using the AUC metric [6]. It secures AUCs of 0.78 and 0.68 for the prediction of the DNA-binding and RNA-binding amino acids, respectively, and was empirically shown to produce few cross-predictions (i.e., predictions where DNA-binding residues are confused for RNA-binding residues and vice versa) [6].
    flDPnn (Prediction of intrinsically disordered residues) [PMID:34290238]
      flDPnn is a fast and accurate method that predicts intrinsically disordered amino acids in protein sequences. When tested on a low sequence similarity test dataset, flDPnn obtains AUC of 0.84 [7]. Moreover, flDPnn was shown to outperform 42 other disorder predictors on the DisProt dataset in the Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment, securing AUC = 0.81 [5].
    MMseqs2 (Fast sequence alignment) [PMID:30615063]
      MMseqs2 produces accurate multiple sequence alignments of protein and nucleotide sequences in a few seconds [8]. When evaluated on non-redundant datasets against popular alignment tools like BLAST, DIAMOND and HMMER3, MMseqs2 produces results about 30 times faster than BLAST and DIAMOND and 300 times faster than HMMER3. Moreover, MMseqs2’s sequence-to-profile searches secure 87% sensitivity at 95% precision [8].
    MoRFchibi_Light (Prediction of MoRF regions) [PMID:27174932]
      This tool predicts MoRF regions, which are disordered regions that fold upon binding with peptides and proteins. Using low sequence similarity test datasets, MoRFchibi_Light obtains AUC of 0.87 for short MoRFs and 0.77 for long MoRFs (more than 30 residues) and is reported to process more than 10,500 residues per minute [9]. Moreover, MoRFchibi_Light was shown to be the third most accurate method for the prediction of disordered binding regions in the Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment [5].
    MusiteDeep (Prediction of PTM sites) [PMID:32324217]
      This method predicts nine major types of posttranslational modifications (PTMs) [10]. When assessed on a low sequence similarity test set, MusiteDeep offers AUCs of 0.93 for phosphorylation, 0.96 for glycosylation, 0.80 for ubiquitination, 0.99 for SUMOylation, 0.98 for acetylation, 0.94 for methylation, 0.86 for hydroxylation, 0.98 for pyrrolidone carboxylic acid, and 0.96 for palmitoylation [11]. The overall AUC across all PTM types is 0.93, representing state-of-the-art performance for PTM prediction [11].
    PSIPRED (Prediction of secondary structures) [PMID:31251384]
      This popular and accurate tool predicts the 3-state secondary structure of proteins (helix, strand and coil). The prediction quality of the method is assessed using the Q3 score, which measures the overall accuracy of predicting the three types of secondary structures and ranges between 0 and 100% [12]. A recent assessment of PSIPRED reveals that it secures a Q3 score of 84.2% [12].
    SCRIBER (Prediction of protein binding residues) [PMID:31510679]
      SCRIBER predicts protein-binding amino acids in protein chains. When evaluated on a test dataset of low sequence similarity proteins, this tool obtains AUC = 0.72 and accuracy = 0.82, outperforming seven other popular predictors of protein-binding residues [13]. SCRIBER also secures the lowest cross-prediction rate of 0.116, where this rate is defined as the fraction of the other types of binding residues (DNA-, RNA- and small ligand-binding residues) that are cross-predicted as protein-binding residues [13].
    SignalP (Prediction of signal peptides) [PMID:30778233]
      This tool predicts signal peptides, which are short amino acid stretches found taxonomy-wide in the N-terminus of a nascent polypeptide chain that target membrane-bound export systems. Empirical tests of SignalP’s predictive performance that rely on low sequence similarity proteins report the Matthews correlation coefficient (MCC) [14]. MCC values range between -1 and 1, where values closer to 1 indicate higher predictive quality. Depending on the taxonomic group and the signal peptide type, SignalP’s MCC ranges between 0.938 and 0.977 for Archaea, 0.907 and 0.981 for gram-negative Bacteria, 0.890 and 0.957 for gram-positive Bacteria, and 0.966 for Eukaryota [14].
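The four quality measures quoted in the list above (PCC, AUC, MCC and Q3) can be illustrated with a minimal Python sketch. It is not the evaluation code used by any of the listed tools: the function names and the toy inputs are hypothetical, and the AUC is computed with the simple pairwise-comparison definition, which is only practical for small inputs.

```python
import math


def pearson_cc(native, predicted):
    """Pearson's correlation coefficient between two equal-length value lists."""
    n = len(native)
    mean_x = sum(native) / n
    mean_y = sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(native, predicted))
    var_x = sum((x - mean_x) ** 2 for x in native)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    return cov / math.sqrt(var_x * var_y)


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mcc(labels, predictions):
    """Matthews correlation coefficient for binary labels and binary predictions."""
    tp = sum(1 for lab, p in zip(labels, predictions) if lab == 1 and p == 1)
    tn = sum(1 for lab, p in zip(labels, predictions) if lab == 0 and p == 0)
    fp = sum(1 for lab, p in zip(labels, predictions) if lab == 0 and p == 1)
    fn = sum(1 for lab, p in zip(labels, predictions) if lab == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def q3(native_ss, predicted_ss):
    """Percentage of residues whose 3-state label (H, E or C) is predicted correctly."""
    matches = sum(1 for a, b in zip(native_ss, predicted_ss) if a == b)
    return 100.0 * matches / len(native_ss)


# Toy inputs, invented for illustration only.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.1]
toy_binary = [1 if s >= 0.5 else 0 for s in toy_scores]  # threshold is an assumption

print(f"PCC = {pearson_cc([1, 2, 3], [2, 4, 6]):.3f}")  # PCC = 1.000
print(f"AUC = {auc(toy_labels, toy_scores):.3f}")        # AUC = 0.889
print(f"MCC = {mcc(toy_labels, toy_binary):.3f}")        # MCC = 0.333
print(f"Q3  = {q3('HHHEEECCC', 'HHHEECCCC'):.1f}%")      # Q3  = 88.9%
```

PCC and AUC are computed on real-valued outputs, whereas MCC and Q3 need discrete predictions, so the threshold applied to the scores affects MCC but not AUC.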

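The cross-prediction rate quoted for SCRIBER is described above as the fraction of the other types of binding residues (DNA-, RNA- and small ligand-binding) that are predicted as protein-binding. The sketch below follows that definition. The function name, the residue-type labels and the toy data are hypothetical and are not taken from the SCRIBER publication.

```python
def cross_prediction_rate(residue_types, predicted_protein_binding):
    """Fraction of non-protein binding residues predicted as protein-binding.

    residue_types: one label per residue; the labels "DNA", "RNA" and "ligand"
        (hypothetical names) mark residues that bind other partner types.
    predicted_protein_binding: 1 if the residue is predicted as protein-binding,
        otherwise 0.
    """
    other_partners = {"DNA", "RNA", "ligand"}
    flags = [
        pred
        for kind, pred in zip(residue_types, predicted_protein_binding)
        if kind in other_partners
    ]
    if not flags:
        raise ValueError("no DNA-, RNA- or ligand-binding residues in the input")
    return sum(flags) / len(flags)


# Toy annotation and predictions, invented for illustration only.
toy_types = ["protein", "DNA", "RNA", "none", "ligand", "DNA", "protein", "RNA"]
toy_preds = [1, 0, 1, 0, 0, 0, 1, 0]

print(cross_prediction_rate(toy_types, toy_preds))  # 0.2
```

Lower values are better: a rate of 0 means that no residue binding another partner type was mistaken for a protein-binding residue.
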
References

    [1] Faraggi E, Zhou YQ, Kloczkowski A. Accurate single-sequence prediction of solvent accessible surface area using local and global features. Proteins. 2014;82:3170-6.
    [2] Meng F, Kurgan L. DFLpred: High-throughput prediction of disordered flexible linker regions in protein sequences. Bioinformatics. 2016;32:i341-i50.
    [3] Peng ZL, Kurgan L. High-throughput prediction of RNA, DNA and protein binding regions mediated by intrinsic disorder. Nucleic Acids Research. 2015;43.
    [4] Peng ZL, Wang C, Uversky VN, Kurgan L. Prediction of Disordered RNA, DNA, and Protein Binding Regions Using DisoRDPbind. Prediction of Protein Secondary Structure. 2017;1484:187-203.
    [5] Necci M, Piovesan D, CAID Predictors, DisProt Curators, Tosatto SCE. Critical assessment of protein intrinsic disorder prediction. Nat Methods. 2021;18:472-81.
    [6] Yan J, Kurgan L. DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Research. 2017;45.
    [7] Hu G, Katuwawala A, Wang K, Wu ZH, Ghadermarzi S, Gao JZ, et al. flDPnn: Accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat Commun. 2021;12.
    [8] Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35:2856-8.
    [9] Malhis N, Jacobson M, Gsponer J. MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Research. 2016;44:W488-W93.
    [10] Wang DL, Zeng S, Xu CH, Qiu WR, Liang YC, Joshi T, et al. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics. 2017;33:3909-16.
    [11] Wang DL, Liu DP, Yuchi JK, He F, Jiang YX, Cai ST, et al. MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Research. 2020;48:W140-W6.
    [12] Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology. 1999;292:195-202.
    [13] Zhang J, Kurgan L. SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics. 2019;35:i343-i53.
    [14] Armenteros JJA, Tsirigos KD, Sonderby CK, Petersen TN, Winther O, Brunak S, et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol. 2019;37:420-3.