CRpred method for in-silico prediction of catalytic residues

 
This web page provides the datasets and the prediction model associated with the following publication:

Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan LA, 2008. Accurate sequence-based prediction of catalytic residues. Bioinformatics 24(20):2329-2338.

  1. Datasets
    Each dataset is packaged into one zip file containing all of its sequences. Each sequence is stored in a separate file, in which each line represents one residue and contains the following fields, separated by colons:
    1) residue type in single letter encoding (Column 1)
    2) catalytic annotation (Column 2)
      1 represents a catalytic residue; -1 represents a non-catalytic residue.
    3) feature values computed for that residue
      13 selected ResType features (Columns 3-15).
      166 selected PSSM features (Columns 16-181).
      9 selected EntWOP features (Columns 182-190).
      15 selected CRPair features (Columns 191-205).
      7 selected AveCH features (Columns 206-212).
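
The column layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes that the fields are separated by a colon exactly as described above and that every feature value parses as a number. The names `Residue`, `FEATURE_GROUPS` and `parse_residue_line` are hypothetical and are not part of CRpred.

```python
from dataclasses import dataclass
from typing import Dict, List

# Feature groups and their widths, in the column order listed above.
# 13 + 166 + 9 + 15 + 7 = 210 feature columns, after the 2 leading columns.
FEATURE_GROUPS = [
    ("ResType", 13),
    ("PSSM", 166),
    ("EntWOP", 9),
    ("CRPair", 15),
    ("AveCH", 7),
]


@dataclass
class Residue:
    residue_type: str                 # Column 1: single-letter residue code
    is_catalytic: bool                # Column 2: 1 -> True, -1 -> False
    features: Dict[str, List[float]]  # Columns 3-212, grouped by feature type


def parse_residue_line(line: str, sep: str = ":") -> Residue:
    """Split one line of a sequence file into its annotated fields.

    The separator is an assumption taken from the description above;
    pass a different `sep` if the files use another delimiter.
    """
    fields = [f.strip() for f in line.strip().split(sep)]
    expected = 2 + sum(width for _, width in FEATURE_GROUPS)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, found {len(fields)}")

    label = int(float(fields[1]))
    if label not in (1, -1):
        raise ValueError(f"unexpected catalytic annotation: {fields[1]!r}")

    values = [float(v) for v in fields[2:]]
    features: Dict[str, List[float]] = {}
    start = 0
    for name, width in FEATURE_GROUPS:
        features[name] = values[start:start + width]
        start += width
    return Residue(fields[0], label == 1, features)
```

Calling `parse_residue_line` on each line of a sequence file yields one `Residue` per line, so the catalytic residues of a chain can be collected with a simple filter on `is_catalytic`.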
     
    The EF fold dataset can be downloaded from here: EF_fold
    The EF superfamily dataset can be downloaded from here: EF_superfamily
    The EF family dataset can be downloaded from here: EF_family
    The HA superfamily dataset can be downloaded from here: HA_superfamily
    The NN dataset can be downloaded from here: NN
    The PC dataset can be downloaded from here: PC
    The T-124 dataset can be downloaded from here: T-124
    The T-37 dataset can be downloaded from here: T-37
     
    The ST-1109 dataset is used for statistical analysis. The list of 1109 protein chains it contains is given here:
    ST-1109
     
  2. Prediction model
    The model is stored in WEKA's format and implements a Support Vector Machine classifier with an RBF kernel.
    It can be downloaded from here: CRPred model.

  3. Instructions to perform predictions with CRpred
    The user should use the following procedure:
  1. Download and install the WEKA platform. This free, open-source platform can be downloaded from here: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
  2. Download the wrapper class for the libsvm tools and add it to the WEKA classpath. Detailed information can be found here: http://www.cs.iastate.edu/~yasser/wlsvm/
  3. Download the CRPred model and save it in the root folder where WEKA was installed.
  4. In the same folder, create a file that stores the input for the prediction. An example file can be downloaded from here: example input.
    Note that this file includes the values of the five types of features plus the class label (the classification target: 1 for a catalytic residue, -1 for a non-catalytic residue). The class label can be used to automate evaluation of the prediction results; use dummy values if the true outcomes are unknown. The file can include multiple lines of data, which allows multiple sequences to be predicted in a single run.
  5. Open a command-line window and navigate to the directory where the model and the input file are located.
  6. Execute the following command (the classpath entries are separated by ";" as on Windows; on Linux or macOS use ":" instead):
    java -classpath weka.jar;libsvm.jar;wlsvm.jar weka.classifiers.functions.LibSVM -l CRPred.model -T example.arff -p 0
    where weka.classifiers.functions.LibSVM names the class (engine) that runs the Support Vector Machine classifier, -l specifies the location of the file with the prediction model, -T specifies the location of the file with the data to predict, and -p specifies how the results are displayed.
    Additional help with respect to command line execution of models in WEKA can be found here:
    http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
  7. Read the prediction(s) from the screen. 
    The first column provides the serial number, the second column provides the actual class label (taken from the input file), the third column provides the predicted class label, and the last column provides the probability estimate associated with the prediction. Incorrect predictions are marked with "+". The output for the provided example contains two predictions:
     
    "1 2:-1 1:1 + 0.587" means that the sample with serial number 1 is predicted as a catalytic residue (labeled "1") with a probability of 0.587, while the actual class stored in the input file is a non-catalytic residue (labeled "-1"). The "+" indicates that the prediction is incorrect. Here "2:-1" denotes the non-catalytic residue label ("-1"), which is the second ("2:") class in this two-class problem, and "1:1" denotes the catalytic residue label (the latter "1"), which is the first ("1:") class.
     
    "2 1:1 1:1 0.995" means that the sample with serial number 2 is predicted as a catalytic residue (labeled "1") with a probability of 0.995, and the actual class stored in the input file is also a catalytic residue.