CRPred method for in-silico prediction of catalytic residues

This web page provides the datasets and the prediction model associated with the following publication:

Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan LA, 2008. Accurate sequence-based prediction of catalytic residues. Bioinformatics, 24(20):2329-2338.

Datasets
Each dataset is packaged as a single zip file containing all of its sequences. Each sequence is stored in its own file, in which each line represents one residue and holds the following information, separated by colons:
1) residue type in single letter encoding (Column 1)
2) catalytic annotation (Column 2)
3) feature values computed for that residue
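The file layout described above could be read with a short script such as the following. This is only a minimal sketch: it assumes that the feature values are themselves colon-separated and numeric, and it keeps the catalytic annotation as raw text because its exact encoding is not stated here. The names Residue, parse_residue_line and parse_sequence_file are hypothetical and are not part of the CRpred distribution.

```python
from typing import List, NamedTuple


class Residue(NamedTuple):
    residue_type: str      # single-letter residue code (column 1)
    annotation: str        # catalytic annotation, kept as raw text (column 2)
    features: List[float]  # remaining columns, assumed to be numeric


def parse_residue_line(line: str) -> Residue:
    """Split one colon-separated line into residue, annotation and features."""
    fields = line.strip().split(":")
    if len(fields) < 2:
        raise ValueError(f"expected at least two columns, got: {line!r}")
    return Residue(fields[0], fields[1], [float(v) for v in fields[2:]])


def parse_sequence_file(path: str) -> List[Residue]:
    """Read one sequence file, one residue per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [parse_residue_line(line) for line in handle if line.strip()]
```

Calling parse_sequence_file on one of the unzipped sequence files would then return one Residue record per line, in sequence order.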
The EF fold dataset can be downloaded from here: EF_fold
The EF superfamily dataset can be downloaded from here: EF_superfamily
The EF family dataset can be downloaded from here: EF_family
The HA superfamily dataset can be downloaded from here: HA_superfamily
The NN dataset can be downloaded from here: NN
The PC dataset can be downloaded from here: PC
The T-124 dataset can be downloaded from here: T-124
The T-37 dataset can be downloaded from here: T-37

The ST-1109 dataset is used for statistical analysis. The list of 1109 protein chains it contains is given here: ST-1109
Prediction model
The model is stored in WEKA's format and implements a Support Vector Machine classifier with an RBF kernel.
It can be downloaded from here: CRPred model.
Instructions to perform predictions with CRpred
The user should use the following procedure:

Download and install the WEKA platform. This free, open source platform can be downloaded from here: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
Download the wrapper class for the libsvm tools and add it to the WEKA classpath. Detailed information can be found here: http://www.cs.iastate.edu/~yasser/wlsvm/
Download the CRPred model and save it in the root folder where WEKA was installed.
In the same folder, create a file that stores the input for the prediction. An example file can be downloaded from here: example input.
Note that this file includes the values of the five types of features plus the class label (the classification target: 1 for a catalytic residue, -1 for a non-catalytic residue). The class label can be used to automate evaluation of the prediction results; the user can supply dummy values if the true outcomes are unknown. The file can include multiple lines of data, which allows predicting multiple sequences in a single run.
Open a command line window and navigate to the directory where the model and the input file are located.
Execute the following command:
java -classpath weka.jar;libsvm.jar;wlsvm.jar weka.classifiers.functions.LibSVM -l CRPred.model -T example.arff -p 0
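where weka.classifiers.functions.LibSVM specifies the engine that runs the Support Vector Machine classifier, -l specifies the location of the file with the prediction model, -T specifies the location of the file with the data to predict, and -p specifies how the results are displayed (-p 0 prints the predictions without any additional attribute values). The semicolons in the classpath are the Windows path separator; on Linux and macOS use colons instead (weka.jar:libsvm.jar:wlsvm.jar).
Additional help with command line execution of models in WEKA can be found here: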
http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
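The classpath separator in the command above differs between operating systems, so one way to avoid editing it by hand is to assemble the command in a small script. The following is a minimal sketch, assuming that weka.jar, libsvm.jar, wlsvm.jar, CRPred.model and example.arff all sit in the current directory and that java is on the PATH. The function name build_crpred_command is hypothetical.

```python
import os
from typing import List


def build_crpred_command(model: str = "CRPred.model",
                         data: str = "example.arff") -> List[str]:
    """Assemble the WEKA command line with the platform's classpath separator."""
    jars = ["weka.jar", "libsvm.jar", "wlsvm.jar"]
    # os.pathsep is ";" on Windows and ":" on Linux/macOS, matching Java's rule.
    classpath = os.pathsep.join(jars)
    return [
        "java", "-classpath", classpath,
        "weka.classifiers.functions.LibSVM",
        "-l", model,   # file with the prediction model
        "-T", data,    # file with the data to predict
        "-p", "0",     # how the results are displayed
    ]
```

The resulting list can be passed to subprocess.run(build_crpred_command(), check=True) from the directory that holds the model and the input file.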
Read the prediction(s) from the screen.
The first column provides the serial number, the second column provides the actual class label (taken from the input file), the third column provides the predicted class label, and the last column provides the probability estimate associated with the prediction. Incorrect predictions are marked with "+". The output for the provided example has two predictions:

"1 2:-1 1:1 + 0.587", which means that the sample with serial number 1 is predicted as a catalytic residue (labeled as "1") with a probability of 0.587, while the actual class stored in the input file is a non-catalytic residue (labeled as "-1"). The "+" shows that this is an incorrect prediction. Each label is written as "class index:class value" because this is a two-class classification: "2:-1" represents a non-catalytic residue ("-1"), which is the second ("2:") class, and "1:1" represents a catalytic residue (the latter "1"), which is the first ("1:") class.

"2 1:1 1:1 0.995", which means that the sample with serial number 2 is predicted as catalytic residue (labeled as "1") with a probability of 0.995, while the actual class stored in the input file is also catalytic residue.
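When many residues are predicted in one run, the output lines described above can be parsed rather than read from the screen. The following is a minimal sketch that handles only the line layout shown in the two examples (serial number, actual label, predicted label, optional "+", probability); other WEKA versions may print a different layout. The names Prediction and parse_prediction_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Prediction(NamedTuple):
    serial: int         # serial number of the sample
    actual: str         # class label taken from the input file, e.g. "-1"
    predicted: str      # predicted class label, e.g. "1"
    is_error: bool      # True when the line carries the "+" marker
    probability: float  # probability estimate of the prediction


# serial, "index:actual", "index:predicted", optional "+", probability
_LINE = re.compile(
    r"^\s*(\d+)\s+\d+:(\S+)\s+\d+:(\S+)\s+(\+\s+)?([0-9.]+)\s*$"
)


def parse_prediction_line(line: str) -> Optional[Prediction]:
    """Return a Prediction, or None if the line is not a prediction row."""
    match = _LINE.match(line)
    if match is None:
        return None
    serial, actual, predicted, error_flag, prob = match.groups()
    return Prediction(int(serial), actual, predicted,
                      error_flag is not None, float(prob))
```

Lines that do not match the layout, such as header lines, yield None and can be skipped, so the captured output can be filtered with a simple loop to collect the residues predicted as catalytic.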