This web page provides the datasets and the prediction model associated with:
Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan LA, 2008. Accurate sequence-based prediction of catalytic residues. Bioinformatics, 24(20):2329-2338.
- Datasets
Each dataset is packaged into one zip file containing all the
sequences. Each sequence is stored in its own file, in which each line
represents one residue and contains the following fields, separated
by colons:
1) residue type in single-letter encoding (Column 1)
2) catalytic annotation (Column 2): 1 represents a catalytic residue; -1 represents a non-catalytic residue
3) feature values computed for that residue (Columns 3-212):
13 selected ResType features (Columns 3-15)
166 selected PSSM features (Columns 16-181)
9 selected EntWOP features (Columns 182-190)
15 selected CRPair features (Columns 191-205)
7 selected AveCH features (Columns 206-212)
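The column layout above can be sketched as a small parser. This is a minimal illustration, not code distributed with CRpred: the names `Residue`, `FEATURE_GROUPS`, and `parse_residue_line` are hypothetical, and the colon delimiter and the 212-column count are taken from the description above rather than checked against the actual files.

```python
from dataclasses import dataclass
from typing import Dict, List

# 1-based, inclusive column ranges, as listed in the dataset description.
FEATURE_GROUPS = {
    "ResType": (3, 15),
    "PSSM": (16, 181),
    "EntWOP": (182, 190),
    "CRPair": (191, 205),
    "AveCH": (206, 212),
}
N_COLUMNS = 212  # 2 annotation columns + 210 feature columns


@dataclass
class Residue:
    residue_type: str                 # single-letter code (Column 1)
    is_catalytic: bool                # True for label 1, False for label -1
    features: Dict[str, List[float]]  # feature-group name -> values


def parse_residue_line(line: str, sep: str = ":") -> Residue:
    """Parse one residue line; `sep` is assumed to be a colon."""
    fields = line.strip().split(sep)
    if len(fields) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(fields)}")
    label = int(fields[1])
    if label not in (1, -1):
        raise ValueError(f"unexpected catalytic annotation: {label}")
    # Convert each 1-based inclusive range into a 0-based slice.
    features = {
        name: [float(v) for v in fields[start - 1:end]]
        for name, (start, end) in FEATURE_GROUPS.items()
    }
    return Residue(fields[0], label == 1, features)
```

If the files turn out to use a different delimiter, only the `sep` argument needs to change.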
The EF fold dataset can be downloaded from here: EF_fold
The EF superfamily dataset can be downloaded from here: EF_superfamily
The EF family dataset can be downloaded from here: EF_family
The HA superfamily dataset can be downloaded from here: HA_superfamily
The NN dataset can be downloaded from here: NN
The PC dataset can be downloaded from here: PC
The T-124 dataset can be downloaded from here: T-124
The T-37 dataset can be downloaded from here: T-37
The ST-1109 dataset is used for statistical analysis. The list of the 1109
protein chains it contains is given here: ST-1109
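Since each dataset is a zip archive with one file per sequence, a quick sanity check after downloading is to count residues and catalytic residues per sequence. The sketch below uses Python's standard `zipfile` module; the function name `summarize_dataset` is hypothetical, and the colon delimiter and UTF-8 decoding are assumptions about the files rather than documented facts.

```python
import zipfile
from typing import Dict, Tuple


def summarize_dataset(zip_source, sep: str = ":") -> Dict[str, Tuple[int, int]]:
    """Map each sequence file to (residue count, catalytic residue count).

    `zip_source` may be a path to a downloaded archive or a file-like object.
    """
    summary = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            total = 0
            catalytic = 0
            with archive.open(info) as handle:
                for raw in handle:
                    line = raw.decode("utf-8", errors="replace").strip()
                    if not line:
                        continue  # skip blank lines
                    fields = line.split(sep)
                    if len(fields) < 2:
                        raise ValueError(f"malformed line in {info.filename}")
                    total += 1
                    # Column 2 holds the annotation: 1 = catalytic, -1 = not.
                    if fields[1].strip() == "1":
                        catalytic += 1
            summary[info.filename] = (total, catalytic)
    return summary
```

For a downloaded archive, a call such as `summarize_dataset("EF_fold.zip")` would return one entry per sequence file; the archive name here is illustrative only.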
- Prediction model
The model is stored in WEKA's format and implements a Support Vector Machine
classifier with an RBF kernel.
It can be downloaded from here: CRPred model.
- Instructions to perform predictions with CRpred
The user should follow this procedure:
- Download and install the WEKA platform.
This free, open-source platform can be downloaded from here: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
- Download the wrapper class for the libsvm tools and add it to the WEKA classpath. Detailed
information can be found here: http://www.cs.iastate.edu/~yasser/wlsvm/
- Download the CRPred model and save it in the root folder where WEKA was installed.
- In the same folder, create a file that stores the input for the prediction. An example file can be downloaded from here: example input.
Note that this file includes the values of the five types of features plus the
class label (the classification target: 1 for a catalytic residue, -1 for a
non-catalytic residue). The label can be used to automate evaluation of
the prediction results; use dummy values if the true outcomes
are unknown. The file can include multiple lines of data,
which allows predicting multiple sequences in a single run.
- Open a command-line window and navigate to the
directory where the model and the input file are located.
- Execute the following command:
java -classpath weka.jar;libsvm.jar;wlsvm.jar weka.classifiers.functions.LibSVM -l CRPred.model -T example.arff -p 0
where weka.classifiers.functions.LibSVM names the class that runs the Support
Vector Machine classifier, -l specifies the location of the file with the
prediction model, -T specifies the location of the file with the data to
predict, and -p specifies how the results are displayed. The ";" classpath
separator shown here is the Windows form; on Linux or macOS, use ":" instead.
Additional help with command-line execution of models in
WEKA can be found here: http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
- Read the prediction(s) from the screen.
The first column provides the serial number, the second column the
actual class label (taken from the input file), the third column the
predicted class label, and the last column the probability estimate
associated with the prediction. Incorrect predictions are marked with "+".
The output for the provided example has two predictions:
"1 2:-1 1:1 + 0.587" means that the sample with serial number 1
is predicted as a catalytic residue (labeled "1") with a probability
of 0.587, while the actual class stored in the input file is
non-catalytic (labeled "-1"). The "+" shows that the prediction is
incorrect. This is a two-class classification: in "2:-1", the label "-1"
(non-catalytic residue) is the second ("2:") class, and in "1:1", the
label "1" (catalytic residue) is the first ("1:") class.
"2 1:1 1:1 0.995" means that the sample with serial number 2 is
predicted as a catalytic residue (labeled "1") with a probability of
0.995, and the actual class stored in the input file is also
catalytic residue.
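For more than a handful of residues, reading predictions off the screen becomes tedious, so the output lines can be parsed instead. The sketch below assumes each prediction line has exactly the layout of the two examples above (serial number, actual "index:label", predicted "index:label", an optional "+", probability). `Prediction` and `parse_prediction_line` are hypothetical names, and other WEKA versions may print header lines or extra columns that this parser does not handle.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    serial: int         # serial number of the sample
    actual: int         # class label from the input file (1 or -1)
    predicted: int      # class label assigned by the model (1 or -1)
    is_error: bool      # True when the line carries the "+" marker
    probability: float  # probability estimate for the prediction


def _label(token: str) -> int:
    """Extract the class label from an 'index:label' token such as '2:-1'."""
    return int(token.split(":", 1)[1])


def parse_prediction_line(line: str) -> Prediction:
    tokens = line.split()
    if len(tokens) == 5:
        if tokens[3] != "+":
            raise ValueError(f"unexpected marker: {tokens[3]!r}")
        is_error = True
    elif len(tokens) == 4:
        is_error = False
    else:
        raise ValueError(f"cannot parse prediction line: {line!r}")
    return Prediction(
        serial=int(tokens[0]),
        actual=_label(tokens[1]),
        predicted=_label(tokens[2]),
        is_error=is_error,
        probability=float(tokens[-1]),
    )
```

Applied to the two example lines, the first yields a predicted label of 1 against an actual label of -1 with the error flag set, and the second yields matching labels of 1 with no error flag.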