This web page provides appendices associated with
Kurgan LA, Cios KJ, Chen K, 2008. SCPRED: Accurate Prediction of
Protein Structural Class for Sequences of Twilight-zone Similarity with
Predicting Sequences. BMC Bioinformatics, 9:226
- Datasets
The datasets are provided in two alternative formats: (1) the original
sequences; and (2) feature values computed for the sequences.
The datasets with the sequences are provided in comma-separated values
(CSV) format and include the following information: (1) the PDB ID of each
sequence (including the location of a domain, if applicable), (2) the
sequence in single-letter encoding, (3) the secondary structure predicted
with PSI-PRED, and (4) the corresponding actual structural class (to
validate the predictions). The class is encoded as follows: a =
all-alpha, b = all-beta, c = alpha/beta, and d = alpha+beta.
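As a rough illustration of how such a file could be read, the following Python sketch maps the class codes to their names. It assumes the four fields appear as columns in the order listed above and that the file has no header row; neither is confirmed here, so check the downloaded file first. The function name read_sequence_rows and the sample row are made up for this example. The FC699 file also carries fold information, so its layout may differ.

```python
import csv
import io

# Class codes as defined on this page.
CLASS_NAMES = {
    "a": "all-alpha",
    "b": "all-beta",
    "c": "alpha/beta",
    "d": "alpha+beta",
}


def read_sequence_rows(handle):
    """Yield one dict per CSV row.

    Assumption: four columns in the order PDB ID, sequence,
    predicted secondary structure, class code; no header row.
    """
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        pdb_id, sequence, predicted_ss, class_code = (
            field.strip() for field in row[:4]
        )
        yield {
            "pdb_id": pdb_id,
            "sequence": sequence,
            "predicted_ss": predicted_ss,
            "class_code": class_code,
            "class_name": CLASS_NAMES.get(class_code, "unknown"),
        }


# Made-up row for illustration only; not taken from the datasets.
sample = io.StringIO("1ABC,MKVLA,CHHHC,a\n")
rows = list(read_sequence_rows(sample))
print(rows[0]["pdb_id"], rows[0]["class_name"])
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, for example open(path, newline="").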
The datasets with the features are provided as ARFF files, which is
the format required by the WEKA platform that is used to carry out the
predictions. The files include the nine features computed from the
sequences and the predicted secondary structure, and the corresponding
actual structural class (to validate the predictions).
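For readers who want to inspect the feature files without WEKA, a minimal ARFF reader can be sketched in Python as below. It handles only the simple dense case (unquoted attribute names, comma-separated data rows) and is not a full ARFF parser. The attribute names f1 and f2 and the values in the example are hypothetical; the actual files contain nine features whose names are defined in the files themselves.

```python
def read_arff(lines):
    """Return (attribute_names, rows) from an iterable of ARFF text lines.

    Simplified sketch: assumes unquoted attribute names and dense,
    comma-separated data rows.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or ARFF comment
        lowered = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lowered.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            attributes.append(line.split()[1])
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical two-feature example; the real files have nine features.
example = """\
@relation demo
@attribute f1 numeric
@attribute f2 numeric
@attribute class {a,b,c,d}
@data
0.12,0.30,a
0.05,0.41,b
""".splitlines()

names, data = read_arff(example)
print(names)
print(data[0])
```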
The 25PDB dataset can be downloaded from here: sequences, features
The FC699 dataset can be downloaded from here: sequences, features
Note that the FC699 dataset with sequences also contains information about
the corresponding protein folds. This file contains all
domains (except the small proteins) that were used in the original
paper that introduced the PFRES fold classification system. A
list of protein chains that share sequence identity above 35% with
the 25PDB file (and thus were removed from the original dataset for the
purpose of testing SCPRED) is given here: FC699_high_identity
- Prediction model
The model is in WEKA's format and implements an RBF-kernel-based
Support Vector Machine classifier.
It can be downloaded from here: SCPRED model.
- Instructions to perform predictions with SCPRED
The user should follow this procedure:
- Download and install the WEKA platform.
This free, open-source platform can be downloaded from here: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
- Download and save the SCPRED model in the root folder where
WEKA was installed.
- In the same folder, create a file that stores the input for the
prediction. An example file can be downloaded from here: example input.
Note that this file includes the values of the nine features plus a dummy
class label (prediction), which could be used to automate evaluation of
the prediction results. The file can include multiple lines of data,
which allows predicting multiple sequences in a single run.
- Open a command line window and navigate to the
directory where the model and the input file are located.
- Execute the following command:
java -classpath "%CLASSPATH%;weka.jar" weka.classifiers.functions.SMO -l SCPRED.model -T example.arff -p 0
where weka.classifiers.functions.SMO
specifies the location of the engine that runs the Support Vector Machine
classifier, -l specifies the location of the file with the prediction
model, -T specifies the location of the file with the data to predict,
and -p specifies how the results are displayed.
Additional help with respect to command line execution of models in
WEKA can be found here:
http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
- Read the prediction from the screen.
The first column provides the input number, the second column provides
the predicted class, and the last column provides the class label
(dummy class label) provided in the file with input data.
The output for the provided example should read: "0 a 0.5 dummy_class",
which means that for sample number 0 the predicted class is a, while the
actual class stored in the input file was dummy_class.
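The command and the output format described above can be wrapped in a short Python sketch, which may help when predicting many sequences. The command as printed uses the Windows form of the classpath (%CLASSPATH% and ";"); the sketch uses os.pathsep so the separator becomes ":" on Linux and macOS. The function names build_command and parse_prediction are made up for this example, and the third output column (0.5 in the example) is not described on this page, so it is stored here only as an uninterpreted number.

```python
import os


def build_command(model="SCPRED.model", data="example.arff", weka_jar="weka.jar"):
    """Return the argument list for the WEKA call described above."""
    # os.pathsep is ";" on Windows and ":" on Linux/macOS.
    existing = os.environ.get("CLASSPATH", "")
    classpath = os.pathsep.join(p for p in (existing, weka_jar) if p)
    return [
        "java", "-classpath", classpath,
        "weka.classifiers.functions.SMO",
        "-l", model, "-T", data, "-p", "0",
    ]


def parse_prediction(line):
    """Split one output line such as '0 a 0.5 dummy_class'."""
    fields = line.split()
    return {
        "index": int(fields[0]),      # input number
        "predicted": fields[1],       # predicted class: a, b, c or d
        "score": float(fields[2]),    # third column, not described on this page
        "label": fields[-1],          # class label stored in the input file
    }


result = parse_prediction("0 a 0.5 dummy_class")
print(result["index"], result["predicted"], result["label"])
```

The list returned by build_command() can be passed to subprocess.run(..., capture_output=True, text=True) from the directory that holds weka.jar, the model, and the input file; each non-empty line of the captured output can then be handed to parse_prediction, assuming every line follows the format shown in the example.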