SCPRED method for in-silico prediction of protein structural classes

 
This web page provides the appendices associated with:


Kurgan LA, Cios KJ, Chen K, 2008. SCPRED: Accurate Prediction of Protein Structural Class for Sequences of Twilight-zone Similarity with Predicting Sequences. BMC Bioinformatics, 9:226

  1. Datasets
    The datasets are provided in two alternative formats: (1) the original sequences; and (2) feature values computed for the sequences.

    The datasets with the sequences are provided in comma-separated values (CSV) format and include the following information: (1) the PDBid of the sequence (including the location of a domain, if applicable), (2) the sequence in single-letter encoding, (3) the secondary structure predicted with PSI-PRED, and (4) the corresponding actual structural class (to validate the predictions). The class is encoded as follows: a = all-alpha, b = all-beta, c = alpha/beta, and d = alpha+beta.
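
The sequence files described above could be read with a short script such as the following. This is only a sketch: it assumes the four fields appear in the order listed, that there is no header row, and that any extra columns (for example fold information) come after the first four. The function name and the sample row are made up for illustration and are not taken from the datasets.

```python
import csv
import io

# Class codes as defined in the text above.
CLASS_NAMES = {
    "a": "all-alpha",
    "b": "all-beta",
    "c": "alpha/beta",
    "d": "alpha+beta",
}


def read_sequence_records(handle):
    """Yield one dict per CSV row (column order is an assumption)."""
    for row in csv.reader(handle):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        pdb_id, sequence, predicted_ss, class_code = (f.strip() for f in row[:4])
        yield {
            "pdb_id": pdb_id,
            "sequence": sequence,
            "predicted_ss": predicted_ss,
            "class_code": class_code,
            "class_name": CLASS_NAMES.get(class_code, "unknown"),
        }


# Made-up row for illustration only; replace with open("your_file.csv", newline="").
sample = io.StringIO("1abc_A,MKVLA,CHHHC,a\n")
records = list(read_sequence_records(sample))
print(records[0]["pdb_id"], records[0]["class_name"])
```

If the real files differ (for example, a header line or a different column order), the unpacking line is the only part that should need to change.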

    The datasets with the features are provided as ARFF files, which is the format required by the WEKA platform that is used to carry out the predictions. The files include the nine features computed from the sequences and the predicted secondary structure, and the corresponding actual structural class (to validate the predictions).
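
For readers unfamiliar with ARFF, the sketch below shows the general shape of such a file: a relation name, one attribute declaration per feature, a nominal class attribute, and a data section. The attribute names (feature_1 to feature_9), the relation name, and the feature values are hypothetical placeholders; the real names and the exact class declaration should be taken from the provided ARFF files.

```python
def to_arff(rows, feature_names, relation="scpred_features"):
    """Render (features, class_code) pairs as ARFF text."""
    lines = ["@relation " + relation, ""]
    lines += ["@attribute " + name + " numeric" for name in feature_names]
    # Nominal class attribute using the class codes a, b, c, d.
    lines.append("@attribute class {a,b,c,d}")
    lines += ["", "@data"]
    for features, class_code in rows:
        if len(features) != len(feature_names):
            raise ValueError("feature count does not match the header")
        lines.append(",".join(str(v) for v in features) + "," + class_code)
    return "\n".join(lines) + "\n"


# Hypothetical names and values, for illustration only.
names = ["feature_%d" % i for i in range(1, 10)]
arff_text = to_arff([([0.1] * 9, "a")], names)
print(arff_text)
```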

    The 25PDB dataset can be downloaded from here: sequences, features
    The FC699 dataset can be downloaded from here: sequences, features

    Note that the FC699 dataset with sequences also contains information about the corresponding protein folds. This file contains all domains (except the small proteins) that were used in the original paper that introduced the PFRES fold classification system. A list of protein chains that share sequence identity above 35% with the 25PDB file (and thus were removed from the original dataset for the purpose of testing SCPRED) is given here:
    FC699_high_identity
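
The removal of high-identity chains described above amounts to a simple set difference. The sketch below assumes that the FC699_high_identity list holds one identifier per line and that these identifiers match the PDBid field of the sequence file exactly (ignoring case); neither assumption is confirmed here. The identifiers and the function name are invented for illustration.

```python
def remove_high_identity(records, excluded_ids):
    """Return the records whose identifier is not in the exclusion list."""
    excluded = {i.strip().lower() for i in excluded_ids if i.strip()}
    return [r for r in records if r["pdb_id"].strip().lower() not in excluded]


# Invented identifiers, for illustration only.
all_records = [
    {"pdb_id": "1abc_A", "class_code": "a"},
    {"pdb_id": "2xyz_B", "class_code": "b"},
]
high_identity = ["2XYZ_B\n", "\n"]  # e.g. lines read from the list file
kept = remove_high_identity(all_records, high_identity)
print([r["pdb_id"] for r in kept])
```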

  2. Prediction model
    The model is in WEKA's format and implements a Support Vector Machine classifier with an RBF kernel.
    It can be downloaded from here: SCPRED model.

  3. Instructions to perform predictions with SCPRED
    To run a prediction, follow this procedure:
  1. Download and install the WEKA platform. This free, open-source platform can be downloaded from here: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html
  2. Download the SCPRED model and save it in the root folder where WEKA was installed.
  3. In the same folder, create a file that stores the input for the prediction. An example file can be downloaded from here: example input.
    Note that this file includes the values of the nine features plus a dummy class label (prediction), which can be used to automate the evaluation of the prediction results. The file can contain multiple data lines, which allows predicting multiple sequences in a single run.
  4. Open a command-line window and navigate to the directory where the model and the input file are located.
  5. Execute the following command:
    java -classpath "%CLASSPATH%;weka.jar" weka.classifiers.functions.SMO -l SCPRED.model -T example.arff -p 0
    where weka.classifiers.functions.SMO specifies the engine that runs the Support Vector Machine classifier, -l specifies the location of the file with the prediction model, -T specifies the location of the file with the data to predict, and -p specifies how the results are displayed.
    Additional help with respect to command line execution of models in WEKA can be found here:
    http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
  6. Read the prediction from the screen. 
    The first column gives the sample number, the second column gives the predicted class, and the last column gives the class label (the dummy class label) stored in the input file.
    The output for the provided example should read: "0 a 0.5 dummy_class", which means that for sample number 0 the predicted class is a, while the class label stored in the input file was dummy_class.
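
Steps 5 and 6 above can be scripted. The sketch below builds the same command as a list of arguments and parses an output line of the form shown in the example. Note that the command in step 5 uses Windows syntax (%CLASSPATH% and the ";" separator); the sketch uses os.pathsep so that the separator matches the operating system. The function names and default file names are illustrative, and the parser only assumes the column layout described in step 6.

```python
import os


def build_command(model="SCPRED.model", data="example.arff", weka_jar="weka.jar"):
    """Return the WEKA command from step 5 as an argument list."""
    # os.pathsep is ';' on Windows and ':' on Unix-like systems.
    parts = [os.environ.get("CLASSPATH", ""), weka_jar]
    classpath = os.pathsep.join(p for p in parts if p)
    return [
        "java", "-classpath", classpath,
        "weka.classifiers.functions.SMO",
        "-l", model, "-T", data, "-p", "0",
    ]


def parse_prediction(line):
    """Split one output line into (sample number, predicted class, input label)."""
    fields = line.split()
    # First column: sample number; second: predicted class;
    # last: class label stored in the input file.
    return int(fields[0]), fields[1], fields[-1]


command = build_command()
result = parse_prediction("0 a 0.5 dummy_class")
print(" ".join(command))
print(result)
```

To actually run the prediction, the argument list could be passed to subprocess.run(command, capture_output=True, text=True) from the folder that contains weka.jar, the model, and the input file, with each non-empty line of the captured output then passed to parse_prediction.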