fDETECT - Tutorial and Help Page

This page describes the fDETECT method and explains how to make predictions and interpret the results.

Description of the server

The fDETECT webserver predicts propensity for completion of material production, purification, crystallization and diffraction-quality crystallization for input protein sequence(s). The server accepts FASTA formatted protein sequence(s). For a given input sequence it produces numeric propensities for each of the four possible outcomes: failure of material production (MF), failure to purify (PF), failure to crystallize (CF) and success to yield diffraction-quality crystals (CR). fDETECT also provides the overall prediction which is based on the highest propensity.

The original version of fDETECT that predicts propensity for success to yield diffraction-quality crystals was published in [Acta Crystallographica D 2014, 70(11):2781-93]. This webserver includes predictions of all four outcomes which are based on predictive models designed using the same dataset and approach as in [Acta Crystallographica D 2014, 70(11):2781-93].

The fDETECT method is fast (less than 1 second per protein) and relies on four logistic regression-based predictive models, one for each of the four outcomes. The inputs to these models are computed based on instability, isoelectric point, polarity, estimated propensity to form secondary structures, characteristics of estimated solvent accessibility, hydrophobicity and energy estimated with amino acid indices collected from the AAIndex database, composition of certain amino acids, and complexity of the input sequence generated with the SEG algorithm.

The fDETECT method is recommended for users who require fast and accurate predictions for a large dataset of proteins. Based on empirical tests, this method offers the most accurate predictions for the failure of material production (MF). The PPCpred method, which can also be run using this website, offers slightly better predictions for the failure to purify (PF), failure to crystallize (CF), and success to yield diffraction-quality crystals (CR) outcomes, but at a much higher computational cost. The runtime of PPCpred is about three orders of magnitude (~1000 times) higher than the runtime of fDETECT.

How to run predictions

The following four easy steps should be followed to run predictions:

  1. Choose whether to use fDETECT, PPCpred or both. fDETECT is fast (prediction of one protein sequence takes less than one second) and you are allowed to submit up to 1000 proteins as the input. PPCpred is slower (prediction of one protein sequence takes up to several minutes) and you can enter up to 5 sequences if this method is selected.

  2. Copy and paste protein sequence(s) in the FASTA format into the text field (an "Example" button may be used to see an example input in the FASTA format).

  3. Provide an email address (optional). If an email address is provided, a notification email with a link to the results will be sent once the results are ready. Whether or not an email address is provided, the results are delivered directly via the browser. To make sure the results are delivered in the browser window, do not close or refresh the pages that appear after you click "Run fDETECT".

  4. Click the "Run fDETECT" button to start the predictions.
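
Before pasting a large input, it can be useful to check locally that it is valid FASTA and within the sequence limits given in step 1. The following is a minimal sketch, not part of the webserver: the function names read_fasta and check_submission are hypothetical, and it assumes that when both methods are selected the smaller PPCpred limit (5 sequences) applies.

```python
# Sequence limits as stated in step 1 of this page.
MAX_SEQUENCES = {"fDETECT": 1000, "PPCpred": 5}


def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text, methods=("fDETECT",)):
    """Parse the input and check it against the limit of the selected methods.

    Assumption: with several methods selected, the smallest limit applies.
    """
    records = read_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    limit = min(MAX_SEQUENCES[m] for m in methods)
    if len(records) > limit:
        raise ValueError(f"{len(records)} sequences submitted, limit is {limit}")
    return records
```

For example, check_submission(text, methods=("fDETECT", "PPCpred")) raises an error for an input with more than 5 sequences, while the default call accepts up to 1000.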

After clicking "Run fDETECT", the request will be added to a queue of currently processed jobs on the biomine server. Your request will be executed in the order in which it was received. You will be notified about your position in the queue and a notification will be displayed when the request is being processed. Once the processing is finished, you will be re-directed to a page with the link to the results (the same link will be sent by email if it was provided). The page with the results displays the complete results. Interpretation of these results is explained in the Interpretation of the results section.

Interpretation of the results

The results are provided in the CSV and HTML formats. The CSV files are downloadable and provide the results of predictions for each input sequence. Each line in this file represents a protein sequence, in the same order in which the sequences were entered for prediction. Each line includes multiple columns which show the protein ID, the sequence, the method used (fDETECT or PPCpred), the predicted outcome and the predicted propensities for each of the four outcomes. If both fDETECT and PPCpred are selected, the CSV file will show predictions in two lines for each protein, one for fDETECT and the other for PPCpred.

The predicted outcome is the outcome that has the highest propensity among the four outcomes. To ease interpretation of these propensities they are categorized into three levels: low, medium and high. These levels were determined by analyzing the propensities generated for each native outcome (MF, PF, CF and CR) on the benchmark dataset. The "low"/"high" category corresponds to the lowest/highest 20% of the propensities. The "medium" category covers the remaining propensities, between the 20th and 80th percentiles. For example, protein "CESG_GO_23886_CESG" (the first protein in the sample results below) has a high propensity score for the material production failure (0.599). This means the propensity generated for this protein is as high as the propensities generated for the 20% of native MF proteins that have the highest propensities. Thus, the user should have high confidence in this prediction. On the other hand, this protein has a low score for the success of the diffraction-quality crystallization step (0.173). This means that 0.173 is as low as the propensities generated for the 20% of native CR proteins that have the lowest propensities. Thus, the confidence in the prediction that this protein can be solved structurally by X-ray crystallography is low.
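
The rule that the predicted outcome is the one with the highest propensity can be sketched as follows. This is an illustration only: the function name predicted_outcome is hypothetical, and how the server resolves exact ties is not stated on this page (this sketch simply returns the first outcome in the MF, PF, CF, CR order).

```python
# Outcome codes as used on this page.
OUTCOMES = ("MF", "PF", "CF", "CR")


def predicted_outcome(propensities):
    """Return the outcome code with the highest propensity.

    `propensities` maps each outcome code to its numeric propensity.
    Ties go to the first code in OUTCOMES (an assumption of this sketch).
    """
    return max(OUTCOMES, key=lambda code: propensities[code])


# fDETECT propensities of sample protein CESG_GO_23886_CESG from this page.
first = {"MF": 0.599, "PF": 0.478, "CF": 0.323, "CR": 0.173}
print(predicted_outcome(first))  # MF
```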

Definition of the propensity levels for fDETECT:
  • MF outcome: low (0, 0.271]; medium (0.271, 0.497]; high >0.497
  • PF outcome: low (0, 0.237]; medium (0.237, 0.543]; high >0.543
  • CF outcome: low (0, 0.152]; medium (0.152, 0.553]; high >0.553
  • CR outcome: low (0, 0.255]; medium (0.255, 0.489]; high >0.489

Definition of the propensity levels for PPCpred:
  • MF outcome: low (0, 0.302]; medium (0.302, 0.737]; high >0.737
  • PF outcome: low (0, 0.160]; medium (0.160, 0.569]; high >0.569
  • CF outcome: low (0, 0.048]; medium (0.048, 0.311]; high >0.311
  • CR outcome: low (0, 0.322]; medium (0.322, 0.688]; high >0.688
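
The level definitions above can be applied to a raw propensity with a small lookup. The sketch below is an illustration, not the server's code: the names THRESHOLDS and propensity_level are hypothetical, and the cut-off values are copied from the two lists above. Upper bounds are inclusive, matching the "]" in the interval notation.

```python
# (upper bound of "low", upper bound of "medium") per method and outcome,
# copied from the definitions on this page.
THRESHOLDS = {
    "fDETECT": {
        "MF": (0.271, 0.497),
        "PF": (0.237, 0.543),
        "CF": (0.152, 0.553),
        "CR": (0.255, 0.489),
    },
    "PPCpred": {
        "MF": (0.302, 0.737),
        "PF": (0.160, 0.569),
        "CF": (0.048, 0.311),
        "CR": (0.322, 0.688),
    },
}


def propensity_level(method, outcome, value):
    """Map a propensity to "low", "medium" or "high".

    Note: the page defines "low" as the interval (0, bound]; this sketch
    also labels a value of exactly 0 as "low".
    """
    low_max, medium_max = THRESHOLDS[method][outcome]
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"


print(propensity_level("fDETECT", "MF", 0.599))  # high
print(propensity_level("PPCpred", "CR", 0.394))  # medium
```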

The CSV format can be easily parsed. We provide a simple parser for the CSV file, written in Python, here: parse_csv.zip. A sample of the CSV file is shown below. This example shows the fDETECT prediction results for the three sample proteins that are provided on the main webserver page.

| Protein ID | SEQ | Method | predicted class | material failed propensity (score level) | purification failed propensity (score level) | crystalization failed propensity (score level) | diffraction-quality crystallization success propensity (score level) |
|---|---|---|---|---|---|---|---|
| CESG_GO_23886_CESG | MHV... | fDETECT | Material Failed | 0.599 (high score) | 0.478 (medium score) | 0.323 (medium score) | 0.173 (low score) |
| NYSGXRC_10360i_NYS | VEW... | fDETECT | Crystallization Failed | 0.305 (medium score) | 0.456 (medium score) | 0.608 (high score) | 0.212 (low score) |
| CSGID_IDP01182_CSG | MIV... | fDETECT | Diffraction-quality crystallization success | 0.238 (low score) | 0.362 (medium score) | 0.104 (low score) | 0.598 (high score) |
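
Each propensity value in the sample above is written as a number followed by its level, for example "0.599 (high score)". The sketch below shows one way to split such a value into a float and a level name. It is not the parser distributed in parse_csv.zip; the names CELL and parse_propensity_cell are hypothetical, and because the field delimiter of the CSV file is not shown on this page, the sketch works on a single value only.

```python
import re

# Matches values such as "0.599 (high score)" as shown in the sample above.
CELL = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*\((low|medium|high) score\)\s*$")


def parse_propensity_cell(cell):
    """Split a value like "0.599 (high score)" into (0.599, "high")."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized propensity value: {cell!r}")
    return float(match.group(1)), match.group(2)


print(parse_propensity_cell("0.599 (high score)"))  # (0.599, 'high')
```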

The HTML page shows the predicted outcome for each protein, together with the predicted propensities for the four outcomes. The bar on the left side of each protein is color coded to indicate the predicted outcome: red corresponds to failure of material production, yellow to failure to purify, purple to failure to crystallize and green to successful diffraction-quality crystallization. The first line for each protein is the protein ID, followed by the name of the method used for the prediction and the overall color-coded prediction for this protein. The results also include the four predicted propensities for failure of material production, failure to purify, failure to crystallize and successful diffraction-quality crystallization. Each propensity is assigned a label (low, medium or high) that indicates the level of confidence in the corresponding prediction. The example below shows the HTML page with the fDETECT results for the first two sample proteins that are provided on the main webserver page.

CESG_GO_23886_CESG

fDETECT

Target CESG_GO_23886_CESG is predicted to fail to produce protein material.

The propensities for the four outcomes of crystallization are:
  • Propensity that production of protein material fails is 0.599 (high score).
  • Propensity that purification fails is 0.478 (medium score).
  • Propensity that crystallization fails is 0.323 (medium score).
  • Propensity that target yields diffraction-quality crystals is 0.173 (low score).

NYSGXRC_10360i_NYS

fDETECT

Target NYSGXRC_10360i_NYS is predicted to fail to crystallize.

The propensities for the four outcomes of crystallization are:
  • Propensity that production of protein material fails is 0.305 (medium score).
  • Propensity that purification fails is 0.456 (medium score).
  • Propensity that crystallization fails is 0.608 (high score).
  • Propensity that target yields diffraction-quality crystals is 0.212 (low score).

In the above example, the first protein is predicted to fail at the material production step. The propensity to fail to produce the protein material is 0.599, and this score has the label "high", which indicates a high level of confidence in this prediction. In other words, it is likely that this protein will fail at the material production step. The four propensities for this protein show that the prediction of the failure of protein material production has high confidence, the predictions of the failure to purify and the failure to crystallize have medium confidence, and the prediction of the ability to produce diffraction-quality crystals has low confidence. Altogether, since the propensity for the failure at the material production step has the highest score and high confidence, both of which are higher than the scores and confidence levels for the other three steps, the user should consider this prediction as a strong indication that the material production will fail.

The second protein is predicted to fail at the crystallization step. The propensity to fail at the crystallization step is 0.608, and this score has the label "high", which indicates a high level of confidence in this prediction. In other words, it is likely that this protein will fail at the crystallization step. The four propensities for this protein show that the predictions of the failure of protein material production and the failure of purification have medium confidence, the prediction of the failure of crystallization has high confidence, and the prediction of the ability to produce diffraction-quality crystals has low confidence. Altogether, since the propensity for the failure at the crystallization step has the highest score and high confidence, the user should consider this prediction as a strong indication that the crystallization step will fail.

The example below shows the HTML page with the results when a user chooses to include predictions from both fDETECT and PPCpred for the third sample protein provided on the main webserver page.

CSGID_IDP01182_CSG

fDETECT

Target CSGID_IDP01182_CSG is predicted to yield diffraction-quality crystals.

The propensities for the four outcomes of crystallization are:
  • Propensity that production of protein material fails is 0.238 (low score).
  • Propensity that purification fails is 0.362 (medium score).
  • Propensity that crystallization fails is 0.104 (low score).
  • Propensity that target yields diffraction-quality crystals is 0.598 (high score).

PPCpred

Target CSGID_IDP01182_CSG is predicted to yield diffraction-quality crystals.

The propensities for the four outcomes of crystallization are:
  • Propensity that production of protein material fails is 0.116 (low score).
  • Propensity that purification fails is 0.187 (medium score).
  • Propensity that crystallization fails is 0.035 (low score).
  • Propensity that target yields diffraction-quality crystals is 0.394 (medium score).

The third protein is predicted to yield diffraction-quality crystals by both fDETECT and PPCpred. The propensities to yield diffraction-quality crystals are 0.598 for fDETECT and 0.394 for PPCpred. These propensities are labeled as "high" for fDETECT and "medium" for PPCpred, which suggests high and medium levels of confidence in these predictions, respectively. The other three propensities for this protein show that the predictions of the failure of protein material production and the failure of crystallization have low confidence for both fDETECT and PPCpred, while the prediction of the failure of purification has medium confidence, again for both fDETECT and PPCpred. Given that the confidence level in the final prediction from fDETECT, which is success to yield diffraction-quality crystals, is high and the other three confidence levels are medium or low, the user should have a high degree of confidence in this prediction. On the other hand, given that the confidence level in the final prediction from PPCpred, which is success to yield diffraction-quality crystals, is medium and there is another outcome with a similar medium level of confidence (purification fails), the confidence in this prediction should be modest. Altogether, given the agreement between both methods, the user should be confident that this protein will likely yield diffraction-quality crystals.