FlexPred: a web-server for predicting residue positions
involved in conformational switches in proteins

Main page: http://flexpred.rit.albany.edu

IMPORTANT: Please submit jobs at least 3 seconds apart and do not submit more than 50 jobs at a time. Otherwise, the jobs will be deleted and your IP address will be blocked.

Summary

Conformational flexibility is an inherent property of protein structure. Large-scale changes in the conformation of the protein backbone play a key role in a variety of fundamental biological activities and have been implicated in a number of diseases. Protein backbone flexibility can be classified into two broad categories: (1) disordered flexibility observed in intrinsically unstructured segments that do not have a well-defined folded structure and (2) ordered flexibility observed in segments that switch from one specific folded backbone conformation to another. FlexPred is a web-server for predicting conformationally variable residue positions that may be involved in ordered conformational switches (if you are interested in protein disorder prediction, please refer to the DisProt web-site).

FlexPred is based on a supervised classification algorithm called Support Vector Machine (SVM). The SVM was trained to predict conformationally variable positions using experimental examples from the Database of Macromolecular Movements (MolMovDB), which contains examples of ordered conformational transitions obtained by comparing experimental atomic-level structures of the same protein solved under different conditions. All MolMovDB proteins were clustered at 20% pairwise sequence identity using the PISCES web-server. The final non-redundant training dataset, called FlexProt (available at http://lcg.rit.albany.edu/flexprot), contains 137 proteins.

A conformationally variable residue position is defined here as a position in which a significant change in at least one backbone dihedral angle, phi or psi, is observed when two alternative conformations of the same protein are compared. A ‘significant change’ means that the change in dihedral angle, dphi or dpsi, is above two standard deviations (Sdphi=31° and Sdpsi=34°). For more details please refer to the original publication (Kuznetsov, 2008).
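The definition of a conformationally variable position can be illustrated with a minimal Python sketch. It is not the FlexPred implementation. It assumes that the quoted values (31° and 34°) are one standard deviation each, so that the thresholds are 62° and 68°, and that angle differences are taken the short way around the circle; the function names are hypothetical.

```python
# Standard deviations quoted in the text, in degrees.
# Assumption: these are one-SD values, so "two standard deviations" is 2 * SD.
SD_PHI = 31.0
SD_PSI = 34.0


def angle_change(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180).

    Assumption: differences wrap around, so 170 and -170 differ by 20 degrees.
    """
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def is_variable(phi1, psi1, phi2, psi2, n_sd=2.0):
    """True if phi or psi changes by more than n_sd standard deviations."""
    dphi = angle_change(phi1, phi2)
    dpsi = angle_change(psi1, psi2)
    return dphi > n_sd * SD_PHI or dpsi > n_sd * SD_PSI


def label_positions(conf_a, conf_b):
    """Label each position 'F' (variable) or 'R' (rigid).

    conf_a and conf_b are equal-length lists of (phi, psi) pairs for two
    alternative conformations of the same protein.
    """
    return "".join(
        "F" if is_variable(p1, s1, p2, s2) else "R"
        for (p1, s1), (p2, s2) in zip(conf_a, conf_b)
    )
```

For example, a position whose psi angle moves from 130° to -50° would be labeled 'F' under these assumptions, while a position whose angles shift by only a few degrees would be labeled 'R'.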

FlexPred can predict conformationally variable positions in a given protein using either its amino acid sequence or a corresponding entry in the Protein Data Bank (PDB) file format. If a FASTA sequence is submitted, then the prediction is performed using sequence-derived information only. If a PDB file is submitted, then the prediction is performed using sequence-derived information and normalized solvent accessibility of residue positions. Solvent accessibility is calculated using the DSSP program (Kabsch and Sander, 1983).

Performance evaluation

The SVM classifiers implemented in FlexPred were tested using 10-fold cross-validation. In this approach, the FlexProt dataset is randomly partitioned into 10 groups of proteins, each containing roughly 10% of the dataset. At each cross-validation run, one group is removed, and a classifier is trained on the remaining proteins and then tested on the removed proteins. The process is repeated 10 times, so that each group is used for testing once. The following performance measures were used: true positive rate (TPr) and false positive rate (FPr), defined below. The relationship between FPr and TPr is reflected by the Receiver Operating Characteristic (ROC) curve (Figure 1).

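TPr = TP / (TP + FN)

FPr = FP / (FP + TN)

Here TP and FN are the numbers of flexible positions predicted as flexible and as rigid, respectively, and FP and TN are the numbers of rigid positions predicted as flexible and as rigid, respectively.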
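The evaluation procedure and the two rates above can be sketched in Python as follows. This is an illustration only, not the code used to evaluate FlexPred: the function names, the random seed, and the use of 'F' and 'R' label strings are assumptions.

```python
import random


def split_into_folds(protein_ids, n_folds=10, seed=0):
    """Randomly partition protein identifiers into n_folds groups.

    Whole proteins, not individual residues, are assigned to groups.
    """
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)
    # Striding gives groups whose sizes differ by at most one.
    return [ids[i::n_folds] for i in range(n_folds)]


def rates(true_labels, predicted_labels, positive="F"):
    """Return (TPr, FPr) for two equal-length label sequences."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

With 137 proteins, split_into_folds returns ten groups of 13 or 14 proteins each; in each run one group would serve as the test set and the other nine as the training set.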

1. Input options

For sequence-based prediction the user can either paste or upload an amino acid sequence in FASTA file format. For prediction that uses both protein sequence and structural information the user can either upload a PDB file or just supply a four-character PDB ID and let the web-server retrieve the corresponding PDB file automatically. A detailed description of the PDB file format is available from http://www.wwpdb.org/docs.html.

The following input options are available:

For options involving a PDB file the user will be asked to specify a subunit to be analyzed and whether or not the input PDB file contains multiple models (files with NMR structures usually contain multiple models). If a subunit is not specified, all subunits will be analyzed. If a file contains multiple models, all models will be analyzed. Note that in the case of PDB-based submissions protein sequences are read from the ATOM records. Thus, residues missing from the ATOM records but present in the SEQRES records are ignored.

NOTE: Because of a significantly better performance of structure-based prediction, we strongly recommend submitting PDB structures whenever possible!

FASTA sequence file format:

A FASTA file consists of a header line that begins with a ">" character, followed by an optional sequence name, and then the sequence itself on the following lines:

>Sequence name goes here
MALTNAQILAVIDSWEETVGQFPVITHHVPLGGGLQGTLHCYEIPLAAPYGVGFAKNGPT
RWQYKRTINQVVHRWGSHTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESLDDNK
GRNWCPGPNGGCVHAVVCLRQGPLYGPGATVAGPQQRGSHFVV

Sequence string:

The sequence string should be represented using the standard IUB/IUPAC one-letter amino acid codes, which include twenty characters for the twenty amino acid types:

  A  Alanine          M  Methionine
  C  Cysteine         N  Asparagine
  D  Aspartate        P  Proline
  E  Glutamate        Q  Glutamine
  F  Phenylalanine    R  Arginine
  G  Glycine          S  Serine
  H  Histidine        T  Threonine
  I  Isoleucine       V  Valine
  K  Lysine           W  Tryptophan
  L  Leucine          Y  Tyrosine 

and the following three characters:

  B  Aspartate or Asparagine 
  Z  Glutamate or Glutamine
  X  Unknown

These one-letter codes can be in either upper-case or lower-case. Note that unknown residues (X) and their sequence neighbors are excluded from the prediction. The maximum allowed sequence length is 1,000 residues.
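The input rules above can be checked before submission with a small Python sketch such as the one below. It is not part of FlexPred. The function names are hypothetical, and the number of neighbors excluded around each X (the flank parameter, set to 1 here) is an assumption, since the text does not state how many neighbors are excluded.

```python
# Twenty standard codes plus B, Z and X, as listed above.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "BZX")
MAX_LENGTH = 1000  # maximum allowed sequence length stated above


def parse_fasta(text):
    """Return (name, sequence) from a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA header line must begin with '>'")
    name = lines[0][1:].strip()
    # Codes may be upper-case or lower-case; normalize to upper-case.
    sequence = "".join(lines[1:]).upper()
    return name, sequence


def validate(sequence):
    """Raise ValueError if the sequence breaks the stated input rules."""
    bad = sorted(set(sequence) - ALLOWED)
    if bad:
        raise ValueError("unsupported characters: " + "".join(bad))
    if len(sequence) > MAX_LENGTH:
        raise ValueError("sequence is longer than %d residues" % MAX_LENGTH)


def excluded_positions(sequence, flank=1):
    """0-based indices of X residues and their neighbors.

    Assumption: 'flank' residues on each side of an X are excluded.
    """
    excluded = set()
    for i, residue in enumerate(sequence):
        if residue == "X":
            start = max(0, i - flank)
            stop = min(len(sequence), i + flank + 1)
            excluded.update(range(start, stop))
    return sorted(excluded)
```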

2. Encoding of amino acid sequence

In order to use the SVM algorithm to predict conformationally variable positions, each residue position in the input protein sequence must be converted (encoded) into a suitable numeric representation. Two types of encoding of the input sequence are used in FlexPred: PSSM-based encoding and binary encoding.

PSSM-based encoding

PSSM-based encoding uses the profile of evolutionary conservation of residue positions generated by the PSI-BLAST program (Altschul et al., 1997). For a sequence of length N residues, the PSSM is represented by an Nx20 matrix. Each PSSM element m[i,j] is scaled between 0 and 1 using the logistic function to obtain a normalized value. A sequence position i in the protein sequence is encoded using row i of this normalized matrix. For sequence-based submissions the PSSM encoding usually gives significantly more accurate prediction than the binary encoding (see Figure 1 and Table 1). However, the PSSM encoding is significantly slower because PSI-BLAST may take several minutes to finish.
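The scaling step can be sketched in Python as shown below. The sketch assumes the standard logistic function 1/(1 + exp(-x)) with no additional scaling constant; the exact parameterization used by FlexPred is not stated here, and the function names are hypothetical.

```python
import math


def logistic(x):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize_pssm(pssm):
    """Scale every element of an N x 20 PSSM (list of rows) into (0, 1)."""
    return [[logistic(score) for score in row] for row in pssm]


def encode_position(normalized_pssm, i):
    """Feature vector for sequence position i: row i of the scaled matrix."""
    return normalized_pssm[i]
```

Under this assumption a raw score of 0 maps to 0.5, positive scores (favored substitutions) map above 0.5, and negative scores map below 0.5.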

Binary encoding

In this method, the 20 amino acids are represented as the following 20 mutually orthogonal binary vectors:

   Ala=(1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
   Cys=(0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 
   Asp=(0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 
   Glu=(0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 
   Phe=(0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 
   Gly=(0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0) 
   His=(0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0)
   Lys=(0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0) 
   Leu=(0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0) 
   Met=(0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0) 
   Asn=(0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0)
   Pro=(0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0) 
   Arg=(0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0) 
   Ser=(0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0) 
   Thr=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0) 
   Tyr=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0) 
   Trp=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0) 
   Ile=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0) 
   Gln=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0) 
   Val=(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1)
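The table above can be reproduced with a few lines of Python. The residue order in the sketch follows the order of the vectors listed above (note that it is not alphabetical by one-letter code); the function name is hypothetical, and rejecting B, Z and X is an assumption, since the table defines vectors only for the twenty standard residues.

```python
# One-letter codes in the same order as the vectors listed above:
# Ala, Cys, Asp, Glu, Phe, Gly, His, Lys, Leu, Met,
# Asn, Pro, Arg, Ser, Thr, Tyr, Trp, Ile, Gln, Val.
RESIDUE_ORDER = "ACDEFGHKLMNPRSTYWIQV"


def binary_encode(residue):
    """Return the 20-element binary vector for a one-letter residue code."""
    code = residue.upper()
    if len(code) != 1 or code not in RESIDUE_ORDER:
        # Assumption: ambiguous or unknown codes (B, Z, X) are rejected here.
        raise ValueError("no binary vector defined for %r" % residue)
    vector = [0] * len(RESIDUE_ORDER)
    vector[RESIDUE_ORDER.index(code)] = 1
    return vector
```

Because each vector has a single 1 in a distinct position, the dot product of the vectors of any two different residues is 0, which is what makes them mutually orthogonal.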

3. False Positive Rate (FPr)

The false positive rate (FPr) gives the fraction of rigid positions incorrectly predicted as flexible, whereas the true positive rate (TPr) gives the fraction of flexible positions correctly identified as flexible. The relationship between FPr and TPr is reflected by the Receiver Operating Characteristic (ROC) curve (Figure 1). The user can choose an FPr of 5%, 10%, 15%, or 20%. Since most statistical tests consider a 5% chance of a false positive prediction to be an acceptable level, the FPr of 5% is selected by default.

IMPORTANT NOTE: For any prediction method, decreasing FPr also decreases TPr, and vice versa. This means that at low FPr thresholds many true flexible positions will be missed and predicted as rigid. Refer to Table 1 for details on the average percentage of true positives (TPr) recovered at each FPr threshold. We recommend the following strategy:

1) Run predictions for the same protein using different levels of FPr (from 5% to 20%).
2) Compare these predictions, checking if flexible positions predicted using high FPr levels tend to appear near flexible positions predicted using low FPr levels.

If flexible positions predicted for different levels of FPr do tend to cluster, this indicates possible conformationally flexible segments.
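The comparison in step 2 can be made more systematic with a small Python sketch like the one below. It is only one possible way to quantify clustering: the function names and the window of 3 residues are assumptions, not values recommended by FlexPred.

```python
def nearby_extra_positions(low_fpr, high_fpr, window=3):
    """Positions predicted flexible only at the higher FPr level that lie
    within 'window' residues of a position predicted at the lower level.

    low_fpr and high_fpr are collections of residue numbers labeled 'F'.
    """
    low = set(low_fpr)
    extra = sorted(set(high_fpr) - low)
    return [p for p in extra if any(abs(p - q) <= window for q in low)]


def clustering_fraction(low_fpr, high_fpr, window=3):
    """Fraction of the extra high-FPr positions that fall near a
    low-FPr position; 0.0 if the higher level adds no positions."""
    extra = set(high_fpr) - set(low_fpr)
    if not extra:
        return 0.0
    near = nearby_extra_positions(low_fpr, high_fpr, window)
    return len(near) / len(extra)
```

A fraction close to 1 would suggest that the positions added at the higher FPr level extend the segments already found at the lower level, while a fraction close to 0 would suggest that they are scattered elsewhere in the sequence.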

 

Figure 1. The Receiver Operating Characteristic (ROC) curves of the predictors.

 

Table 1. The true positive rate (TPr) of FlexPred at 5%, 10%, 15%, and 20% false positive rate (FPr).

  Encoding of sequence positions    TPr at 5% FPr   TPr at 10% FPr   TPr at 15% FPr   TPr at 20% FPr
  Binary (sequence only)            11.9%           22%              29.4%            37.4%
  PSSM (sequence only)              15.2%           26.1%            34.5%            42.3%
  Binary + solvent accessibility    20.5%           32%              40.8%            48.5%
  PSSM + solvent accessibility      18.6%           29.9%            40.2%            48.2%

4. Retrieval of the results

We recommend that users enter their e-mail address so that the web-server can automatically e-mail the result when the prediction finishes. Alternatively, users may select the second option to get a temporary URL link to their submission, bookmark this link, and check the prediction result manually later. The results of prediction are kept on the web-server for 7 days from the moment of submission and are deleted afterwards.

Expected wait time

Since the web-server executes only one process at a given moment, all submissions are placed in a job queue. Therefore, the wait time for a given submission depends on the number of jobs submitted earlier and still waiting in the job queue. A large number of jobs in the queue will result in a long wait time. If there are no prior submissions in the job queue, the estimated wait time is as follows:

If you do not receive results within 24 hours, please contact Igor Kuznetsov.

Output format

Figure 2 shows a sample prediction result. It consists of a header that describes the output format itself, the selected encoding type, the selected false positive rate (FPr), the submitted sequence in FASTA format, and the predicted labels for each sequence position. Labels are 'R' (rigid) and 'F' (conformationally flexible). Column S_PRB shows the probability of each position being conformationally flexible (label 'F'). These probabilities are in the range from 0.0 to 1.0. The higher the probability, the greater the confidence that the position is flexible. The probability of label 'R' can be obtained by subtracting the corresponding probability of label 'F' from 1.

Figure 2. A sample output page.

References

Web-server citation:

I.B. Kuznetsov and M. McDuffie, 2008, FlexPred: a web-server for predicting residue positions involved in conformational switches in proteins. Bioinformation, 3(3):134-136

Methodology citation:

I.B. Kuznetsov, 2008, Ordered conformational change in the protein backbone: prediction of conformationally variable positions from sequence and low-resolution structural data. Proteins: Structure, Function and Bioinformatics, 72(1):74-87

Acknowledgement

This work was supported by grant number R03LM009034 from the National Library of Medicine of the National Institutes of Health.

Please address your questions and comments to Igor Kuznetsov