
PP Help 09: Hints


  1. Automatic parsing of PredictProtein output
  2. Note
  3. What to expect from predictions?
  4. In a nutshell: how to avoid pitfalls?
  5. Nuts and bolts: what to keep in mind?
    1. Information content in multiple sequence alignment
    2. Cut-off for including homologues in alignment
    3. Quality of multiple sequence alignment
    4. Minimal length of sequences
    5. Insertions in multiple sequence alignment (avoid this!)
    6. 'Untypical' proteins
    7. Prediction of transmembrane helices and topology
    8. Reliability indices for PHD predictions
    9. Combining results with those of other methods
    10. Homologue of known structure
    11. Prediction-based threading
  6. Paper describing pitfalls in sequence analysis:



The following notes result from the experience I have gathered by offering and running the PredictProtein service, and during various structure prediction workshops. The comments are tailored in particular to the PHD methods; however, most of them also hold for other secondary structure prediction methods.

What can you expect from secondary structure prediction?

How accurate are the predictions? The expected levels of accuracy (PHDsec = 72±11%, three-state per-residue accuracy; PHDacc = 75±7%, two-state per-residue accuracy; PHDhtm = 94±6%, two-state per-residue accuracy) are valid for typical globular, water-soluble proteins (PHDsec, PHDacc) or helical transmembrane proteins (PHDhtm) when the multiple alignment contains many diverse sequences. High values for the reliability indices indicate more accurate predictions. (Note: for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values.) PHDsec predictions tend to be relatively accurate for porins; however, for helical membrane proteins other programs ought to be used.

Confusion between strand and helix? PHD (as well as other methods) focuses on predicting hydrogen bonds. Consequently, helices predicted strongly (high reliability index) are occasionally observed as strands, and vice versa (expected accuracy of PHDsec).

Strong signal from secondary structure caps? The ends of helices and strands contain a strong signal. However, on average PHD predicts the core of helices and strands more accurately than the caps (B. Rost and C. Sander, 1D secondary structure prediction through evolutionary profiles, in: H. Bohr and S. Brunak (eds.), Protein Structure by Distance Analysis, Amsterdam: IOS Press, 257-276 (1994)). This seems to also hold for other methods (Garnier, priv. comm.).

Are internal helices predicted poorly? Steven Benner has indicated that internal buried helices are particularly difficult to predict. On average, this is not the case for PHD predictions (expected accuracy of PHDsec for buried helices).

Accessibility useful to provide upper limits for contacts? The predicted solvent accessibility (PHDacc) can be translated into a prediction of the number of water molecules around a given residue. Consequently, PHDacc can be used to derive upper and lower limits for the number of inter-residue contacts of a certain residue (such an estimate could improve predictions of inter-residue contacts).

How to predict porins? PHDhtm predicts only transmembrane helices, and PHDsec has been trained on globular, water-soluble proteins. How to predict 1D structure for porins then? As porins are partly accessible to solvent, prediction accuracy of PHDsec was relatively high (70%) for the known structures. Thus, PHDsec appears to be applicable.

How to use the prediction of transmembrane helices? One possible application of PHDhtm is to scan, e.g., entire chromosomes for possible transmembrane proteins. Classifying a protein as a transmembrane protein does not by itself reveal its function, but it may shed some light on the puzzle of genome analysis. When using PHDhtm for this purpose, the user should keep in mind that on average about 5% of the globular proteins are falsely predicted to have transmembrane helices.
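
The effect of the 5% false-positive rate on a large scan can be estimated with simple arithmetic. The sketch below is only an illustration: the function name and the number of globular proteins are hypothetical, and only the 5% rate comes from the text above.

```python
def expected_false_positives(n_globular: int, false_positive_rate: float = 0.05) -> float:
    """Expected number of globular proteins wrongly flagged as transmembrane.

    false_positive_rate defaults to the ~5% quoted above for PHDhtm;
    n_globular is whatever the user believes the scanned set contains.
    """
    return n_globular * false_positive_rate


# Hypothetical scan: 4000 globular proteins among the scanned sequences.
n_globular = 4000
print(f"expected false positives: {expected_false_positives(n_globular):.0f}")
```

A list of predicted transmembrane proteins from such a scan should therefore be expected to contain a sizeable number of globular proteins.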

What about protein design and synthesised peptides? The PHD networks are trained on naturally evolved proteins. However, the predictions have proven to be useful in some cases to investigate the influence of single mutations (e.g. for Chameleon, or for Janus; Rost, unpublished). For short polypeptides, the following should be taken into account: the network input consists of 17 adjacent residues; thus, predictions for shorter sequences may be dominated by the ends (which are treated as solvent).

In a nutshell: how to avoid pitfalls?

70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.

Spread of prediction accuracy. An expected accuracy of 70% does NOT imply that for your protein U, 70% of all residues are correctly predicted. Instead, values published for prediction accuracy are averaged over hundreds of unique proteins. An expected accuracy of 70±10% (one standard deviation) implies that for about two thirds of all proteins, between 60% and 80% of the residues will be predicted correctly (expected accuracy of PHDsec). Thus, prediction accuracy can be higher than 80% or lower than 60% for your protein. Few methods supply well-tested indices for the reliability of predictions. Such indices can help to reduce or increase your trust in a particular prediction.
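
The "two thirds" figure can be checked with a short calculation. This is a minimal sketch that assumes per-protein accuracies follow a normal distribution, which is an assumption made here for illustration and not a statement about the measured distribution; the function name is hypothetical.

```python
from statistics import NormalDist


def fraction_within(mean: float, sd: float, low: float, high: float) -> float:
    """Fraction of proteins whose accuracy lies in [low, high],
    ASSUMING per-protein accuracy is normally distributed."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)


inside = fraction_within(70.0, 10.0, 60.0, 80.0)
below = NormalDist(mu=70.0, sigma=10.0).cdf(60.0)
print(f"between 60% and 80%: {inside:.2f}")  # roughly two thirds
print(f"below 60%:           {below:.2f}")
```

Under this assumption, roughly one protein in six would be predicted at below 60% accuracy, which is why per-residue reliability indices matter for an individual protein.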

Special classes of proteins. Prediction methods are usually derived from knowledge contained in subsets of proteins from databases. Consequently, they should not be applied to classes of proteins which have not been included in the subsets. For example, methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.

Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment to expect an improvement, and how sensitive are prediction methods to errors in the alignment? The more divergent sequences the alignment contains, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the method, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.

Better + worse = even better? Today, several automatic services provide secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider regions on which the methods agree as more reliable. In exceptional cases, such a majority vote may be beneficial. Frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use the reliability indices provided by some methods. Such indices answer the question: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy has been sufficiently tested only for a few methods.)

1D structure may or may not be sufficient to infer 3D structure. Say you obtain the following prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but the given motif can be realised by completely different structures. For example, the secondary structure motif 'H-E-E-H-E-E' is contained in at least 16 structurally unrelated proteins.
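
To compare motifs in this way, a per-residue prediction first has to be collapsed into a string of segments. The sketch below shows one way to do that; the function name, the example string, and the use of 'L' for loop residues are assumptions made for illustration and do not describe the PredictProtein output format.

```python
from itertools import groupby


def segment_motif(per_residue: str, keep: str = "HE") -> str:
    """Collapse a per-residue string such as 'HHHLLEEE' into 'H-E'.

    Consecutive identical states form one segment; states not listed
    in `keep` (here: loop, written 'L') are dropped from the motif.
    """
    segments = [state for state, _ in groupby(per_residue) if state in keep]
    return "-".join(segments)


# Hypothetical per-residue prediction (H = helix, E = strand, L = loop).
prediction = "LHHHHHHLLEEEELLEEEEELLHHHHHHHLEEEELLEEEEL"
print(segment_motif(prediction))  # H-E-E-H-E-E
```

Two proteins yielding the same collapsed motif share only the order of their segments; as noted above, this alone does not establish a common fold.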

Nuts and bolts: what to keep in mind?

Information content in multiple sequence alignment

If the multiple sequence alignment contains only a few proteins that are very similar to the one you sent (pairwise sequence identity > 90%), the expected accuracy for 1D structure predictions (secondary structure, accessibility, transmembrane helices) drops significantly. Note: this implies a reduction of the expected accuracy for threading. The scores for expected accuracy (PHDsec, PHDacc, PHDhtm) are valid for typical alignments such as those found in the HSSP database. The information content of the alignment is difficult to measure. Two important parameters are:


Cut-off for including homologues in alignment

In the multiple sequence alignment returned to you, only homologues down to levels of 30% pairwise sequence identity over 80 or more residues are included. This cut-off is five percentage points above the threshold for structural homology (Sander & Schneider, 1990), in an attempt to stay clear of the twilight zone of sequence similarity and to provide high-quality multiple alignments in an automated fashion.
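
The cut-off described above can be pictured as a simple filter over candidate homologues. The following is a minimal sketch, not the server's implementation: the `Hit` record, its field names, and the example hits are hypothetical, and the real threshold may depend on alignment length in ways not captured here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str            # hypothetical identifier of the homologue
    identity: float      # percent pairwise sequence identity
    aligned_length: int  # number of aligned residues


def passes_cutoff(hit: Hit, min_identity: float = 30.0, min_length: int = 80) -> bool:
    """True if the hit meets the 30% identity over >= 80 residues rule."""
    return hit.identity >= min_identity and hit.aligned_length >= min_length


hits = [
    Hit("homologue_a", 45.0, 120),  # kept
    Hit("homologue_b", 27.0, 150),  # dropped: identity too low
    Hit("homologue_c", 60.0, 40),   # dropped: aligned region too short
]
kept = [h.name for h in hits if passes_cutoff(h)]
print(kept)  # ['homologue_a']
```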

Quality of multiple sequence alignment

On average, more residues are falsely aligned at lower levels of pairwise sequence identity. Down to levels of about 30%, the automatic MaxHom alignments are usually quite accurate. However, for many families there are regions for which the 'correct' alignment is, in principle, not well defined. One way to spot such regions is to check how stable the alignment is when some of the aligned sequences are included or excluded. By providing different lists of sequences (input option 'PIR list') you can monitor the stability of the alignment. Such regions often form surface loops. Predictions may be less accurate in such regions.

Minimal length of sequences

The PHD programs treat N- and C-terminal ends of proteins as solvent molecules. The size of the input window for predicting 1D structure is up to 17 residues. Thus, the first and the last 17 residues of your sequence will 'see solvent'. Especially for short fragments that you cut out of larger proteins, this may result in false predictions.
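
The effect of sequence length can be made concrete by counting how many window positions fall outside the sequence. The sketch below assumes a single window of 17 residues centred on each residue, which is a simplification; the PHD programs may combine windows differently, so the number of affected terminal residues in practice can differ from what this toy model gives. All function names are hypothetical.

```python
def solvent_positions(seq_len: int, index: int, window: int = 17) -> int:
    """Number of window positions lying beyond the sequence ends
    for a window centred on residue `index` (0-based)."""
    half = window // 2
    before = max(0, half - index)                 # positions before the N-terminus
    after = max(0, index + half - (seq_len - 1))  # positions after the C-terminus
    return before + after


def mean_solvent_fraction(seq_len: int, window: int = 17) -> float:
    """Average fraction of the input window filled with 'solvent'."""
    total = sum(solvent_positions(seq_len, i, window) for i in range(seq_len))
    return total / (seq_len * window)


for length in (10, 30, 100):
    print(length, f"{mean_solvent_fraction(length):.2f}")
```

In this simplified model a 10-residue peptide has over 40% of its input filled with solvent, whereas for a 100-residue sequence the share is only a few percent.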

Insertions in multiple sequence alignment

'Untypical' proteins

Prediction of transmembrane helices (HTMs) and topology

Reliability indices for PHD predictions

The reliability indices of the PHD methods correlate well with prediction accuracy. In other words, residues predicted with high reliability (0 = low, 9 = high) are more likely to be predicted correctly. However, when basing the prediction on single sequences (rather than multiple alignments) the scale has to be shifted. For instance, values of RI > 4 usually imply an expected accuracy of > 80% for PHDsec. When using a single sequence as input the same level of accuracy is reached only for residues predicted at RI > 7.
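
The shifted scale can be applied when selecting the residues to trust. This is a minimal sketch that assumes the reliability indices are available as a string with one digit (0 to 9) per residue; that representation and the function name are assumptions for illustration, while the thresholds RI > 4 and RI > 7 are those quoted above for PHDsec.

```python
from typing import List


def reliable_positions(ri: str, single_sequence: bool = False) -> List[int]:
    """Return 0-based positions whose reliability index exceeds the
    threshold associated above with > 80% expected accuracy for PHDsec.

    Threshold: RI > 4 for alignment-based input, RI > 7 for a single sequence.
    """
    threshold = 7 if single_sequence else 4
    return [i for i, digit in enumerate(ri) if int(digit) > threshold]


ri_line = "0459871"  # hypothetical per-residue reliability indices
print(reliable_positions(ri_line))                        # [2, 3, 4, 5]
print(reliable_positions(ri_line, single_sequence=True))  # [3, 4]
```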

Combination of results with those of other methods

A combination of two prediction methods is likely to improve the accuracy only if the following points are met: Say you want to focus on the most likely secondary structure segments. You may hope that the best predicted segments are those for which method X and PHDsec agree. This may or may not be correct. It may be more reasonable to identify such regions based on the reliability index provided by PHD. The PHD methods have been tailored to provide a reasonable estimate of the reliability of the prediction, whereas combining two arbitrary prediction methods yields improvements, at best, at random.

Homologue of known structure

Ab initio prediction (e.g. by PHD) is, in general, less accurate than homology modelling. Thus, if we find a protein of known structure that has > 25% pairwise sequence identity to your sequence, you ought to make use of the known structure by homology modelling.

Prediction-based threading
