The following notes result from the experience I have gathered by offering and running the PredictProtein service, and during various structure prediction workshops. The comments are tailored in particular to the PHD methods; however, most of them also hold for other secondary structure prediction methods.
How accurate are the predictions ?
The expected levels of accuracy (PHDsec = 72±11% (three state per-residue accuracy); PHDacc = 75±7% (two-state per-residue accuracy); PHDhtm = 94±6% (two-state per-residue accuracy)) are valid for typical globular, water-soluble (PHDsec, PHDacc), or helical transmembrane proteins (PHDhtm) when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions. (Note: for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values.) PHDsec predictions tend to be relatively accurate for porins; however, for helical membrane proteins other programs ought to be used.
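The per-residue accuracy quoted above is the percentage of residues whose predicted state matches the observed one. A minimal sketch of that measure, assuming prediction and observation are given as equal-length strings over H (helix), E (strand) and L (other); the function name and the state alphabet are illustrative assumptions, not part of PHD:

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions at which the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not predicted:
        raise ValueError("empty input")
    # Count positions where both strings carry the same state letter.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)
```

The same function applies to two-state strings (e.g. buried/exposed for accessibility), since it only compares letters position by position.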
Confusion between strand and helix?
PHD (as well as other methods) focuses on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (expected accuracy of PHDsec).
Strong signal from secondary structure caps?
The ends of helices and strands contain a strong signal. However, on average PHD predicts the core of helices and strands more accurately than the caps (B. Rost and C. Sander, 1D secondary structure prediction through evolutionary profiles, in: H. Bohr and S. Brunak (eds.), Protein Structure by Distance Analysis, Amsterdam: IOS Press, 257-276 (1994)). This seems to also hold for other methods (Garnier, priv. comm.).
Are internal helices predicted poorly?
Steven Benner has indicated that internal buried helices are particularly difficult to predict. On average, this is not the case for PHD predictions (expected accuracy of PHDsec for buried helices).
Accessibility useful to provide upper limits for contacts?
The predicted solvent accessibility (PHDacc) can be translated into a prediction of the number of water molecules around a given residue. Consequently, PHDacc can be used to derive upper and lower limits for the number of inter-residue contacts of a given residue (such an estimate could improve predictions of inter-residue contacts).
How to predict porins?
PHDhtm predicts only transmembrane helices, and PHDsec has been trained on globular, water-soluble proteins. How to predict 1D structure for porins then? As porins are partly accessible to solvent, prediction accuracy of PHDsec was relatively high (70%) for the known structures. Thus, PHDsec appears to be applicable.
How to use the prediction of transmembrane helices?
One possible application of PHDhtm is to scan, e.g., entire chromosomes for possible transmembrane proteins. Classifying a protein as a transmembrane protein is not sufficient to infer its function, but it may shed some light on the puzzle of genome analysis. When using PHDhtm for this purpose, the user should keep in mind that on average about 5% of the globular proteins are falsely predicted to have transmembrane helices.
What about protein design and synthesised peptides?
The PHD networks are trained on naturally evolved proteins. However, the predictions have proven useful in some cases to investigate the influence of single mutations (e.g. for Chameleon, or for Janus; Rost, unpublished). For short polypeptides, the following should be taken into account: the network input consists of 17 adjacent residues; thus, predictions for shorter sequences may be dominated by the ends (which are treated as solvent).
70% correct implies 30% incorrect.
The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.
Spread of prediction accuracy.
An expected accuracy of 70% does NOT imply that, for your protein U, 70% of all residues are correctly predicted. Instead, values published for prediction accuracy are averaged over hundreds of unique proteins. An expected accuracy of 70±10% (one standard deviation) implies that, on average, for two thirds of all proteins between 60 and 80% of the residues will be predicted correctly (expected accuracy of PHDsec). Thus, prediction accuracy can be higher than 80% or lower than 60% for your protein. Few methods supply well-tested indices for the reliability of predictions. Such indices can help to reduce or increase your trust in a particular prediction.
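The "two thirds" figure follows if per-protein accuracy is roughly normally distributed, which is an assumption made here for illustration (the real distribution need not be exactly normal). Under that assumption, the share of proteins falling inside and outside the 60-80% band can be sketched as:

```python
import math

def fraction_within(mean: float, sd: float, low: float, high: float) -> float:
    """Fraction of a normal distribution with the given mean and sd lying in [low, high]."""
    def cdf(x: float) -> float:
        # Cumulative distribution function of the normal distribution.
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
    return cdf(high) - cdf(low)

inside = fraction_within(70.0, 10.0, 60.0, 80.0)  # roughly two thirds of all proteins
outside = 1.0 - inside                            # proteins predicted below 60% or above 80%
```

In other words, about one protein in three is expected to fall outside the one-standard-deviation band, in either direction.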
Special classes of proteins.
Prediction methods are usually derived from knowledge contained in subsets of proteins from databases. Consequently, they should not be applied to classes of proteins which have not been included in the subsets. For example, methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.
Better alignments yield better predictions.
Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment to expect an improvement; and how sensitive are prediction methods with respect to errors in the alignment? The more divergent sequences contained in the alignment, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the methods, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.
Better + worse = even better?
Today, several automatic services provide secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider regions predicted identically as more reliable. Exceptionally, such a majority vote may be beneficial. Frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use the reliability indices provided by some methods. Such indices answer the question: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy has been sufficiently tested for only a few methods.)
1D structure may or may not be sufficient to infer 3D structure.
Say you obtain as prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but there are various ways to realise the given motif by completely different structures. For example, the secondary structure motif 'H-E-E-H-E-E' is contained in at least 16 structurally unrelated proteins.
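A segment motif such as H-E-E-H-E-E can be derived from a per-residue prediction by collapsing runs of identical states. A minimal sketch, assuming a string over H, E and L in which L marks non-regular structure (the function name and the alphabet are illustrative assumptions):

```python
from itertools import groupby

def segment_motif(states: str, loop: str = "L") -> str:
    """Collapse a per-residue string such as 'LLHHHHLLEEEE' into the motif 'H-E'."""
    # groupby yields one entry per run of identical letters; loop runs are dropped.
    segments = [state for state, _ in groupby(states) if state != loop]
    return "-".join(segments)
```

Note that two strands that follow each other without any loop residue in between collapse into a single E in this sketch. Comparing two such motifs tells you only that the order of segments agrees, not that the folds do.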
If the multiple sequence alignment contains only a few proteins very similar to the one you sent (pairwise sequence identity > 90%), the expected accuracy for 1D structure predictions (secondary structure, accessibility, transmembrane helices) drops significantly. Note: this implies a reduction of the expected accuracy for threading. The scores for expected accuracy (PHDsec, PHDacc, PHDhtm) are valid for typical alignments as found in the HSSP database. The information content of the alignment is difficult to measure. Two important parameters are:
- (1) Number of aligned sequences: the more sequences in the alignment, the better. The exact number of sequences needed for a 'good prediction' cannot be given, as it depends on the variation and on characteristics of the particular protein family. As a rule of thumb: one is clearly NOT sufficient, more than five sequences can be enough.
- (2) Variation of aligned sequences: the aligned sequences should have a considerable variation with respect to the guide sequence (your protein). Ideally, the alignment should contain sequences at levels of 80%, 60%, 50%, 40%, and about 30% pairwise sequence identity (with respect to the predicted protein). In general, more diverged sequences (30-40%) contribute more to the information content than do very similar ones (> 80%). Note: the levels of sequence identity are summarised in the alignment header of the output returned (example).
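The two parameters above can be checked quickly from the identity levels listed in the alignment header. A minimal sketch, assuming you have already extracted the percentage pairwise sequence identity of every aligned sequence to the guide sequence; the function name and the returned keys are illustrative assumptions, and the cut-offs are the ones quoted in the text:

```python
from typing import Dict, Sequence, Union

def summarise_alignment(identities: Sequence[float]) -> Dict[str, Union[int, bool]]:
    """Summarise number and variation of aligned sequences (identities in percent)."""
    n = len(identities)
    return {
        "n_sequences": n,
        # Diverged sequences (30-40%) contribute most to the information content.
        "diverged_30_40": sum(1 for pid in identities if 30 <= pid <= 40),
        # Very similar sequences (> 80%) contribute little.
        "very_similar_gt_80": sum(1 for pid in identities if pid > 80),
        # Rule of thumb: one sequence is not sufficient, more than five can be enough.
        "more_than_five": n > 5,
        # Only near-identical sequences (> 90%): expected accuracy drops.
        "only_near_identical": n > 0 and all(pid > 90 for pid in identities),
    }
```

For example, `summarise_alignment([95, 82, 61, 48, 38, 33])` reports six sequences, two of them in the diverged 30-40% range.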
In the multiple sequence alignment returned to you, only homologues down to levels of 30% pairwise sequence identity over 80 or more residues are included. This cut-off is five percentage points above the threshold for structural homology (Sander & Schneider, 1990), in an attempt to stay clearly off the twilight zone of sequence similarity, and provide high-quality multiple alignments in an automated fashion.
- (A) Alignment errors for distant homologues:
More distantly related sequences contribute more to the alignment diversity, which is the basis for improved prediction accuracy. However, the more distant relatives are difficult to align (in fact, below levels of some 40% sequence identity some alignment errors are guaranteed). Furthermore, even the correct detection of more distant relatives becomes very difficult below levels of about 35% sequence identity.
- (B) Bias by identical sequences:
Growing databases result in an explosion of highly redundant information. This has recently (1996-7) led to the situation where the previous rule 'the more sequences, the better' is not applicable anymore. Instead, you should leave out some (or all) family members in the high homology (>70%) region, in particular when not many rather diverged sequences are present. Furthermore, the current version of PHD does not handle redundant information, i.e., when you have two proteins A and B of say 40% sequence identity to your query, and when A and B are highly similar (>90% sequence identity to one another), you should leave out one of the two from the alignment you use for the prediction!
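Leaving out one of two highly similar relatives can be done with a simple greedy pass over the aligned sequences. This is a sketch of one possible procedure, not what PHD or MaxHom does internally; the function name, and the `identity` callback that returns the percentage identity between two named sequences, are assumptions you would have to supply:

```python
from typing import Callable, List

def drop_redundant(names: List[str],
                   identity: Callable[[str, str], float],
                   max_identity: float = 90.0) -> List[str]:
    """Keep a sequence only if it is at most max_identity percent identical
    to every sequence that has already been kept."""
    kept: List[str] = []
    for name in names:
        if all(identity(name, other) <= max_identity for other in kept):
            kept.append(name)
    return kept
```

The result depends on the input order: sequences listed first win over later near-duplicates, so it may help to list the relatives you trust most at the top.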
On average, more residues are falsely aligned for lower levels of pairwise sequence identity. Down to levels of about 30%, the automatic MaxHom alignments are usually quite accurate. However, for many families there are regions for which the 'correct' alignment is, in principle, not well defined. One way to spot such regions is the stability of the alignment with respect to including or excluding some of the aligned sequences. By providing different lists of sequences ("input option 'PIR list'") you can monitor the stability of the alignment. Often such regions may form surface loops. Predictions may be less accurate in such regions.
The PHD programs treat the N- and C-terminal ends of proteins as if they were followed by solvent molecules. The size of the input window for predicting 1D structure is up to 17 residues. Thus, the first and the last 17 residues of your sequence will 'see solvent'. Especially for short fragments that you have cut out of large proteins, this may result in false predictions.
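The effect on short fragments can be illustrated by counting, for every residue, how many window positions fall outside the sequence. This is only a sketch, assuming a single 17-residue window centred on the residue of interest; the actual PHD architecture may reach further than that, and the function name is an illustrative assumption:

```python
from typing import List

def window_padding(seq_len: int, window: int = 17) -> List[int]:
    """For each residue, the number of window positions lying beyond the sequence ends."""
    half = window // 2
    counts = []
    for i in range(seq_len):
        left = max(0, half - i)                    # positions before the N-terminus
        right = max(0, i + half - (seq_len - 1))   # positions beyond the C-terminus
        counts.append(left + right)
    return counts
```

For a 10-residue fragment every residue has at least seven window positions filled with 'solvent' under this assumption, which is why predictions for short peptides may be dominated by the ends.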
- Insertions in guide sequence:
Do NOT use insertions for the guide sequence when you supply your alignment to be used as input for the predictions ("input option 'MSF format'"). In the current implementation, PHD will treat such insertions as if the corresponding positions were occupied by solvent. This may lead to particular prediction errors (example)!
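One way to prepare such an alignment is to remove every column in which the guide sequence carries a gap character before submitting. The sketch below reads 'insertions' as gap characters placed in the guide sequence, which is an interpretation of the text; it assumes the alignment is a list of equal-length strings with the guide sequence first and '.' or '-' as gap characters (function name and representation are illustrative assumptions, not an MSF parser):

```python
from typing import List

def drop_guide_gap_columns(alignment: List[str], gap_chars: str = ".-") -> List[str]:
    """Remove all alignment columns in which the guide sequence (row 0) has a gap."""
    guide = alignment[0]
    if any(len(row) != len(guide) for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of columns where the guide sequence carries a residue.
    keep = [i for i, ch in enumerate(guide) if ch not in gap_chars]
    return ["".join(row[i] for i in keep) for row in alignment]
```

Residues of the other sequences that sit in the removed columns are lost, so the result describes the family only along the guide sequence.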
- Split alignment into domains:
If your alignment (of say 20 sequences) contains long (> 10 residues) regions for which only very few sequences do not have insertions (in positions R1-R2), split the alignment into fragments that are not full of insertions for all sequences. For the problematic region (R1-R2) it may be better to include only those sequences without insertions. The existence of such regions may indicate that the protein contains various domains (one for residues < R1, another for residues > R2). When you submit your alignment in fragments, mind the minimal length of sequences (see above).
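Candidate regions R1-R2 can be located by scanning the alignment for long runs of columns in which only a few sequences carry a residue. A minimal sketch under stated assumptions: the alignment is a non-empty list of equal-length strings, '.' and '-' mark gaps, and 'very few' is taken as at most 20% of the sequences; that fraction and the function name are illustrative choices, not values from PHD:

```python
from typing import List, Tuple

def gappy_regions(alignment: List[str], min_length: int = 11,
                  max_occupied_fraction: float = 0.2,
                  gap_chars: str = ".-") -> List[Tuple[int, int]]:
    """Column ranges (0-based, end exclusive) of more than 10 columns in which
    at most max_occupied_fraction of the sequences carry a residue."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    regions: List[Tuple[int, int]] = []
    start = None
    # One extra iteration past the last column closes a region that runs to the end.
    for col in range(n_cols + 1):
        sparse = False
        if col < n_cols:
            occupied = sum(1 for row in alignment if row[col] not in gap_chars)
            sparse = occupied / n_rows <= max_occupied_fraction
        if sparse and start is None:
            start = col
        elif not sparse and start is not None:
            if col - start >= min_length:
                regions.append((start, col))
            start = None
    return regions
```

Each returned range is a candidate boundary at which to split the alignment into fragments, or a region for which to keep only the sequences without insertions.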
- Globular, water-soluble proteins.
The PHD neural networks have been trained on proteins with typical features as contained in the database of known protein structures (PDB). Thus, accuracy may be lower if the methods are applied to other proteins. For instance, PHDsec (secondary structure) correctly predicts only about 50% of the residues in transmembrane helices of integral membrane proteins. However, the network system trained on transmembrane proteins (PHDhtm) predicts residues in transmembrane helices on average at a level of well above 90% accuracy. In general, the PHD methods learn to extract characteristic features of currently known protein structures. Problematic cases are proteins with many cysteine bridges that stabilise the particular protein structure, or proteins for which the structure is stabilised by functional constraints (co-factors).
- Transmembrane proteins.
PHDhtm for globular proteins. The rate of false positives, i.e., of globular, water-soluble proteins for which PHDhtm predicts transmembrane helices, is in the order of 5%. Such false positive predictions occur more often for structures with very hydrophobic beta-strands. Consequently, a prediction of transmembrane helices for a globular protein may indicate the existence of very hydrophobic beta-strands.
PHD for porin-like beta structures. For the beta-strand transmembrane protein, porin, the accuracy of PHDsec was below the expected average (60%), but it was higher than the average for helical transmembrane proteins (50%). The explanation may be that the barrels formed by porins share features of globular, water-soluble proteins and thus can be predicted relatively well.
MaxHom alignments for transmembrane proteins. The alignment procedure MaxHom is optimised on globular water-soluble proteins. For transmembrane proteins, the alignments of the more hydrophobic transmembrane segments may require changes in the alignment details. Furthermore, in particular in transmembrane regions, often more distantly related sequences could be aligned by hand based on, e.g., hydrophobicity analyses. Unfortunately, we do not yet provide such refinements of the alignment automatically.
- Multi-domain proteins.
The accuracy for predicting solvent accessibility (PHDacc) for single-domain proteins is higher than for multi-domain proteins. Predictions are more likely to be wrong at interfaces between domains. This shortcoming may be used to predict inter-domain interfaces in regions where PHDacc predicts buried residues that would otherwise not be compatible with your guess about the fold of the protein.
- Novel folds are NOT 'untypical proteins'.
The expected prediction accuracy for the PHD programs has been re-evaluated several times over the last years. So far, the results have always confirmed our estimates (Rost & Sander, Proteins, 1995, 23, 295-300).
- False positives: globular proteins predicted with HTMs.
By default we search for possible transmembrane helices in your sequence. The rate of false positive detection (i.e. proteins falsely predicted to contain transmembrane helices) is about 1.6%. Thus, a reported transmembrane segment may just indicate a rather hydrophobic patch in a globular protein. If you explicitly request a prediction of transmembrane helices (HTMs), we assume that you know the protein to contain HTMs and consequently apply a lower threshold to eliminate false positives.
- Refined prediction of transmembrane helices and topology.
By default we use the neural network system PHDhtm and an empirical filter to predict the locations of transmembrane helices. A refined (more accurate) version of that program, as well as the prediction of transmembrane topology (orientation of the N-terminal non-transmembrane region with respect to the cell), is available upon request ("predict htm topology"). All predicted HTMs are sorted according to the reliability of the prediction. The reliability index provided may help experts to spot falsely predicted HTMs. Note: try NOT to provide sequences that start or end with HTM regions, as this may result in wrong topology predictions!
The reliability indices of the PHD methods correlate well with prediction accuracy. In other words, residues predicted with high reliability (0 = low, 9 = high) are more likely to be predicted correctly. However, when basing the prediction on single sequences (rather than multiple alignments) the scale has to be shifted. For instance, values of RI > 4 usually imply an expected accuracy of > 80% for PHDsec. When using a single sequence as input the same level of accuracy is reached only for residues predicted at RI > 7.
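The shifted scale can be applied when selecting the residues to trust. A minimal sketch, assuming the reliability indices (0 to 9) have already been read from the output into a list of integers; the function name is an illustrative assumption, and the cut-offs are the ones quoted above for PHDsec:

```python
from typing import List

def reliable_positions(reliability: List[int], single_sequence: bool = False) -> List[int]:
    """Indices (0-based) of residues predicted above the reliability cut-off:
    RI > 4 for alignment-based input, RI > 7 for single-sequence input."""
    cutoff = 7 if single_sequence else 4
    return [i for i, ri in enumerate(reliability) if ri > cutoff]
```

With the same list of indices, far fewer residues pass the cut-off when the prediction was based on a single sequence, which reflects the lower information content of that input.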
A combination of two prediction methods is likely to improve the accuracy only if the following points are met:
- (1) the predictions are based on methods using independent information, e.g., prediction-based threading and potential-based threading,
- (2) the accuracy of the two methods is comparable, e.g., NOT for combining Chou-Fasman (about 50% accuracy) and PHDsec (> 72% accuracy).
Say you want to focus on the most likely secondary structure segments. You may hope that the best predicted segments are those for which method X and PHDsec agree. This may or may not be correct. However, it may be more reasonable to identify such regions based on the reliability index provided by PHD. The PHD methods have been tailored to provide a reasonable estimate for the reliability of the prediction, whereas a combination of two arbitrary prediction methods, at best, yields improvements at random.
Ab-initio prediction (by e.g. PHD) is, in general, less accurate than is homology modelling. Thus, if we find a protein of known structure that has > 25% pairwise sequence identity to your sequence, you ought to make use of the known structure by homology modelling.
- Most alignments returned are wrong!
We decided to return quite a number of alignment hits from the threading search. Most of those will be wrong. On the one hand, this is caused by the rather low accuracy of threading methods. On the other hand, successful threading requires experience in analysing protein sequences on the side of the user. Although most hits reported by the threading program (TOPITS) will be wrong, you may arrive at some correct conclusions from the alignments. Just bear in mind: threading is likely to be wrong more often than correct.
- Combining prediction-based and potential-based threading.
The first problem of all threading programs is a high proportion of false positives, i.e., proteins falsely predicted to have a fold similar to the search sequence. One successful strategy to reduce the number of false positives is a combination of the results from prediction-based threading (such as TOPITS) with those from potential-based threading.