
Bridging the protein sequence-structure gap by structure predictions

Rost, Burkhard & Sander, Chris

Annual Review of Biophysics and Biomolecular Structure, 25, 1995, in press.


KEY WORDS: multiple alignments; secondary structure; solvent accessibility; transmembrane helices; inter-residue contacts; homology modelling; threading; knowledge-based mean-force potentials

Introduction				
Sequence alignments			
Evaluation of prediction methods	
Prediction in 1D			
	Secondary structure		
	Solvent accessibility		
	Transmembrane helices		
Prediction in 2D			
	Inter-residue contacts		
	Inter-strand contacts		
	Inter-cysteine contacts		
Prediction in 3D			
	Homology modelling		
	Remote homology modelling (threading)
Analysis of 3D structures		
Conclusion				

Figure captions				
Literature Cited			
Abbreviations used

INTRODUCTION

SEQUENCE-STRUCTURE GAP

Large-scale sequencing projects produce data of gene, and hence protein, sequences at a breathtaking pace. Although determination of protein three-dimensional (3D) structure by crystallography has become more efficient (51), the gap between the number of known sequences (80,000 - 90,000; 5, 7) and the number of known structures (3,000; 8) is rapidly increasing. For many proteins it has been shown that sequence uniquely determines structure, i.e., the entire information for the details of 3D structure is contained in the sequence (4). Thus, in principle, protein structure could be predicted from physico-chemical principles given only the sequence of amino acids. However, in practice, prediction from first principles, e.g. by molecular dynamics, is prevented by the high complexity of protein folding (with required computing time orders of magnitude too high) and by the inaccuracy of the experimental determination of basic parameters (93). Therefore, most protein structure prediction tools are knowledge based using a combination of statistical theory and empirical rules. Given a protein sequence of unknown structure (dubbed U), what can we uncover about the structure of U by using theoretical tools, or what can theory contribute to bridging the sequence-structure gap? >>> insert Figure 1 here <<<

PREDICTION OF PROTEIN STRUCTURE IN 1D, 2D, AND 3D

The most successful tool for prediction of 3D structure is homology modelling. An approximate 3D model (which has a correct fold, but inaccurate loop regions) can be constructed if U has significant similarity to a protein of known structure, evaluated in terms of pairwise sequence identity (i.e. by alignment) or sequence-structure fitness (i.e. threading). Homology modelling effectively raises the number of 'known' 3D structures from 3,000 to about 20,000 (80). Threading methods may be used to make tentative predictions of 3D structure for approximately an additional 7,000 proteins. Consequently, theory-based tools already contribute significantly to bridging the sequence-structure gap (Figure 1). However, if U has no homologue of known 3D structure, we are forced to resort to simplifications of the prediction problem. In the process, we can make use of the rich diversity of information in current databases. In this review, we have focused on generic methods for prediction at three different levels of simplification (Figure 2), namely one, two, and three dimensions (Figure 3). We have included only methods which are available by automatic prediction services or programs and thus could be used in analysing large numbers of sequences, e.g., entire chromosomes (25, 42). The underlying question for every method will be: what is the method's practical contribution to the problems of protein structure prediction and analysis? >>> insert Figure 2 + 3 here <<<

SEQUENCE ALIGNMENTS

EVOLUTION DISTINGUISHES SIGNAL FROM NOISE

At the level of protein molecules, selective pressure results from the need to maintain function, which in turn requires maintenance of the specific 3D structure (21). This is the basis for attempts to align protein sequences, i.e., to optimally detect equivalent positions in strings of amino-acid letters. Accordingly, conservation and mutation patterns observed in alignments contain very specific information about 3D structure. How much variation is tolerated without loss of structure? Two naturally evolved proteins with more than 25% identical residues (length > 80 residues) are very likely to be similar in 3D structure (79). Even so, structure may be conserved in spite of much higher divergence (36). Do we have enough data to detect structure-specific sequence motifs (67) and to correctly align even remote homologues (i.e., sequences with less than 25% pairwise identical residues)?

MULTIPLE ALIGNMENTS IMPROVE AS DATABASES GROW

When the level of pairwise sequence identity is sufficient (say above 40%), alignment procedures are (more or less) straightforward (24, 44, 79). Using fast alignment tools one can scan entire databases containing 100,000 sequences in minutes (FASTA, 65; BLAST, 3). For less similar protein sequences, however, alignments may fail (30, 94). The art of sequence alignment is to accurately align related sequence segments and to avoid aligning unrelated sequence stretches (20, 22, 31, 48, 52, 57, 75, 79, 92). Alignment techniques can be improved by incorporating information derived from 3D structures (30). Profile-based multiple alignments appear to be sensitive and fast enough to scan entire databases if implemented on parallel machines (80). Which alignment method to use?

DRAWBACK: LACK OF SUFFICIENTLY TESTED CUT-OFF CRITERIA

One of the difficulties in comparing different alignment procedures is the lack of well-defined criteria for measuring the quality of an alignment. Very few papers have attempted to define such measures for the comparison of various methods (22, 30). The second problem for users is that most methods do not supply a cut-off criterion for distinguishing between homologous and non-homologous sequences (i.e., false positives). For some large sequence families remote homologues can be aligned correctly (57, 92), but for most cases sequences with less than 25% sequence identity will be false positives, i.e., will have no structural or functional similarity to the guide sequence. A simple length-dependent cut-off based on sequence identity is provided by the program MAXHOM (79). However, this does not quantify the influence of (more subtle) similarities and of the occurrence of gaps.
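The identity-based cut-off discussed above can be illustrated with a short sketch. It is not the MAXHOM criterion, which is length dependent; it only applies the flat rule of thumb quoted earlier in this section (more than 25% identical residues over more than 80 aligned residues). The function names and the gap symbol are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percent identical residues, number of aligned residue pairs)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Only columns in which both sequences carry a residue count as aligned.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0, 0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs), len(pairs)


def likely_similar_structure(aligned_a, aligned_b, min_identity=25.0, min_length=80):
    """Flat threshold (assumption): > 25% identity over > 80 aligned residues."""
    identity, length = percent_identity(aligned_a, aligned_b)
    return length > min_length and identity > min_identity
```

As the text notes, such a measure ignores more subtle similarities and does not penalise gaps; gapped columns are simply skipped here.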

EVALUATION OF PREDICTION METHODS

PUBLISHING OPTIMISTIC RESULTS?

A systematic testing of performance is a precondition for any prediction to become reliably useful. For example, the history of secondary structure prediction has partly been a hunt for highest accuracy scores, with over-optimistic claims by predictors seeding the scepticism of potential users. One major point about prediction methods became clear at the first international meeting for the evaluation of these methods in Asilomar, California, Dec., 1994 (19): exaggerated claims are more damaging than genuine errors. Even a prediction method of limited accuracy can be useful if the user knows what to expect. For the editors of scientific journals this implies that no protein structure prediction method should be published that has not been sufficiently cross-validated. This raises a difficult question: how to evaluate prediction methods?

REQUIREMENTS FOR SUFFICIENT EVALUATION OF PREDICTION METHODS

Given a separation of a data set into a training set (used to derive the method) and a test set (or cross-validation set, used to evaluate performance), a proper evaluation (or cross-validation) of prediction methods, in our view, needs to meet four requirements. (1) No significant pairwise sequence identity between training and test set: the proteins used for setting up a method (training set) and those used for evaluating it (test set) should have a pairwise sequence identity of less than 25% (length-dependent cut-off (79)), otherwise homology modelling could be applied which would be much more accurate than ab initio predictions (74, 76). (2) Comprehensive tests through using a large data set: all available unique proteins should be used for testing (currently more than 400 (32)). The reason for taking as many proteins as possible is simply that proteins vary considerably in structural complexity; certain features are easy to predict, others harder (Figure 5). (3) Avoid comparing apples with oranges: no matter which data sets are used for a particular evaluation, a standard set should be used for which results are also always reported (Figure 4). (4) No optimisation with respect to the test set: a seemingly trivial - and often violated - rule is that methods should never be optimised with respect to the data set chosen for final evaluation. In other words, the test set should never be used before the method is set up. (For example, using a cross-validation set to indicate when over-training on the training data has occurred or to find out how many parameters should be used to describe the model is an implicit use of the cross-validation set in parameter optimisation. The data reserved to test the method should therefore never be used in two ways.)
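Requirement (1) can be sketched as a filter on the candidate test proteins. This is a minimal illustration under stated assumptions: `toy_identity` is a hypothetical, ungapped stand-in for a real alignment-based identity, and the flat 25% threshold replaces the length-dependent cut-off referred to in the text.

```python
def toy_identity(seq_a, seq_b):
    """Percent identical positions over the shorter sequence (ungapped toy measure)."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    return 100.0 * sum(a == b for a, b in zip(seq_a, seq_b)) / n


def filter_test_set(training, candidates, identity=toy_identity, max_identity=25.0):
    """Keep only candidates below the identity threshold to every training protein."""
    return [
        candidate
        for candidate in candidates
        if all(identity(candidate, trained) < max_identity for trained in training)
    ]
```

In practice the `identity` argument would be replaced by a measure computed from a proper pairwise alignment.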

NUMBER OF CROSS-VALIDATION EXPERIMENTS OF NO MEANING

Most methods are evaluated in n-fold cross-validation experiments (splitting the data set into n different training and test sets). How many separations should be used, i.e., which value of n yields the best evaluation? A misunderstanding is widespread in the literature: the more separations (the larger n) the better. However, the exact value of n is not important provided the test set is representative and comprehensive, and the cross-validation results are NOT misused to change parameters again. In other words, the choice of n is of no meaning for the user.
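The splitting step of an n-fold cross-validation experiment can be sketched as follows. The function name and the round-robin assignment of proteins to folds are assumptions made for illustration; any partition in which each protein is tested exactly once would serve.

```python
def n_fold_splits(items, n):
    """Yield (training, test) lists; every item appears in exactly one test set."""
    if not 1 < n <= len(items):
        raise ValueError("n must be between 2 and the number of items")
    # Round-robin assignment: item k goes to fold k % n.
    folds = [items[i::n] for i in range(n)]
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test
```

The sketch only partitions the data; it does not enforce the sequence-identity separation between training and test proteins required above.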

PREDICTION IN 1D

Secondary structure

BASIC CONCEPT

The principal idea underlying most secondary structure prediction methods is the fact that segments of consecutive residues have preferences for certain secondary structure states (46). Thus, the prediction problem becomes a pattern-classification problem tractable by computer algorithms. The goal is to predict whether the residue at the centre of a segment of typically 13-21 adjacent residues is in a helix, strand or in no regular secondary structure. Many different algorithms have been applied to tackle this simplest version of the protein structure prediction problem (70, 72). However, until recently, performance accuracy seemed to have been limited to about 60% (percentage of residues correctly predicted in either α-helix, β-strand or another conformation). How to improve prediction accuracy?
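The window-based formulation and the three-state accuracy measure can be made concrete with a small sketch. The padding symbol, the state letters (H, E, L) and the function names are assumptions; the classifier that maps each window to a state is deliberately left out.

```python
def windows(sequence, width=13, pad="-"):
    """Return one window of `width` residues centred on each residue."""
    if width % 2 == 0:
        raise ValueError("width must be odd so that a central residue exists")
    half = width // 2
    # Pad both ends so that terminal residues also sit at a window centre.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


def q3(predicted, observed):
    """Three-state per-residue accuracy in percent (e.g. states H, E, L)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, `windows("ACDEF", width=3)` gives one three-residue segment per position, and `q3` returns the percentage of residues whose predicted state matches the observed one.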

USE OF EVOLUTIONARY INFORMATION IN SEQUENCES IS THE KEY TO IMPROVED SECONDARY STRUCTURE PREDICTIONS

The first method that reached a sustained level of a three-state prediction accuracy above 70% was the profile-based neural network program PHD which uses multiple sequence alignments as input (70). By stepwise incorporation of more evolutionary information, prediction accuracy can be pushed above 72% accuracy (72). A nearest-neighbour algorithm can be used to incorporate the same information with a similar performance (77) (Figure 4). A method combining statistics and multiple alignment information (53) is clearly less accurate (Figure 4). In comparison to methods using single sequence information only, methods making use of the growing databases are 6-14 percentage points more accurate (Figure 4). Is this improvement of practical use? >>> insert Figure 4 here <<<
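The evolutionary information contained in a multiple alignment is typically condensed into a per-position profile. The sketch below computes plain residue frequencies per alignment column; it is an illustration only and not the input encoding of PHD or of any other cited method. The function name and the treatment of gaps (ignored) are assumptions.

```python
from collections import Counter


def alignment_profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Per-column residue frequencies; gaps and unknown symbols are ignored."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in alphabet)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in alphabet}
        )
    return profile
```

A column that is strictly conserved yields a frequency of 1.0 for one residue, whereas a variable column spreads the frequencies over several residues.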

RECENT IMPROVEMENTS TO SECONDARY STRUCTURE PREDICTION MAKE THESE METHODS OF PRACTICAL USE

How good is a prediction accuracy of 72%? It is certainly reasonably good compared with the prediction of secondary structure by homology modelling (16, 74, 76). In addition, it should be kept in mind that some residues within a structure are predicted at higher levels of accuracy than the mean value, i.e., prediction accuracy is 72%±9% (one standard deviation; Figure 5). Various applications of improved secondary structure predictions prove that predictions are accurate enough to be of practical use (prediction-based threading, (40, 68); inter-strand contact prediction, (39); chain tracing in X-ray crystallography; design of residue mutations). Can we increase the 72%±9% accuracy level by first predicting secondary structure content (proportion of residues in α-helix, β-strand and other) and by then using this initial classification to refine secondary structure prediction? >>> insert Figure 5 here <<<

PREDICTION OF SECONDARY STRUCTURE CONTENT IS NOT VERY USEFUL

Proteins have been partitioned into various structural classes, e.g. based on the percentage of residues assigned to α-helix, β-strand and other (55). However, such a coarse-grained classification is not well-defined (36). Consequently, given a protein sequence U of unknown structure, attempts to first predict the secondary structure content for U and then to use the result to predict the secondary structural class (i.e., all-α, all-β or intermediates) are of limited practical use. How do alignment-based predictions compare to experimental means of determining the content in secondary structure? Surprisingly, PHD was, on average, about as accurate as circular dichroism spectroscopy (70, 72). Of course, this does not imply that predictions can replace experiments. In particular, variation of secondary structure as a result of changes in environmental conditions (e.g. solvent) is generally only accessible experimentally. Can the separation into structural classes be used to improve predictions by tailoring methods with respect to a particular class of proteins, e.g., all-α proteins?

NO SIGNIFICANT GAIN BY SPECIALISING ON ALL-α PROTEINS

One attempt to improve secondary structure predictions was to develop methods specifically for all-α proteins. Two points were often confused in the literature. (1) A two-state accuracy (helix, non-helix) is not comparable to a three-state accuracy (helix, strand, other). For example, PHDsec has an expected three-state accuracy of about 72% and an expected two-state accuracy of about 82% (71). (2) To apply a method specialised on all-α proteins to U, first the structure type of U has to be predicted. Such a prediction has an expected accuracy of 70-80% (72). Even if the accuracy for determining whether U belongs to the all-α class reached almost 100% (99), as recently claimed, specialised methods are currently still not very useful as the improvement in accuracy by specialising on one class has been only marginal (71).

Solvent accessibility

BASIC CONCEPT

The principal goal is to predict the extent to which a residue embedded in a protein structure is accessible to solvent. Solvent accessibility can be described in several ways (73). The simplest is a two-state description distinguishing between residues that are buried (relative solvent accessibility < 16%) and exposed (relative solvent accessibility ≥ 16%). The classical method to predict accessibility is to assign either of the two states, buried or exposed, according to residue hydrophobicity (for overview see 73). However, a neural network prediction of accessibility has been shown to be superior to simple hydrophobicity analyses (33).
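The two-state description can be written down directly. The sketch below assumes that relative accessibility is already available as a percentage per residue (how it is derived from a 3D structure is not covered here), and the function names are hypothetical.

```python
def accessibility_state(relative_accessibility, threshold=16.0):
    """Two-state description: buried below the threshold, exposed otherwise."""
    return "exposed" if relative_accessibility >= threshold else "buried"


def two_state_accuracy(predicted, observed, threshold=16.0):
    """Percentage of residues whose predicted and observed states agree."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("value lists must be non-empty and of equal length")
    agree = sum(
        accessibility_state(p, threshold) == accessibility_state(o, threshold)
        for p, o in zip(predicted, observed)
    )
    return 100.0 * agree / len(observed)
```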

EVOLUTIONARY INFORMATION IMPROVES PREDICTION ACCURACY

Solvent accessibility at each position of the protein structure is evolutionarily conserved within sequence families (73). This fact has been used to develop methods for predicting accessibility using multiple alignment information (6, 73, 97). Prediction accuracy is about 75±7%, four percentage points higher than for methods not using alignment information (Figure 6). Predictions are accurate enough to be used as a seed for predicting secondary structure (6, 97), but not accurate enough to become as useful as secondary structure predictions (68). >>> insert Figure 6 here <<<

Transmembrane helices

BASIC CONCEPT

Even in the optimistic scenario that in the near future most protein structures will be either experimentally determined or theoretically predicted, one class of proteins will still represent a challenge for experimental determination of 3D structure: transmembrane proteins. The major obstacle with these proteins is that they do not crystallise and are hardly tractable by NMR spectroscopy. Consequently, for this class of proteins structure prediction methods are even more needed than for globular water-soluble proteins. Fortunately, the prediction task is simplified by strong environmental constraints on transmembrane proteins: the lipid bilayer of the membrane reduces the degrees of freedom to such an extent that 3D structure formation becomes almost a 2D problem. Once the location of transmembrane segments is known for helical transmembrane proteins, 3D structure can be predicted by exploring all possible conformations (91). Additionally, predicting the locations of these transmembrane helices is a much simpler problem than is the prediction of secondary structure for soluble proteins. Elaborate combinations of expert-rules, hydrophobicity analyses and statistics yield a two-state per-residue accuracy above 90% (43, 69, 83, 95). Can evolutionary information be used once more to improve prediction accuracy?
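The hydrophobicity analyses mentioned above can be illustrated by a sliding-window hydropathy scan. This is a minimal sketch and not one of the cited methods: it assumes the Kyte-Doolittle hydropathy scale, a 19-residue window and a cut-off of 1.6, none of which are specified in the text, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not taken from the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, width=19):
    """Mean hydropathy per full window; entry i covers sequence[i:i + width]."""
    values = [KYTE_DOOLITTLE[res] for res in sequence]
    return [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]


def candidate_segments(sequence, width=19, threshold=1.6):
    """Merge windows at or above the threshold into (start, end) ranges, end exclusive."""
    segments = []
    for i, score in enumerate(hydropathy_profile(sequence, width)):
        if score < threshold:
            continue
        if segments and i <= segments[-1][1]:
            # Window overlaps or touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], i + width)
        else:
            segments.append((i, i + width))
    return segments
```

The ranges returned are rough candidates only; because whole windows are merged, they tend to overshoot the hydrophobic stretch at both ends.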

EVOLUTIONARY INFORMATION IMPROVES PREDICTION ACCURACY

For two methods the use of multiple alignment information is reported to clearly improve the accuracy of predicting transmembrane helices (66, 69). The best current prediction methods have a similarly high accuracy of around 95%. As reliable data for the locations of transmembrane helices exist only for a few proteins, data used for deriving these methods originate predominantly from experiments in cell biology and gene-fusion techniques. Different authors often report different locations for transmembrane regions. Thus, the level of 95% accuracy is not verifiable. Despite this uncertainty in detail, the prediction of transmembrane helices is a valuable tool to quickly scan entire chromosomes (69). The classification into membrane/not-membrane proteins has an expected error rate of less than five percent, i.e., about five percent of the proteins predicted to contain transmembrane regions will probably be false positives. Can the prediction of transmembrane helices be extended to a 3D structure prediction?

PREDICTION OF 3D STRUCTURE FOR HELICAL TRANSMEMBRANE PROTEINS?

Cytoplasmic and extracellular regions have different amino acid compositions (61, 95). This difference allows for a successful prediction of the orientation of transmembrane helices with respect to the cell (pointing inside or outside the cell; 43, 83). Such predictions are estimated to be correct in more than 75% of all proteins (43). Going one step further, Taylor and colleagues (91) have correctly predicted the 3D structure for the membrane-spanning regions of G-protein-coupled receptors (seven helices) when starting from the known locations of the helices. For a successful automatic prediction of 3D structure from sequence, the N- and C-terminal ends of transmembrane helices have to be predicted very accurately. It remains to be tested whether or not current prediction methods for the location of transmembrane helices are sufficiently accurate to automatically predict 3D structure of integral membrane proteins.

PREDICTION IN 2D

Inter-residue contacts

PREDICTION PROBLEM IS A HARD ONE, BUT THE STAKES ARE HIGH

Given all inter-residue contacts or distances (Figure 2), 3D structure can be reconstructed by distance geometry (13, 63). Distance geometry is used for the determination of 3D structures by nuclear magnetic resonance (NMR) spectroscopy which produces experimental data of distances between protons (13). Can inter-residue contacts be predicted? Obviously, some fraction of these contacts can be: helices and strands can be assigned based on hydrogen-bonding pattern between residues (45). Thus, a successful prediction of secondary structure implies a successful prediction of some fraction of all the contacts. However, contacts predicted from secondary structure assignment are short-ranged, i.e., between residues nearby in sequence. For a successful application of distance geometry, long-range contacts have to be predicted, i.e., contacts between residues far apart in the sequence. A few methods have been proposed for the prediction of long-range inter-residue contacts. Two questions surround such methods: first, how accurate are these prediction methods on average; and second, are all important contacts predicted?
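The distinction between short-range and long-range contacts can be made concrete by listing contacts from a set of coordinates. In this sketch one point per residue, a distance cut-off of 8 Å and the function name are all assumptions; the text does not define a contact threshold.

```python
import math


def contact_pairs(coords, max_distance=8.0, min_separation=1):
    """List residue pairs (i, j), i < j, closer than max_distance.

    min_separation is the minimum distance along the sequence; raising it
    keeps only contacts between residues further apart in the sequence.
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= max_distance:
                contacts.append((i, j))
    return contacts
```

Calling the function with a larger `min_separation` filters out the contacts that follow trivially from local secondary structure.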

CORRELATED MUTATIONS CAN IMPLY SPATIAL PROXIMITY

In sequence alignments, some pairs of positions appear to co-vary in a physico-chemically plausible manner, i.e., a 'loss of function' point mutation is often rescued by an additional mutation that compensates for the change (2). One hypothesis is that compensation would be most effective in maintaining a structural motif if the mutated residues were spatial neighbours. Attempts have been made to quantify such a hypothesis (62, 90) and to use it for contact predictions (29, 81). Applying a stringent significance cut-off in the prediction of contacts by correlated mutations, a small number of residue contacts can be predicted between 1.4 and 5.1 times better than random (29); further slight improvements are possible (D Thomas, unpublished). Are these predictions accurate enough to apply distance geometry to the results?
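Co-variation between two alignment positions can be quantified in several ways. The sketch below uses mutual information between alignment columns as a simple stand-in; it is not the correlation measure of the cited studies, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    pairs = [(row[i], row[j]) for row in alignment]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi
```

Two columns that always change together score high, whereas a strictly conserved column scores zero against any other column, which is one reason why a stringent significance cut-off is needed before interpreting such values as contacts.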

DISTINCTION BETWEEN DIFFERENT MODELS, NO PREDICTION OF 3D, YET

Analysing correlated mutations is only one way to predict long-range inter-residue contacts. Other methods use statistics (26), mean-force potentials (X Tamames, A Valencia, unpublished), or neural networks (10). So far none of the methods appears to find a path between the Scylla of missing too many true contacts and the Charybdis of predicting too many false contacts. However, some of the methods may provide sufficient information to distinguish between alternative models of 3D structure (A Valencia, unpublished). The ambitious goal to predict long-range inter-residue contacts sufficiently accurately will hopefully continue to attract intellectual resources.

Inter-strand contacts

SIMPLIFYING THE CONTACT PREDICTION PROBLEM

One simplification of the problem to predict inter-residue contacts focuses on predicting the contacts between residues in adjacent β-strands. Such an attempt is motivated by the hope that such interactions are more specific than sequence-distant (long-range) contacts in general and hence are easier to predict.

IDENTIFYING THE CORRECT β-STRAND ALIGNMENT

The only method published for predicting inter-strand contacts is based on potentials of mean-force (39) similar to those used in the evaluation of strand-strand threading (56). Propensities are compiled by database counts for 2 × 2 × 2 classes (parallel/anti-parallel, H-bonded/not H-bonded, N-/C-terminal). Each of the eight classes is divided further into five sub-classes in the following way. Suppose the two strand residues at positions i and j are close in space. Then the following five residue pairs are counted in separate tables: i/j-2, i/j-1, i/j, i/j+1, i/j+2. Such pseudo-potentials identify the correct β-strand alignment in 35-45% of the cases. Can these potentials be used to predict inter-strand contacts?
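The bookkeeping described above (eight classes, each with five offset tables, i.e. 40 count tables) can be sketched as follows. Only the counting step is shown; the conversion of counts into pseudo-potentials is omitted, the class labels are treated as opaque strings, and all names are assumptions made for illustration.

```python
from itertools import product

ORIENTATIONS = ("parallel", "anti-parallel")
BONDING = ("H-bonded", "not H-bonded")
TERMINI = ("N-terminal", "C-terminal")
OFFSETS = (-2, -1, 0, 1, 2)


def empty_count_tables():
    """One residue-pair count table per (class, offset): 2 * 2 * 2 * 5 = 40."""
    return {key: {} for key in product(ORIENTATIONS, BONDING, TERMINI, OFFSETS)}


def count_pair(tables, orientation, bonding, terminus, sequence, i, j):
    """Record the pairs i/j-2 ... i/j+2 for strand residues i and j in contact."""
    for offset in OFFSETS:
        k = j + offset
        if 0 <= k < len(sequence):
            table = tables[(orientation, bonding, terminus, offset)]
            pair = (sequence[i], sequence[k])
            table[pair] = table.get(pair, 0) + 1
```

Each observed contact therefore contributes up to five counts, one per offset table of the class to which the contact belongs.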

USING EVOLUTIONARY INFORMATION TO PREDICT INTER-STRAND CONTACTS

Even if the locations of β-strands in the sequence are known exactly, the pseudo-potentials cannot predict the correct inter-strand contacts in most cases (39). However, when using multiple alignment information, the signal-to-noise ratio increases such that inter-strand contacts have been predicted correctly for most of the strands inspected in some test cases (39). For the purpose of reliable contact prediction, this result is inadequate, especially as the locations of the strands are not known precisely. Can the pseudo-potentials handle errors resulting from incorrect prediction of strands? Various test examples using predictions by PHDsec (72) as input to the β-strand pseudo-potentials indicate that the accuracy in predicting inter-strand contacts drops (T Hubbard, unpublished), but in some cases is still high enough to be useful for approximate modelling of 3D structure (40).

Inter-cysteine contacts

A VERY SIMPLE CONTACT PREDICTION PROBLEM

An extreme simplification of the contact prediction problem focuses on predicting contacts between cysteine residues (disulphide bridges). Previously, such contacts were obtained by experimental protein sequencing techniques. In the age of gene-sequencing projects, however, disulphide bridges are not part of the sequence information anymore. Disulphide bond predictions are interesting for two reasons. Firstly, disulphide bridges are crucial for structure formation of many proteins. Secondly, contacts between cysteines account for the most dominant signal in predicting inter-residue contacts by mean-force potentials. How can cysteine-bridges be predicted?

SO FAR ONLY PREDICTION IN 1D

One method for the prediction of disulphide-bonds uses a neural network to predict the bonding state of single cysteines (60), i.e., the goal is not to predict which cysteine pair is in contact, but whether or not a cysteine residue is in contact to any other one. Thus, strictly speaking, the method operates in 1D. One result is that the cysteine bonding state appears to be influenced by the local sequence environment of up to 15 adjacent residues. Prediction accuracy in two states is claimed to be about 80%. However, the result may be over-optimistic for two reasons. First, the test set was rather small (140 examples). Second, in the cross-validation experiments training and testing examples were not separated based on the level of pairwise sequence identity. Therefore, the question of how accurately inter-cysteine contacts can be predicted remains to be answered.

PREDICTION IN 3D

Homology modelling

BASIC CONCEPT

An analysis of PDB reveals that all protein pairs with more than 30% pairwise sequence identity (for alignment length > 80; 79) have homologous 3D structures, i.e., the essential fold of the two proteins is identical, although details such as additional loop regions may vary. Structure is more conserved than is sequence. This is the pillar for the success of homology modelling. The principal idea is to model the structure of U (protein of unknown structure) based on the template of a sequence homologue of known structure. Consequently, the precondition for homology modelling is that a sequence homologue of known structure is found in PDB. Since homology modelling is currently the only theoretical means to successfully predict 3D structure, this has two implications. First, homology modelling is applicable to 'only' one quarter of the known protein sequences (Figure 1). Second, as the template of a homologue is required, no unique 3D structure can yet be predicted, i.e., no structure that has no similarity to any experimentally determined 3D structure. Suppose there is a protein in PDB with a sequence similar to U (say HU): is homology modelling straightforward?

HIGH LEVEL OF SEQUENCE IDENTITY: ATOMIC RESOLUTION

The basic assumption of homology modelling is that U and HU have identical backbones. The task is to correctly place the side chains of U into the backbone of HU. For very high levels of sequence identity between U and HU (ideally differing by one residue only), side chains can be 'grown' during molecular dynamics simulations (17, 47). For slightly lower levels (still of high sequence similarity), side chains are built based on similar environments in known structures (18, 23, 54, 58, 78, 89, 96). Rotamer libraries are used in the following way (18). (1) Rotamer distributions are extracted from a database of non-redundant sequences. (2) Fragments of seven (helix, strand) or five residues (other) are compiled. (3) Fragments of the same length are successively shifted through the backbone of U. (4) For modelling the side chains of U, only those fragments from the rotamer library are accepted which have the same amino acid in the centre as U, and for which the local backbone is similar to that around the evaluated position. Over the whole range of sequence identity between U and HU for which homology modelling is applicable, the accuracy of the model drops with decreasing similarity. For levels of at least 60% sequence identity, the resulting models are quite accurate (18) (for even higher values, the models are as accurate as is experimental structure determination). The limiting factor is the computation time required (34). How accurate is homology modelling for lower levels of sequence identity?

LOW LEVEL OF SEQUENCE IDENTITY: LOOP REGIONS SOMETIMES CORRECT

With decreasing sequence identity the number of loops inserted grows. An accurate modelling of loop regions, however, implies solving the structure prediction problem. The problem is simplified in two ways. First, loop regions are often relatively short and can thus be simulated by molecular dynamics (note the CPU time required for molecular dynamics simulations grows exponentially with the number of residues of the polypeptide to be modelled). Second, the ends of the loop regions are fixed by the backbone of the template structure. Various methods are employed to model loop regions. The best have the orientation of the loop regions correct in some cases (e.g. 1). Below about 40% sequence identity the accuracy of the sequence alignment used as basis for homology modelling becomes an additional problem. However, even down to levels of 25-30% sequence identity, homology modelling produces coarse-grained models for the overall fold of proteins of unknown structure.

Remote homology modelling (threading)

BASIC CONCEPT

As noted in the previous section, naturally evolved sequences with more than 30% pairwise sequence identity have homologous 3D structures (79). Are all others non-homologous? Not at all. In the current PDB database there are thousands of pairs of structurally homologous proteins with less than 25% pairwise sequence identity (remote homologues) (36). If a correct alignment between U (sequence of unknown structure) and a remote homologue RU (pairwise sequence identity to U < 25%) is given, one could build the 3D structure of U by homology modelling based on the template of RU (remote homology modelling). A successful remote homology modelling must solve three different tasks. (i) The remote homologue (RU) has to be detected. (ii) U and RU have to be correctly aligned. (iii) The homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity (with many loop regions to be modelled). Most methods developed so far have primarily addressed the detection of similar folds (i). The basic idea is to thread the sequence of U into the known structure of RU and to evaluate the fitness of sequence for structure by some kind of environment-based or knowledge-based potential (14, 86). Threading is in some respects a harder problem than is the prediction of 3D structure (50, 86). However, solving it would enable the prediction of thousands of protein structures (Figure 1). Can this hard nut be cracked?

SOMETIMES REMOTE HOMOLOGUES ARE CORRECTLY IDENTIFIED

The optimism generated by one of the first papers on threading published in the 90s (11) has boosted attempts to develop threading methods (86). Most methods are based on pseudo-potentials and differ in the way such potentials are derived from PDB (98). One alternative is to use 1D predictions for the threading procedure (68; G Barton, unpublished; F Drabløs, unpublished). The good news after half a decade of intensive research by dozens of groups is that all potentials capture different aspects, and it is likely that the correct remote homologue is found by at least one of them (82). The bad news is that no single method is accurate enough to correctly identify the remote homologue in most cases (82). Instead, evaluated on a larger test set, the correct remote homologue appears to be detected in about 30% of all cases (68). Unfortunately, this is only the first of the three tasks for successful remote homology modelling; the second (correct alignment of U and RU) is even harder. In many of the cases for which RU is correctly identified as remote homologue of U, the alignment of U and RU is flawed in significant ways (unpublished data). This is fatal for the third step, the model-building procedure. Thus, is threading useful at all?

THREADING IS USEFUL FOR SCEPTICAL USERS

Like all prediction methods, threading techniques are not error-proof. One of the practical disadvantages of current tools is the lack of a successful measure for prediction reliability, such as that established for secondary structure prediction (Figure 5). The conclusion seems to be that threading methods can be useful in the hands of rather sceptical expert users who can spot wrong hits and false alignments, even when the method assigns a high confidence value to an erroneous prediction. Three points may be added. First, threading techniques can clearly widen the range of successful sequence alignments (68). Second, some methods are accurate enough to be used in scanning entire chromosomes for remote homologues (12). Third, threading techniques may still become one of the most successful tools in structure prediction, but a lot of detailed work lies ahead.

ANALYSIS OF 3D STRUCTURES

SUCCESSFUL IDENTIFICATION OF NATIVE-LIKE 3D STRUCTURES

A successful idea was to replace inductive force-fields capturing the heuristics of physical principles by deductive knowledge-based mean-force potentials (e.g. 84). Such potentials, as well as more expert-knowledge-oriented approaches (49, 96), enable the detection of subtle stresses or possible errors in both experimentally determined 3D structures and predicted models (85). Knowledge-based potentials of mean-force appear to be valid even for proteins with properties not used for deriving the potentials (membrane proteins, (85); coiled-coils, S O'Donoghue, unpublished). Due to this success, quality control tools using these potentials are becoming a routine check applied to any experimentally determined structure or any structure predicted by homology modelling.
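
How a knowledge-based potential of mean force can be derived from a database of structures may be sketched as follows. The sketch assumes the generic inverse-Boltzmann form E(r) = -kT ln(f_pair(r) / f_ref(r)), where f_pair is the observed distance distribution for one residue-pair type and f_ref the distribution over all pairs. The binning, the pseudocount and the function name are illustrative assumptions and not the exact procedure of the cited work (84).

```python
import math


def mean_force_potential(pair_counts, reference_counts, kT=1.0, pseudocount=1.0):
    """Convert distance-bin counts into energies by Boltzmann inversion:
    E(r) = -kT * ln(f_pair(r) / f_ref(r)).

    pair_counts      -- counts per distance bin for one residue-pair type
    reference_counts -- counts per distance bin over all residue pairs
    pseudocount      -- added to every bin to avoid log(0) for sparse data
                        (an assumed smoothing choice; with pseudocount=0
                        every bin must contain at least one observation)
    """
    if len(pair_counts) != len(reference_counts):
        raise ValueError("both histograms must use the same distance bins")
    n_bins = len(pair_counts)
    pair_total = sum(pair_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for n_pair, n_ref in zip(pair_counts, reference_counts):
        f_pair = (n_pair + pseudocount) / pair_total
        f_ref = (n_ref + pseudocount) / ref_total
        energies.append(-kT * math.log(f_pair / f_ref))
    return energies


# Usage: a pair type that is over-represented in the first distance bin
# receives a favourable (negative) energy there.
energies = mean_force_potential([30, 10, 10], [100, 100, 100], pseudocount=0.0)
```

Summing such energies over all residue pairs of a model gives a score that can be compared with the scores of alternative conformations, which is the sense in which these potentials are used for detecting errors in structures and models.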

EXPLORING THE DIVERSITY OF PROTEIN FUNCTION BY STRUCTURAL ALIGNMENTS

More and more frequently, a newly determined structure is found to be remotely homologous to a known structure (38). Recently developed algorithms enable routine scans for possible remote homologues in PDB for any new structure (35, 41, 59, 64, 75, 88). Such searches are beginning to rival sequence database searches as a tool for discovering biologically interesting relationships (38). Similar techniques can often be exploited to determine domains in known structures (27, 37, 87).

CONCLUSION

PREDICTION IN 3D: THEORY BRIDGES THE SEQUENCE-STRUCTURE GAP

Three-dimensional structure cannot yet be predicted reliably from sequence information alone. In other words, the only source of new, unique structures (structures for which no homologue exists in the database) is experiment. However, given the amount of time needed to determine a protein structure experimentally, more non-unique structures can be predicted at atomic resolution by homology modelling in a month than have been determined by experiment over the last three decades. Unfortunately, such models typically have considerable coordinate errors in loop regions, and remote homology modelling (i.e. homology modelling for < 25% pairwise sequence identity, often termed threading) is not yet reliable. For a few cases, however, threading techniques have already resulted in accurate modelling of the overall fold (86).

PREDICTIONS IN 1D: SIGNIFICANT IMPROVEMENT BY LARGER DATABASES

The rich information contained in the growing sequence and structure databases has been used to improve the accuracy of predictions of some aspects of protein structure. Predictions of secondary structure, solvent accessibility and transmembrane helices are becoming increasingly useful. This success results both from the better performance of multiple alignment-based methods and from the ability to focus on more reliably predicted regions. Some methods have indicated that 1D predictions can be useful as an intermediate step on the way to predicting 3D structure (inter-strand contacts; prediction-based threading). Another advantage of predictions in 1D is that they are not very CPU-intensive, i.e., 1D structure can be predicted overnight for the protein sequences of, for example, entire yeast chromosomes.

PREDICTIONS IN 2D: SO FAR OF LIMITED SUCCESS

The prediction accuracy of chain-distant inter-residue contacts is so far relatively limited. Analysis of correlated mutations can be used to distinguish between alternative models (e.g. for threading techniques). The prediction of inter-strand contacts appears to be useful in some cases. An accurate method for the automatic prediction of contacts between residues not close in sequence remains to be developed.

ANALYSIS OF PROTEIN STRUCTURES HAS INCREASING IMPACT ON BIOLOGY

Another encouraging development is the improvement of tools for the analysis of protein structures. Experimental inconsistencies can be spotted, and predicted models can be tested. The ease of scanning structure databases for remote homologues yields a wealth of information with impact on our understanding of protein structure and function.

ACKNOWLEDGEMENT

We are indebted to Kimmen Sjölander (Santa Cruz), who spent an enormous amount of energy helping us to improve the quality of the manuscript, at the expense of her own work.

Figure captions

Figure 1 Bridging the sequence-structure gap by experiment and theory. The full clock cycle corresponds to all protein sequences stored in SWISSPROT (release 31 with 44,000 sequences). (a) fraction of proteins for which 3D structure has been experimentally determined (sequence unique: <25% pairwise sequence identity; structure unique: unique overall fold-type as defined by 36). (b) fraction of proteins for which 3D structure can be predicted by homology modelling (margin for threading hypothetical). Note: unique 3D structures cannot be predicted, yet.

Figure 2 Representation of scorpion neurotoxin (PDB code 2sn3) in one, two, and three dimensions. Each of the representations gives rise to a different type of prediction. (1D) Seq, sequence in one-letter alphabet; Sec, secondary structure, with H for helix, E for strand and blank for other; Acc, relative solvent accessibility (note: integer n codes for a relative accessibility of n × n %). (2D) inter-residue contact-map (sequence positions 1 to 65 plotted from left to right, and from top to bottom); squares indicate that the respective residue pair is in contact. (3D) the trace of the protein chain in three dimensions is plotted schematically as a ribbon α-carbon trace. The two strands are indicated by arrows, the helix is marked by a cylinder. Graphs were generated using WHAT IF (96).

Figure 3 Summary of the tools available for sequence (a) and structure (b) analysis reviewed here. The arrows indicate the input information used for a given method, e.g., secondary structure can be predicted from single sequences and alignments; the prediction can be used for prediction of inter-strand contacts and threading.

Figure 4 Accuracy of secondary structure prediction for various prediction methods. Abbreviations for methods: RAN and HM, for comparison the results of the worst (random) and the best (homology modelling) possible predictions are given (74); Chou-Fasman, GORIII, and COMBINE: early prediction methods based on single sequence information (9, 15, 28) (note, these methods are still widely used by standard sequence analysis packages); LPAG, multiple alignment-based method using statistics (53); NNSSP, multiple alignment-based method using nearest neighbour algorithms (77); PHDsec, multiple alignment-based neural network prediction (72). The groups indicate identical test sets, e.g., GORIII is about eight percentage points less accurate than LPAG, which uses the same algorithm but additionally multiple alignments, and PHDsec is another six percentage points more accurate than LPAG by using neural networks instead of statistics.

Figure 5 Secondary structure prediction accuracy for PHDsec evaluated on 337 protein families. (a) Prediction accuracy varies considerably between protein families. One standard deviation is nine percentage points, so prediction accuracy for most sequences is 63-81% and the average accuracy is 72%. Because of this significant variation, prediction methods have to be evaluated on a sufficiently large set of unique proteins. (b) Residues with a higher reliability index are predicted with higher accuracy. For example, for 44% of all residues, prediction accuracy is, on average, 88% (dashed line), i.e., comparable to homology modelling if it were applicable. In practice, it is recommended that attention be focused on the most reliably predicted residues.

Figure 6 Two-state accuracy of predicting relative accessibility. Abbreviations for methods: RAN and HM, for comparison the results of the worst (random) and the best (homology modelling) possible predictions are given (73); HMG 1990, neural network using single sequence input (33); W&B 1994, multiple alignment-based prediction method using rather sophisticated expert rules and statistics (97); PHDacc, multiple alignment-based neural network prediction (73). The groups indicate identical test sets.

Literature Cited


1.   Abagyan R, Totrov M. 1994. Biased Probability Monte Carlo Conformational Searches and Electrostatic Calculations for Peptides and Proteins. J. Mol. Biol. 235:983-1002
2.   Altschuh D, Vernet T, Moras D, Nagai K. 1988. Coordinated amino acid changes in homologous protein families. Prot. Engin. 2:193-199
3.   Altschul SF. 1993. A Protein Alignment Scoring System Sensitive at All Evolutionary Distances. J. Mol. Evol. 36:290-300
4.   Anfinsen CB, Scheraga HA. 1975. Experimental and theoretical aspects of protein folding. Adv. Prot. Chem. 29:205-300
5.   Bairoch A, Boeckmann B. 1994. The SWISS-PROT protein sequence data bank: current status. Nucl. Acids Res. 22:3578-3580
6.   Benner SA, Badcoe I, Cohen MA, Gerloff DL. 1994. Bona Fide Prediction of Aspects of Protein Conformation. J. Mol. Biol. 235:926-58
7.   Benson D, Lipman DJ, Ostell J. 1993. GenBank. Nucl. Acids Res. 21:963-75
8.   Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, et al. 1977. The Protein Data Bank: a computer based archival file for macromolecular structures. J. Mol. Biol. 112:535-42
9.   Biou V, Gibrat JF, Levin JM, Robson B, Garnier J. 1988. Secondary structure prediction: combination of three different methods. Prot. Engin. 2:185-91
10.   Bohr H, Bohr J, Brunak S, Fredholm H, Lautrup B, et al. 1990. A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Lett. 261:43-6
11.   Bowie JU, Lüthy R, Eisenberg D. 1991. A Method to Identify Protein Sequences That Fold into a Known Three-Dimensional Structure. Science 253:164-9
12.   Braxenthaler M, Sippl M. 1995. Screening genome sequences for known folds. In Protein structure by distance analysis, ed. H Bohr, S Brunak, pp. CRC Press
13.   Brünger AT, Nilges M. 1993. Computational challenges for macromolecular structure determination by X-ray crystallography and solution NMR-spectroscopy. Quart. Rev. Biophys. 26:49-125
14.   Bryant SH, Altschul SF. 1995. Statistics of sequence-structure threading. Curr. Opin. Str. Biol. 5:236-44
15.   Chou PY, Fasman GD. 1978. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. 47:45-148
16.   Colloc'h N, Etchebest C, Thoreau E, Henrissat B, Mornon J-P. 1993. Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Prot. Engin. 6:377-82
17.   Cornell WD, Howard AE, Kollman P. 1991. Molecular mechanical potential functions and their application to study molecular systems. Curr. Opin. Str. Biol. 1:201-12
18.   De Filippis V, Sander C, Vriend G. 1994. Predicting local structural changes that result from point mutations. Prot. Engin. 7:1203-8
19.   Defay T, Cohen FE. 1995. Evaluation of current techniques for ab-initio protein structure prediction. Proteins. In press
20.   Deperieux E, Feytmans E. 1992. MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. CABIOS 8:501-9
21.   Doolittle RF. 1994. Convergent evolution: the need to be explicit. TIBS 19:15-8
22.   Eddy SR. 1995. Multiple alignment using hidden Markov models. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England, eds. C Rawlings, D Clark, R Altman, L Hunter, T Lengauer, et al, pp. 114-120. Menlo Park, CA: AAAI
23.   Eisenmenger F, Argos P, Abagyan R. 1993. A Method to Configure Protein Side-chains from the Main-chain Trace in Homology Modelling. J. Mol. Biol. 231:849-60
24.   Flores TP, Orengo CA, Moss DS, Thornton JM. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Prot. Sci. 2:1811-26
25.   Gaasterland T, Selkov E. 1995. Reconstruction of metabolic networks using incomplete information. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England, eds. C Rawlings, D Clark, R Altman, L Hunter, T Lengauer, et al, pp. 127-35. Menlo Park, CA: AAAI
26.   Galaktionov SG, Marshall GR. 1994. Properties of Intraglobular Contacts in Proteins: An Approach to Prediction of Tertiary Structure. In 27th Hawaii International Conference on System Sciences, Wailea, HI, U.S.A., ed. L Hunter, pp. 326-35. IEEE Computer Society
27.   Gerstein M, Sonnhammer ELL, Chothia C. 1994. Volume Changes in Protein Evolution. J. Mol. Biol. 236:1067-78
28.   Gibrat J-F, Garnier J, Robson B. 1987. Further Developments of Protein Secondary Structure Prediction Using Information Theory. New Parameters and Consideration of Residue Pairs. J. Mol. Biol. 198:425-43
29.   Goebel U, Sander C, Schneider R, Valencia A. 1994. Correlated mutations and residue contacts in proteins. Proteins 18:309-17
30.   Henikoff S, Henikoff JG. 1993. Performance evaluation of amino acid substitution matrices. Proteins 17:49-61
31.   Henikoff S, Henikoff JG. 1994. Position-based sequence weights. J. Mol. Biol. 243:574-8
32.   Hobohm U, Sander C. 1994. Enlarged representative set of protein structures. Prot. Sci. 3:522-524
33.   Holbrook SR, Muskal SM, Kim S-H. 1990. Predicting surface exposure of amino acids from protein sequence. Prot. Engin. 3:659-65
34.   Holm L, Rost B, Sander C, Schneider R, Vriend G. 1994. Data based modeling of proteins. In Statistical Mechanics, Protein Structure, and Protein Substrate Interactions, New York, ed. S Doniach, pp. 277-96. Plenum
35.   Holm L, Sander C. 1993. Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol. 233:123-38
36.   Holm L, Sander C. 1994. The FSSP database of structurally aligned protein fold families. Nucl. Acids Res. 22:3600-9
37.   Holm L, Sander C. 1994. Parser for protein folding units. Proteins 19:256-68
38.   Holm L, Sander C. 1994. Searching Protein Structure Databases Has Come of Age. Proteins 19:165-73
39.   Hubbard TJP. 1994. Use of β-strand interaction pseudo-potential in protein structure prediction and modelling. In 27th Hawaii International Conference on System Sciences, Maui, Hawaii, USA, ed. L Hunter, pp. 336-44. IEEE Society
40.   Hubbard TJP, Park J. 1995. Fold recognition and ab initio structure predictions using Hidden Markov models and β-strand pair potentials. Proteins. In press
41.   Johnson MS, Overington JP, Blundell TL. 1993. Alignment and searching for common protein folds using a data bank of structural templates. J. Mol. Biol. 231:735-52
42.   Johnston M, Andrews S, Brinkman R, Cooper J, Ding H, et al. 1994. Complete nucleotide sequence of Saccharomyces cerevisiae chromosome VIII. Science 265:2077-82
43.   Jones DT, Taylor WR, Thornton JM. 1992. A new approach to protein fold recognition. Nature 358:86-9
44.   Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275-82
45.   Kabsch W, Sander C. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22:2577-637
46.   Kabsch W, Sander C. 1984. On the use of sequence homologies to predict protein structure: Identical pentapeptides can have completely different conformations. Proc. Natl. Acad. Sci. U.S.A. 81:1075-8
47.   Karplus M, Petsko GA. 1990. Molecular dynamics simulations in biology. Nature 347:631-9
48.   Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. 1994. Hidden Markov Models in Computational Biology: Applications to Protein Modeling. J. Mol. Biol. 235:1501-31
49.   Laskowski RA, Moss DS, Thornton JM. 1993. Main-chain bond lengths and bond angles in protein structures. J. Mol. Biol. 231:1049-67
50.   Lathrop RH. 1994. The protein threading problem with sequence amino acid interaction preferences is NP-complete. Prot. Engin. 7:1059-68
51.   Lattman EE. 1994. Protein crystallography for all. Proteins 18:103-6
52.   Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, et al. 1993. Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262:208-14
53.   Levin JM, Pascarella S, Argos P, Garnier J. 1993. Quantification of Secondary Structure Prediction Improvement Using Multiple Alignments. Prot. Engin. 6:849-54
54.   Levitt M. 1992. Accurate Modeling of Protein Conformation by Automatic Segment Matching. J. Mol. Biol. 226:507-33
55.   Levitt M, Chothia C. 1976. Structural patterns in globular proteins. Nature 261:552-8
56.   Lifson S, Sander C. 1980. Specific recognition in the tertiary structure of β-sheets in proteins. J. Mol. Biol. 139:627-39
57.   Livingstone CD, Barton GJ. 1993. Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. CABIOS 9:745-56
58.   May ACW, Blundell TL. 1994. Automated comparative modelling of protein structures. Curr. Opin. Biotech. 5:355-60
59.   Mitchell EM, Artymiuk PJ, Rice DW, Willett P. 1992. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212:151-66
60.   Muskal SM, Holbrook SR, Kim S-H. 1990. Prediction of the disulfide-bonding state of cysteine in proteins. Prot. Engin. 3:667-72
61.   Nakashima H, Nishikawa K. 1992. The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett. 303:141-6
62.   Neher E. 1994. How frequent are correlated changes in families of protein sequences? Proc. Natl. Acad. Sci. U.S.A. 91:98-102
63.   Nilges M. 1993. A Calculation Strategy for the Structure Determination of Symmetric Dimers by 1H NMR. Proteins 17:297-309
64.   Orengo CA, Brown NP, Taylor WR. 1992. Fast structure alignment for protein databank searching. Proteins 14:139-67
65.   Pearson WR, Lipman DJ. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444-8
66.   Persson B, Argos P. 1994. Prediction of Transmembrane Segments in Proteins Utilising Multiple Sequence Alignments. J. Mol. Biol. 237:182-92
67.   Rooman M, Wodak SJ. 1988. Identification of predictive sequence motifs limited by protein structure data base size. Nature 335:45-9
68.   Rost B. 1995. TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, eds. C Rawlings, D Clark, R Altman, L Hunter, T Lengauer, et al, pp. 314-21. Menlo Park, CA: AAAI
69.   Rost B, Casadio R, Fariselli P, Sander C. 1995. Prediction of helical transmembrane segments at 95% accuracy. Prot. Sci. 4:521-33
70.   Rost B, Sander C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584-99
71.   Rost B, Sander C. 1993. Secondary structure prediction of all-helical proteins in two states. Prot. Engin. 6:831-6
72.   Rost B, Sander C. 1994. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19:55-72
73.   Rost B, Sander C. 1994. Conservation and prediction of solvent accessibility in protein families. Proteins 20:216-26
74.   Rost B, Sander C, Schneider R. 1994. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235:13-26
75.   Russell RB, Barton GJ. 1992. Multiple Protein Sequence Alignment From Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels. Proteins 14:309-23
76.   Russell RB, Barton GJ. 1993. The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J. Mol. Biol. 234:951-7
77.   Salamov AA, Solovyev VV. 1995. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247:11-5
78.   Sali A, Blundell T. 1994. Comparative Protein Modelling by Satisfaction of Spatial Restraints. In Protein Structure by Distance Analysis, ed. H Bohr, S Brunak, pp. 64-87. Amsterdam, Oxford, Washington: IOS Press
79.   Sander C, Schneider R. 1991. Database of homology-derived structures and the structural meaning of sequence alignment. Proteins 9:56-68
80.   Sander C, Schneider R. 1994. The HSSP database of protein structure-sequence alignments. Nucl. Acids Res. 22:3597-9
81.   Shindyalov IN, Kolchanov NA, Sander C. 1994. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Prot. Engin. 7:349-58
82.   Shortle D. 1995. Protein fold recognition. Nature Struct. Biol. 2:91-2
83.   Sipos L, von Heijne G. 1993. Predicting the topology of eukaryotic membrane proteins. Eur. J. Biochem. 213:1333-40
84.   Sippl MJ. 1993. Boltzmann's principle, knowledge based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comput. Aided Mol. Design 7:473-501
85.   Sippl MJ. 1993. Recognition of errors in three-dimensional structures of proteins. Proteins 17:355-62
86.   Sippl MJ. 1995. Knowledge-based potentials for proteins. Curr. Opin. Str. Biol. 5:229-35
87.   Sternberg MJE, Hegyi H, Islam SA, Luo J, Russell RB. 1995. Towards an intelligent system for the automatic assignment of domains in globular proteins. In Third International Conference on Intelligent Systems for Molecular Biology (ISMB), Cambridge, England, eds. C Rawlings, D Clark, R Altman, L Hunter, T Lengauer, et al, pp. 376-83. Menlo Park, CA: AAAI
88.   Subbiah S, Laurents DV, Levitt M. 1993. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr. Biol. 3:141-8
89.   Summers NL, Karplus M. 1990. Modeling of Globular Proteins. J. Mol. Biol. 216:991-1016
90.   Taylor WR, Hatrick K. 1994. Compensating changes in protein multiple sequence alignments. Prot. Engin. 7:341-8
91.   Taylor WR, Jones DT, Green NM. 1994. A Method for α-Helical Integral Membrane Protein Fold Prediction. Proteins 18:281-94
92.   Thompson JD, Higgins DG, Gibson TJ. 1994. Improved sensitivity of profile searches through the use of sequence weights and gap excision. CABIOS 10:19-29
93.   van Gunsteren WF. 1993. Molecular dynamics studies of proteins. Curr. Opin. Str. Biol. 3:167-74
94.   Vingron M, Waterman MS. 1994. Sequence alignment and penalty choice. J. Mol. Biol. 235:1-12
95.   von Heijne G. 1992. Membrane Protein Structure Prediction. J. Mol. Biol. 225:487-94
96.   Vriend G, Sander C. 1993. Quality of Protein Models: Directional Atomic Contact Analysis. J. Appl. Cryst. 26:47-60
97.   Wako H, Blundell TL. 1994. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins I. Solvent accessibility classes. J. Mol. Biol. 238:682-92
98.   Wodak SJ, Rooman MJ. 1993. Generating and testing protein folds. Curr. Opin. Str. Biol. 3:247-59
99.   Zhu Z-Y. 1995. A new approach to the evaluation of protein secondary structure predictions at the level of the elements of secondary structure. Prot. Engin. 8:103-8

Abbreviations used:

3D, three-dimensional; 1D, one-dimensional; BLAST, fast sequence alignment program; CPU, central processing unit, i.e., 'core' of computer; DSSP, database containing the secondary structure and solvent accessibility for proteins of known 3D structure; FASTA, fast sequence alignment program; MAXHOM, profile-based multiple sequence alignment program that also runs on parallel computers; PDB, Protein Data Bank of experimentally determined 3D structures of proteins; PHD, profile-based neural network prediction of secondary structure (PHDsec), solvent accessibility (PHDacc), and transmembrane helices (PHDhtm); SWISSPROT, database of protein sequences; U, protein sequence of unknown 3D structure (e.g. search sequence in alignment procedure); WHAT IF, molecular graphics package with modules for homology modelling, drug design and protein structure analysis.
