PP Help 08: Methods

Contents

Definition of family

MaxHom : multiple alignment
PSI-BLAST : iterated profile search
ProSite : functional motifs
SEG : composition-bias
ProDom : domain assignment
PredictNLS : nuclear localisation signal
PHDsec : secondary structure
PHDacc : solvent accessibility
Globe : globularity of proteins
PHDhtm : transmembrane helices
PROFsec : secondary structure
PROFacc : solvent accessibility
Coils : coiled-coil regions
Cyspred : cysteine bridges
ASP : structural switches
Topits : fold recognition by threading

Family = structural family, i.e. all proteins in family have similar structures
We search the specified sequence database (SWISS-PROT by default) with your input sequence. The MaxHom alignment returns all proteins P whose level of sequence similarity to your query protein Q is high enough to predict that P and Q have similar three-dimensional structures. The iterated PSI-BLAST search frequently finds more diverged proteins P2 that also have a structure similar to that of Q.

Functional homology
In general, much higher levels of sequence similarity are required to infer particular aspects of function. Thus, the members of a protein family may or may not share particular functional motifs.

PREDICTION METHODS

Multiple alignment (MaxHom)

The protein database (currently SWISS-PROT) is searched by a fast alignment program (currently BLASTP).

In sweep 1, sequences are aligned consecutively to the search sequence by a standard dynamic programming method. After each sequence has been added a profile is compiled, and used to align the next sequence.

In sweep 2, after all sequences with significant homology have been picked from the BLASTP output, the profile is recompiled, and the dynamic programming algorithm starts once again to align consecutively the sequences, this time using the conservation profile as derived after completion of sweep 1.
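The profile idea behind the two sweeps can be illustrated with a minimal Python sketch. It is not MaxHom's actual scoring scheme: it assumes the sequences are already aligned to equal length, ignores gap penalties and dynamic programming, and uses plain per-column residue frequencies. The function names `compile_profile` and `profile_score` are hypothetical.

```python
from collections import Counter


def compile_profile(aligned):
    """Return per-column residue frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        # Count residues in this column, skipping gap characters.
        counts = Counter(seq[col] for seq in aligned if seq[col] != "-")
        total = sum(counts.values())
        if total:
            profile.append({aa: n / total for aa, n in counts.items()})
        else:
            profile.append({})
    return profile


def profile_score(profile, candidate):
    """Sum the profile frequency of each candidate residue, column by column."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, candidate))


# Toy alignment (assumption: invented sequences, for illustration only).
aligned = ["MKVL", "MRVL", "MKIL"]
profile = compile_profile(aligned)
print(profile_score(profile, "MKVL"), profile_score(profile, "MRIL"))
```

A candidate that agrees with the conserved columns scores higher, which is why recompiling the profile after sweep 1 can change how the sequences align in sweep 2.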

Iterated profile-based search (PSI-BLAST ; more info)

PSI-BLAST is a fast, yet sensitive database search program.
We run the iterated PSI-BLAST on a subset of the BIG database, which contains SWISS-PROT + TrEMBL + PDB sequences. The number of iterations, the cut-off thresholds, and the details of which sequences from BIG are used have been optimised in our group.

Functional sequence motifs (ProSite; example for output; more info)

The following description is from the original ProSite site:
ProSite is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify to which known family of protein (if any) a new sequence belongs.
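As a rough illustration of how a ProSite-style pattern can be matched against a sequence, the sketch below translates the pattern syntax into a regular expression. It is a simplification that handles only the basic elements (`x`, `[...]`, `{...}`, repeat counts in parentheses, and the `<`/`>` anchors outside brackets); it does not cover ProSite profiles. The helper names and the toy sequence are assumptions; the example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic ProSite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [match.start() for match in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))            # N[^P][ST][^P]
print(find_motif("N-{P}-[ST]-{P}", "MNGSANPTNVTA"))  # [1, 8]
```

The lookahead group is used so that overlapping occurrences are reported as well.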

Low-complexity regions (SEG; example for output; more info)

The following description is from the original SEG documentation (JC Wootton & S Federhen, 1996, Meth Enzymology, 266, 554-571):
SEG divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions".
Locally-optimized low-complexity segments are produced at defined levels of stringency, based on formal definitions of local compositional complexity. The segment lengths and the number of segments per sequence are determined automatically by the algorithm.
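The notion of local compositional complexity can be sketched with a sliding-window Shannon entropy. This is a deliberately simplified stand-in, not the SEG algorithm itself: SEG uses a two-stage trigger and extension procedure with its own complexity measures. The window length and threshold below are illustrative assumptions, as are the function names and the toy sequence.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_positions(seq, window=12, threshold=2.2):
    """Return sorted positions covered by any window at or below the threshold."""
    flagged = set()
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            flagged.update(range(start, start + window))
    return sorted(flagged)


# Toy sequence (assumption): a poly-Q run followed by a diverse stretch.
seq = "QQQQQQQQQQQQ" + "ACDEFGHIKLMNPRST"
print(low_complexity_positions(seq))
```

Windows dominated by a single residue type have entropy near zero, while a window of twelve distinct residues reaches log2(12), about 3.58 bits.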

Domain assignment (ProDom; example for output; more info)

The following description is from the original ProDom site (which supplies a rather useful graphical interface to the ProDom database):
The ProDom protein domain database consists of an automatic compilation of homologous domains detected in the SWISS-PROT database by the DOMAINER algorithm (ELL Sonnhammer & D Kahn, Prot. Sci., 1994, 3, 482-492). It has been devised to assist with the analysis of the domain arrangement of proteins.
ProDom `domains' are inferred on the basis of conserved subsequences as found in various proteins. Such a conservation corresponds frequently, though not always, to genuine structural domains: therefore domain boundaries should be treated with caution. For some domain families experts have been asked to correct domain boundaries on the basis of both sequence and structural information. This expertise will complement the automated process and improve the quality of ProDom domain families.

Prediction of nuclear localisation signal (PredictNLS; example for output; more info)

PredictNLS finds experimentally known nuclear localisation signals present in your protein. The program produces an output if and only if a known NLS was found.
Note that the original version of the program at http://cubic.bioc.columbia.edu/predictNLS also allows you to obtain statistics for putative NLS motifs.

Secondary structure (PHDsec; more info)

Secondary structure is predicted by a system of neural networks with an expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993, 90, 7558-7562; Rost & Sander, JMB, 1993, 232, 584-599; and Rost & Sander, Proteins, 1994, 19, 55-72; evaluation of accuracy). Evaluated on the same data set, PHDsec is rated at ten percentage points higher three-state accuracy than methods using only single sequence information, and at more than six percentage points higher than, e.g., a method using alignment information based on statistics (Levin, Pascarella, Argos & Garnier, Prot. Engng., 6, 849-54, 1993).
PHDsec predictions have three main features:

improved accuracy through evolutionary information from multiple sequence alignments
improved beta-strand prediction through a balanced training procedure
more accurate prediction of secondary structure segments by using a multi-level system
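Three-state per-residue accuracy is the percentage of residues whose predicted state matches the observed state. The sketch below shows the calculation; the one-letter state codes (H for helix, E for strand, L for loop) and the toy strings are assumptions made for illustration, not the output format of PHDsec.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues predicted in the correct one of three states."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy example (assumption): 12 residues, 2 of them predicted wrongly.
observed = "HHHHEEEELLLL"
predicted = "HHHLEEELLLLL"
print(round(q3_accuracy(observed, predicted), 1))
```

Because the measure is per residue, it says nothing about whether whole segments are placed correctly, which is why segment-level quality is listed as a separate feature above.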

Solvent accessibility (PHDacc; more info)

Solvent accessibility is predicted by a neural network method rating at a correlation coefficient (correlation between experimentally observed and predicted relative solvent accessibility) of 0.54 cross-validated on a set of 238 globular proteins (Rost & Sander, Proteins, 1994, 20, 216-226; evaluation of accuracy). The output of the neural network codes for 10 states of relative accessibility. Expressed in units of the difference between prediction by homology modelling (best method) and prediction at random (worst method), PHDacc is some 26 percentage points superior to a comparable neural network using three output states (buried, intermediate, exposed) and using no information from multiple alignments.
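The correlation coefficient quoted above compares observed and predicted relative accessibility residue by residue. A minimal sketch of the standard Pearson correlation follows; the accessibility values are invented toy data, and nothing here reproduces the PHDacc evaluation itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


# Toy relative accessibilities in percent (assumption: invented values).
observed = [5, 40, 80, 10, 60]
predicted = [10, 35, 60, 20, 70]
print(round(pearson(observed, predicted), 2))
```

A value of 1 means perfect linear agreement, 0 means no linear relationship, and -1 means perfect inverse agreement.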

Globularity of proteins (GLOBE; more info)

Protein globularity is an additional result derived from the prediction of solvent accessibility. This method has not been published yet. For more information, you may have a look at the preliminary preprint.

Transmembrane helices (PHDhtm; example for output; more info)

Transmembrane helices in integral membrane proteins are predicted by a system of neural networks. A shortcoming of the network system is that the predicted helices are often too long; these are cut by an empirical filter. The final prediction (Rost et al., Protein Science, 1995, 4, 521-533; evaluation of accuracy) has an expected per-residue accuracy of about 95%. The number of false positives, i.e., transmembrane helices predicted in globular proteins, is about 2% (Rost et al. 1996).
The neural network prediction of transmembrane helices (PHDhtm) is refined by a dynamic programming-like algorithm. This method resulted in correct predictions of all transmembrane helices for 89% of the 131 proteins used in a cross-validation test; more than 98% of the transmembrane helices were correctly predicted. The output of this method is used to predict topology, i.e., the orientation of the N-term with respect to the membrane. The expected accuracy of the topology prediction is > 86%. Prediction accuracy is higher than average for eukaryotic proteins and lower than average for prokaryotes. PHDtopology is more accurate than all other methods tested on identical data sets (Rost, Casadio & Fariselli, 1996a and 1996b; evaluation of accuracy).

Secondary structure (PROFsec; more info)

Secondary structure is predicted by a system of neural networks rating at an expected average accuracy > 78% for the three states helix, strand and loop (Rost, 2000, unpublished). Evaluated on the same data set, PROFsec is rated at 6-8 percentage points higher three-state accuracy than PHDsec.

Solvent accessibility (PROFacc; more info)

Solvent accessibility is predicted by a system of neural networks rating at an expected average accuracy > 78% for the two states exposed and buried (Rost, 2000, unpublished). Evaluated on the same data set, PROFacc is rated at about five percentage points higher two-state accuracy than PHDacc.

Coiled-coil regions (COILS; example for output; more info)

The following description is from the original COILS site:
COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.

Cysteine bridges (CYSPRED; example for output; more info)

CYSPRED predicts whether the cysteine residues in your protein form disulfide bridges.
The following description is from the original CYSPRED publication:
A neural network-based predictor is trained to distinguish the bonding states of cysteine in proteins starting from the residue chain. Training is performed using 2452 cysteine-containing segments extracted from 641 non homologous proteins of well resolved 3D structure. After a cross-validation procedure efficiency of the prediction scores as high as 72% when the predictor is trained using protein single sequences. The addition of evolutionary information in the form of multiple sequence alignment and a jury of neural networks increase the prediction efficiency up to 81%. Assessment of the goodness of the prediction with a reliability index indicates that more than 60% of the predictions have an accuracy level greater than 90%. A comparison with a statistical method previously described and tested on the same data base shows that the neural network-based predictor is performing with the highest efficiency.

Structural switches (ASP; example for output; more info)

ASP identifies amino acid subsequences that are the most likely to switch between different types of secondary structure. The program was developed by MM Young, K Kirshenbaum, KA Dill and S Highsmith. ASP was designed to identify the location of conformational switches in proteins with known switches. It is NOT designed to predict whether a given sequence does or does not contain a switch. For best results, ASP should be used on sequences of length >150 amino acids with >10 sequence homologues in the SWISS-PROT data bank. ASP has been validated against a set of globular proteins and may not be generally applicable. Please see Young et al., Protein Science 8(9):1752-64. 1999. and Kirshenbaum et al., Protein Science 8(9):1806-1815. 1999. for details and for how best to interpret this output. We consider ASP to be experimental at this time, and would appreciate any feedback from our users.

Fold recognition by prediction-based threading (TOPITS; examples for: request and output; more info)

Remote homologues (0-25% sequence identity) are detected by a novel prediction-based threading method (Rost 1995a and 1995b). The principal idea is to detect similar motifs of secondary structure and accessibility between a sequence of unknown structure and a known fold. For the recognition of similarities between entire folds, the expected accuracy (first hit of alignment list correct) is about 60% (Rost, ISMB95 Proceedings, 1995, AAAI Press, 314-321). If the goal is to correctly detect even short homologous fragments, still about 30% of the first hits are correct (compared to an accuracy of 14% for simple sequence alignments: full paper). Hits with z-scores above 3.0 are more reliable (accuracy > 60%). (Note: a threading service based on similar principles is provided by Daniel Fischer, http://www.mbi.ucla.edu/people/frsvr/frsvr.html.)
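For readers unfamiliar with z-scores, the sketch below shows the generic calculation: each alignment score is expressed as the number of standard deviations it lies above the mean of all scores. How TOPITS itself normalises its scores is not described here, so the use of the population standard deviation, the function names, and the hit names are all assumptions.

```python
import statistics


def z_scores(scores):
    """Map each hit name to (score - mean) / population standard deviation."""
    values = list(scores.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        raise ValueError("z-scores are undefined when all scores are equal")
    return {name: (score - mean) / spread for name, score in scores.items()}


def reliable_hits(scores, cutoff=3.0):
    """Return names of hits whose z-score exceeds the cutoff."""
    return [name for name, z in z_scores(scores).items() if z > cutoff]


# Toy hit list (assumption): 19 background folds and one clear outlier.
hits = {"fold_%02d" % i: 10.0 for i in range(19)}
hits["fold_top"] = 50.0
print(reliable_hits(hits))
```

Note that with very few hits no score can reach a z-score of 3.0, so the cutoff is only meaningful against a reasonably large list of alternative folds.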
