Accuracy of PHDacc
PHDacc: solvent accessibility prediction

****************************************************************************
*                                                                          *
*                                                                          *
*      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                   *
*      PredictProtein@EMBL-Heidelberg.DE                                   *
*      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                   *
*                                                                          *
*      Solvent accessibility prediction by PHD:                            *
*      a Profile fed neural network system from HeiDelberg                 *
*      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                 *
*                                                                          *
*      Authors:            Burkhard Rost & Chris Sander                    *
*                          EMBL, Heidelberg, FRG                           *
*                          Meyerhofstrasse 1, 69 117 Heidelberg            *
*                          Internet: Predict-Help@EMBL-Heidelberg.DE       *
*                                                                          *
*      All rights reserved.                                                *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*  About the network method                                                *
*  ~~~~~~~~~~~~~~~~~~~~~~~~                                                *
*                                                                          *
*  To be quoted for publications of PHDacc output:                         *
*     B Rost & C Sander: Conservation and prediction of solvent accessibi- *
*        lity in protein families. Proteins, 1994, 20, 216-226. (Abstract) *
*                                                                          *
*  The PredictProtein mail server is described in:                         *
*     B Rost:  PHD: predicting one-dimensional  protein structure by pro-  *
*        file based neural networks. Meth. in Enzym., 1996, 266, 525-539.  *
*        (Text)                                                            *
*                                                                          *
*  The network for prediction of secondary structure is described in       *
*  detail in:                                                              *
*     B Rost & C Sander: Prediction of protein structure at better than    *
*        70% accuracy. J. Mol. Biol., 1993, 232, 584-599. (Abstract)       *
*     B Rost & C Sander:  Combining evolutionary information and neural    *
*        networks to predict protein secondary struct. Proteins, 1994, 19, *
*        55-77. (Abstract)                                                 *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*  About the input to the network                                          *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                         *
*                                                                          *
*  The prediction is performed by a system of neural networks.             *
*  The input is a multiple sequence alignment. It is taken from an HSSP    *
*  file (produced by the program MaxHom:                                   *
*     Sander, Chris & Schneider, Reinhard: Database of Homology-Derived    *
*     Structures and the Structural Meaning of Sequence Alignment.         *
*     Proteins, Vol.9, 1991, pp. 56-68.                                    *
*                                                                          *
*  For optimal results the alignment should contain sequences with varying *
*  degrees of sequence similarity relative to the input protein.           *
*  The following is an ideal situation:                                    *
*                                                                          *
*  +-----------------+----------------------+                              *
*  |   sequence:     |  sequence identity   |                              *
*  +-----------------+----------------------+                              *
*  | target sequence |  100 %               |                              *
*  | aligned seq. 1  |   90 %               |                              *
*  | aligned seq. 2  |   80 %               |                              *
*  |      ...        |   ...                |                              *
*  | aligned seq. 7  |   30 %               |                              *
*  +-----------------+----------------------+                              *
*                                                                          *
****************************************************************************
*                                                                          *
*  Definition of accessibility                                             *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~                                             *
*                                                                          *
*  For training the residue solvent accessibility the DSSP (Dictionary of  *
*  Secondary Structure of Proteins; Kabsch & Sander (1983) Biopolymers, 22,*
*  2577-2637) values of accessible surface area have been used.  The       *
*  prediction provides values for the relative solvent accessibility.  The *
*  normalisation is the following:                                         *
*                                                                          *
*                             ACCESSIBILITY (from DSSP in Angstrom)        *
*  RELATIVE_ACCESSIBILITY =   ------------------------------------- * 100  *
*                                 MAXIMAL_ACC (amino acid type i)          *
*                                                                          *
*  where MAXIMAL_ACC (i) is the maximal accessibility of amino acid type i.*
*  The maximal values are:                                                 *
*                                                                          *
*  +----+----+----+----+----+----+----+----+----+----+----+----+           *
*  |  A |  B |  C |  D |  E |  F |  G |  H |  I |  K |  L |  M |           *
*  | 106| 160| 135| 163| 194| 197|  84| 184| 169| 205| 164| 188|           *
*  +----+----+----+----+----+----+----+----+----+----+----+----+           *
*  |  N |  P |  Q |  R |  S |  T |  V |  W |  X |  Y |  Z |                *
*  | 157| 136| 198| 248| 130| 142| 142| 227| 180| 222| 196|                *
*  +----+----+----+----+----+----+----+----+----+----+----+                *
*                                                                          *
*  Notation: one letter code for amino acid, B stands for D or N; Z stands *
*     for E or Q; and X stands for undetermined.                           *
*                                                                          *
*  The relative solvent accessibility can be used to estimate the number   *
*  of water molecules (W) in contact with the residue:                     *
*                                                                          *
*  W = ACCESSIBILITY /10                                                   *
*                                                                          *
*  The prediction is given in 10 states for relative accessibility, with   *
*                                                                          *
*  RELATIVE_ACCESSIBILITY = (PREDICTED_ACC * PREDICTED_ACC)                *
*                                                                          *
*  where PREDICTED_ACC = 0 - 9.                                            *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*                                                                          *
*  Estimated Accuracy of Prediction                                        *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                       *
*                                                                          *
*  A careful cross validation test on some 238 protein chains (in total    *
*  about 62,000 residues) with less than 25% pairwise sequence identity    *
*  gave the following results:                                             *
*                                                                          *
*                                                                          *
*  Correlation                                                             *
*  ...........                                                             *
*                                                                          *
*  The correlation between observed and predicted solvent accessibility    *
*  is:                                                                     *
*                                                                          *
*  -----------                                                             *
*  corr = 0.53                                                             *
*  -----------                                                             *
*                                                                          *
*  This value ought to be compared to the worst and best case prediction   *
*  scenario: random prediction (corr = 0.0) and homology modelling         *
*  (corr = 0.66).  (Note: homology modelling yields a relative accurate    *
*  prediction in 3D if, and only if, a significantly identical sequence    *
*  has a known 3D structure.)                                              *
*                                                                          *
*                                                                          *
*  3-state accuracy                                                        *
*  ................                                                        *
*                                                                          *
*  Often the relative accessibility is projected onto, e.g., 3 states:     *
*     b  = buried       (here defined as < 9% relative accessibility),     *
*     i  = intermediate ( 9% <= rel. acc. < 36% ),                         *
*     e  = exposed      ( rel. acc. >= 36% ).                              *
*                                                                          *
*  A projection onto 3 states or 2 states (buried/exposed) enables the     *
*  compilation of a 3- and 2-state prediction accuracy.  PHD reaches an    *
*  overall 3-state accuracy of:                                            *
*     Q3 = 57.5%                                                           *
*  (compared to 35% for random prediction and 70% for homology modelling). *
*                                                                          *
*  In detail:                                                              *
*                                                                          *
*  +-----------------------------------+-------------------------+         *
*  | Qburied       (% of observed)=77% | Qb (% of predicted)=60% |         *
*  | Qintermediate (% of observed)= 9% | Qi (% of predicted)=44% |         *
*  | Qexposed      (% of observed)=78% | Qe (% of predicted)=56% |         *
*  +-----------------------------------+-------------------------+         *
*                                                                          *
*                                                                          *
*  10-state accuracy                                                       *
*  .................                                                       *
*                                                                          *
*  The network predicts relative solvent accessibility in 10 states, with  *
*  state i (i = 0-9) corresponding to a relative solvent accessibility of  *
*  i*i %.  The 10-state accuracy of the network is:                        *
*                                                                          *
*     Q10 = 24.5%                                                          *
*                                                                          *
*..........................................................................*
*                                                                          *
*  These percentages are defined by:                                       *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                       *
*                                                                          *
*                       number of correctly predicted residues             *
*  Q3 		      = ---------------------------------------      (*100) *
*                             number of all residues                       *
*                                                                          *
*                       no of res. correctly predicted to be buried        *
*  Qburied (% of obs) = ------------------------------------------- (*100) *
*                       no of all res. observed to be buried               *
*                                                                          *
*                                                                          *
*                       no of res. correctly predicted to be buried        *
*  Qburied (% of pred)= ------------------------------------------- (*100) *
*                       no of all residues predicted to be buried          *
*                                                                          *
*..........................................................................*
*                                                                          *
*  Averaging over single chains                                            *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                            *
*                                                                          *
*  The most reasonable way to compute the overall accuracies is the above  *
*  quoted percentage of correctly predicted residues.  However, since the  *
*  user is mainly interested in the expected performance of the prediction *
*  for a particular protein, the mean value when averaging over protein    *
*  chains might be of help as well.  Computing first the correlation       *
*  between observed and predicted accessibility for each protein chan, and *
*  then averaging over all 238 chains yields the following average:        *
*                                                                          *
*  +-------------------------------====--+                                 *
*  | corr/averaged over chains   = 0.53  |                                 *
*  +-------------------------------====--+                                 *
*  | standard deviation          = 0.11  |                                 *
*  +-------------------------------------+                                 *
*                                                                          *
*..........................................................................*
*                                                                          *
*  Further details of performance accuracy                                 *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                 *
*                                                                          *
*  The accuracy matrix in detail:                                          *
*  ..............................                                          *
*                                                                          *
* -------+----------------------------------------------------+----------- *
*  \ PHD |    0    1   2   3    4    5     6     7    8    9  |  SUM  %obs *
* -------+----------------------------------------------------+----------- *
* OBS  0 | 8611  140   8  44   82  169   772   334   27    0  | 10187 16.6 *
* OBS  1 | 4367  164   0  50  106  231   738   346   44    3  |  6049  9.8 *
* OBS  2 | 3194  168   1  68  125  303   951   513   42    7  |  5372  8.7 *
* OBS  3 | 2760  159   8  80  136  327  1246   746   58   19  |  5539  9.0 *
* OBS  4 | 2312  144   2  72  166  396  1615  1245  124   19  |  6095  9.9 *
* OBS  5 | 1873   96   3  84  138  425  1979  1834  187   27  |  6646 10.8 *
* OBS  6 | 1387   67   1  60   80  278  2237  2627  231   51  |  7019 11.4 *
* OBS  7 | 1082   35   0  32   56  225  1871  3107  302   60  |  6770 11.0 *
* OBS  8 |  660   25   0  27   43  136  1206  2374  325   87  |  4883  7.9 *
* OBS  9 |  325   20   2  27   29   74   648  1159  366  214  |  2864  4.7 *
* -------+----------------------------------------------------+----------- *
* SUM    |26571 1018  25 544  961 2564 13263 14285 1706  487  |            *
* %pred  | 43.3  1.7 0.0 0.9  1.6  4.2  21.6  23.3  2.8  0.8  |            *
* -------+----------------------------------------------------+----------- *
*                                                                          *
*  Note: This table is to be read in the following manner:                 *
*        8611 of all residues predicted to be in exposed by 0%, were       *
*        observed with 0% relative accessibility.  However, 325 of all     *
*        residues predicted to have 0% are observed as completely exposed  *
*        (obs = 9 -> rel. acc. >= 81%).  The term "observed" refers to the *
*        DSSP compilation of area of solvent accessibility calculated from *
*        3D coordinates of experimentally determined structures (Diction-  *
*        ary of Secondary Structure  of Proteins: Kabsch & Sander (1983)   *
*        Biopolymers, 22, 2577-2637).                                      *
*                                                                          *
*                                                                          *
*  Accuracy for each amino acid:                                           *
*  .............................                                           *
*                                                                          *
*  +---+------------------------------+-----+-------+------+               *
*  |AA |   Q3 b%o b%p i%o i%p e%o e%p | Q10 |  corr |    N |               *
*  +---+------------------------------+-----+-------+------+               *
*  | A | 59.0  87  60   2  38  66  57 |  31 | 0.530 | 5054 |               *
*  | C | 62.0  91  67   5  39  25  21 |  34 | 0.244 |  893 |               *
*  | D | 56.5  21  45   6  49  94  57 |  20 | 0.321 | 3536 |               *
*  | E | 60.8   9  40   3  41  98  61 |  21 | 0.347 | 3743 |               *
*  | F | 63.3  94  67   9  46  29  37 |  27 | 0.366 | 2436 |               *
*  | G | 52.1  75  51   1  31  67  53 |  22 | 0.405 | 4787 |               *
*  | H | 50.9  63  53  23  45  71  50 |  18 | 0.442 | 1366 |               *
*  | I | 64.9  95  68   6  41  30  38 |  34 | 0.360 | 3437 |               *
*  | K | 66.6   2  11   2  37  98  67 |  23 | 0.267 | 3652 |               *
*  | L | 61.6  93  65   8  44  31  40 |  31 | 0.368 | 5016 |               *
*  | M | 60.1  92  64   5  39  45  44 |  29 | 0.452 | 1371 |               *
*  | N | 55.5  45  45   8  38  87  59 |  17 | 0.410 | 2923 |               *
*  | P | 53.0  48  48   9  39  83  56 |  18 | 0.364 | 2920 |               *
*  | Q | 54.3  27  44   7  44  92  56 |  20 | 0.344 | 2225 |               *
*  | R | 49.9  15  47  36  47  76  51 |  18 | 0.372 | 2765 |               *
*  | S | 55.6  69  53   3  51  81  56 |  22 | 0.464 | 3981 |               *
*  | T | 51.8  61  51   8  38  78  53 |  21 | 0.432 | 3740 |               *
*  | V | 61.1  93  65   5  40  39  42 |  34 | 0.418 | 4156 |               *
*  | W | 56.2  85  62  20  49  29  27 |  21 | 0.318 |  891 |               *
*  | Y | 49.7  73  52  33  49  36  38 |  19 | 0.359 | 2301 |               *
*  +---+------------------------------+-----+-------+------+               *
*                                                                          *
*  Abbreviations:                                                          *
*                                                                          *
*  AA:   amino acid in one-letter code                                     *
*  b%o, i%o, e%o:   = Qburied, Qintermediate, Qexposed (% of observed),    *
*        i.e. percentage of correct prediction in each state, see above    *
*  b%p, i%p, e%p:   = Qburied, Qintermediate, Qexposed (% of predicted),   *
*        i.e. probability of correct prediction in each state, see above   *
*  b%o:  = Qburied (% of observed), see above                              *
*  Q10:  percentage of correctly predicted residues in each of the 10      *
*        states of predicted relative accessibility.                       *
*  corr: correlation between predicted and observed rel. acc.              *
*  N:    number of residues in data set                                    *
*                                                                          *
*                                                                          *
*  Accuracy for different secondary structure:                             *
*  ...........................................                             *
*                                                                          *
*  +--------+------------------------------+----+-------+-------+          *
*  | type   |   Q3 b%o b%p i%o i%p e%o e%p |Q10 |  corr |     N |          *
*  +--------+------------------------------+----+-------+-------+          *
*  | helix  | 59.5  79  64   8  44  80  56 | 27 | 0.574 | 20100 |          *
*  | strand | 61.3  84  73   9  46  69  37 | 35 | 0.524 | 13356 |          *
*  | loop   | 54.4  64  43  11  44  78  61 | 18 | 0.442 | 27968 |          *
*  +--------+------------------------------+----+-------+-------+          *
*                                                                          *
*  Abbreviations as before.                                                *
*                                                                          *
*                                                                          *
****************************************************************************
*                                                                          *
*                                                                          *
*  Position-specific reliability index                                     *
*  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                     *
*                                                                          *
*  The network predicts the 10 states for relative accessibility using real*
*  numbers from the output units. The prediction is assigned by choosing   *
*  the maximal unit ("winner takes all").  However, the real numbers       *
*  contain additional information.                                         *
*  E.g. the difference between the maximal and the second largest output   *
*  unit (with the constraint that the second largest output is compiled    *
*  among all units at least 2 positions off the maximal unit) can be used  *
*  to derive a "reliability index".  This index is given for each residue  *
*  along with the prediction.  The index is scaled to have values between  *
*  0 (lowest reliability), and 9 (highest).                                *
*  The accuracies (Q3, corr, asf.) to be expected for residues with values *
*  above a particular value of the index are given below as well as the    *
*  fraction of such residues (%res).:                                      *
*                                                                          *
*  +---+------------------------------+----+-------+-------+               *
*  |RI |   Q3 b%o b%p i%o i%p e%o e%p |Q10 |  corr |  %res |               *
*  +---+------------------------------+----+-------+-------+               *
*  | 0 | 57.5  77  60   9  44  78  56 | 24 | 0.535 | 100.0 |               *
*  | 1 | 59.1  76  63   9  45  82  57 | 25 | 0.560 |  91.2 |               *
*  | 2 | 61.7  79  66   4  47  87  58 | 27 | 0.594 |  77.1 |               *
*  | 3 | 66.6  87  70   1  51  89  63 | 30 | 0.650 |  57.1 |               *
*  | 4 | 70.0  89  72   0  83  91  67 | 32 | 0.686 |  45.8 |               *
*  | 5 | 72.9  92  75   0   0  93  70 | 34 | 0.722 |  35.6 |               *
*  | 6 | 76.3  95  77   0   0  93  75 | 36 | 0.769 |  24.7 |               *
*  | 7 | 79.0  97  79   0   0  93  78 | 39 | 0.803 |  16.0 |               *
*  | 8 | 80.9  98  80   0   0  91  81 | 43 | 0.824 |   9.6 |               *
*  | 9 | 81.2  99  80   0   0  88  83 | 45 | 0.828 |   5.9 |               *
*  +---+------------------------------+----+-------+-------+               *
*                                                                          *
*  Abbreviations as before.                                                *
*                                                                          *
*  The above table gives the cumulative results, e.g. 45.8% of all         *
*  residues have a reliability of at least 4.  The correlation for this    *
*  most reliably predicted half of the residues is 0.686, i.e. a value     *
*  comparable to what could be expected if homology modelling were         *
*  possible.  For this subset of 45.8% of all residues, 89% of the buried  *
*  residues are correctly predicted, and 72% of all residues predicted to  *
*  be buried are correct.                                                  *
*                                                                          *
*..........................................................................*
*                                                                          *
*  The following table gives the non-cumulative quantities, i.e. the       *
*  values per reliability index range.  These numbers answer the question: *
*  how reliable is the prediction for all residues labeled with the        *
*  particular index i.                                                     *
*                                                                          *
*  +---+------------------------------+----+-------+-------+               *
*  |RI |   Q3 b%o b%p i%o i%p e%o e%p |Q10 |  corr |  %res |               *
*  +---+------------------------------+----+-------+-------+               *
*  | 0 | 40.9  79  40  16  41  21  40 | 14 | 0.175 |   8.8 |               *
*  | 1 | 45.4  61  46  28  44  48  44 | 17 | 0.278 |  14.1 |               *
*  | 2 | 47.4  53  52  10  46  80  44 | 19 | 0.343 |  19.9 |               *
*  | 3 | 52.9  75  59   4  50  77  47 | 23 | 0.439 |  11.4 |               *
*  | 4 | 60.0  81  63   0  83  84  56 | 25 | 0.547 |  10.1 |               *
*  | 5 | 65.2  82  70   0   0  93  62 | 28 | 0.607 |  10.9 |               *
*  | 6 | 71.3  90  72   0   0  94  70 | 31 | 0.692 |   8.8 |               *
*  | 7 | 76.0  94  76   0   0  95  75 | 34 | 0.762 |   6.3 |               *
*  | 8 | 80.5  97  81   0   0  94  79 | 39 | 0.808 |   3.8 |               *
*  | 9 | 81.2  99  80   0   0  88  83 | 45 | 0.828 |   5.9 |               *
*  +---+------------------------------+----+-------+-------+               *
*                                                                          *
*  For example, for residues with RI = 4 83% of all predicted intermediate *
*  residues are correctly predicted as such.                               *
*                                                                          *
*                                                                          *
****************************************************************************