Abstract. Based on a principal component analysis of 47 published attempts to quantify hydrophobicity in terms of a single scale, we define a representation of the 20 amino acids as points in a 3-dimensional hydrophobicity space and display it by means of a minimal spanning tree. The dominant scale is found to be close to two scales derived from contact potentials.
The topological structure of the minimal spanning tree is shown in the following figure. The tree is labelled twice, by the standard three letter and one letter abbreviations for the amino acids.
Acknowledgment. The authors gratefully acknowledge partial support of this research by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung (FWF) under grant P11516-MAT.
The assignment of the amino acids to a quantitative hydrophobicity scale is a controversial problem. A review and evaluation of 46 different scales is given in the paper
J.L. Cornette, K.B. Cease, H. Margalit, J.L. Spouge, J.A.
Berzofsky and C. DeLisi, Hydrophobicity scales and computational techniques
for detecting amphipathic structures in proteins, J. Mol. Biol. 195 (1987),
659-685.
Different scales sometimes differ widely even in the order in which they rank
the amino acids. This suggests that different scales measure different
properties more or less directly related to hydrophobicity. The diversity of the
scales then finds a natural explanation in the fact that amino acids cannot
naturally be ordered in a linear way. However, a representation in
higher-dimensional space might be possible in such a way that close amino acids
have similar properties.
Restricting ourselves to those properties that are reflected in the known
hydrophobicity scales, we performed a principal component
analysis of 40 scales,
namely the 39 complete scales from the above survey (the 7 others are
incomplete) and another scale (of so-called `q-values') from
H. Li, C.
Tang and N. Wingreen, Nature of driving force for protein folding - a result
from analyzing the statistical potential, Phys. Rev. Lett. 79, 765 (1997).
Keeping only the three dominant principal components, we found the following
coordinates of a three-dimensional representation of the 20 amino acids.
It is not surprising that the first (dominant) coordinate x represents the
bulk of the information in the 40 scales (75.7 percent). It can be considered to
be most closely related to the amount of polarity or hydrophobicity, the common
concept that all scales are supposed to measure.
The following figure represents the three scales by grey level bars (positive
values are drawn dark, negative ones light) and in addition by marking the
levels with crosses. The top scale contains x, the best compromise for a linear
hydrophobicity scale. (The numbers are the positions of the amino acids in a
lexicographic ordering.)
A minimal spanning tree analysis
reveals that the appropriate nearest neighbor relation between amino acids is
not fully linear. (The missing third dimension is indicated by the size of the
markers; the fattest dots have a large positive missing coordinate.)
The deviation from a linear ordering can also be seen by looking at a display
of the distance matrix. Here the ordering has been chosen by appending the
branches of the topological tree at the closer ends of the `backbone' of the
tree. The distance is coded by grey values; dark entries correspond to close
pairs.
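For illustration, the distance matrix and a minimal spanning tree can be recomputed from the rounded coordinates listed on this page. The Python sketch below assumes Euclidean distance in (x, y, z) and uses Prim's algorithm; neither choice is stated in the text, and the names COORDS, distance_matrix and minimal_spanning_tree are made up for this example.

```python
import numpy as np

NAMES = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
         "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]

# Rounded (x, y, z) coordinates copied from the coordinate table on this page.
COORDS = np.array([
    [0.06, -0.25, 0.25], [0.80, 0.19, -0.41], [0.70, -0.06, 0.17],
    [0.97, -0.08, 0.08], [-0.56, -0.40, -0.14], [0.71, -0.02, 0.12],
    [0.85, -0.10, -0.05], [0.32, -0.32, 0.28], [0.15, -0.03, -0.10],
    [-1.00, -0.03, 0.10], [-0.83, 0.05, 0.01], [1.00, 0.32, 0.11],
    [-0.68, -0.01, 0.04], [-0.99, 0.18, 0.15], [0.45, 0.23, 0.41],
    [0.48, -0.15, 0.23], [0.38, -0.10, 0.29], [-0.57, 0.31, 0.34],
    [-0.35, 0.40, -0.02], [-0.75, -0.19, 0.03],
])


def distance_matrix(points):
    """Pairwise Euclidean distances (an assumed choice of metric)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def minimal_spanning_tree(dist):
    """Prim's algorithm; returns a list of edges (i, j, length)."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()             # shortest known distance to the tree
    parent = np.zeros(n, dtype=int)   # tree vertex realizing that distance
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        edges.append((int(parent[j]), j, float(dist[parent[j], j])))
        in_tree[j] = True
        update = (dist[j] < best) & ~in_tree
        best[update] = dist[j][update]
        parent[update] = j
    return edges


dist = distance_matrix(COORDS)
edges = minimal_spanning_tree(dist)

# Count the neighbours of every amino acid in the tree.
degree = np.zeros(len(NAMES), dtype=int)
for i, j, _ in edges:
    degree[i] += 1
    degree[j] += 1

for i, j, d in edges:
    print(f"{NAMES[i]} - {NAMES[j]}  {d:.2f}")
print("more than two neighbours:",
      [NAMES[k] for k in np.flatnonzero(degree > 2)])
```

In a strictly linear ordering no vertex of the tree would have more than two neighbours, so the last line gives a quick check of the branching.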
Finally we consider how closely a linear transformation of each of the original 47
scales can approximate the dominant scale x found above. We plotted each
scale, linearly transformed to the range [-1,1], against x. (The 7 incomplete
scales not used in the principal component analysis are marked with an
asterisk (*).)
As one easily sees from the plot, the scale that gives the best approximation
to the dominant x scale (with a correlation of 0.982) is scale 47, the scale by
Tang et al. The other scales 1-46 correspond to Cornette et al. according to
the following list. (Among these, the scale closest to x
is scale 33=MIJER by Miyazawa and Jernigan, whose contact potentials were also
used in a different way for the derivation of the Tang et al. scale.)
Note, however, that the correlation coefficient is a quite generous measure
of the closeness of two scales. In particular, whether one transforms the Tang et
al. scale (whose correlation coefficient 0.982 with x is best) linearly such
that either (i) its mean and standard deviation agree with those of x (see x' below) or
(ii) its extremal values are at -1 and 1 (see x'' below), the scales don't match
very closely:
Coordinates x, y, z of the 20 amino acids:
       ALA    ARG    ASN    ASP    CYS
  x    .06    .80    .70    .97   -.56
  y   -.25    .19   -.06   -.08   -.40
  z    .25   -.41    .17    .08   -.14
       GLN    GLU    GLY    HIS    ILE
  x    .71    .85    .32    .15  -1.00
  y   -.02   -.10   -.32   -.03   -.03
  z    .12   -.05    .28   -.10    .10
       LEU    LYS    MET    PHE    PRO
  x   -.83   1.00   -.68   -.99    .45
  y    .05    .32   -.01    .18    .23
  z    .01    .11    .04    .15    .41
       SER    THR    TRP    TYR    VAL
  x    .48    .38   -.57   -.35   -.75
  y   -.15   -.10    .31    .40   -.19
  z    .23    .29    .34   -.02    .03
The precise procedure was as follows: By a linear
transformation we normalized each scale to mean 0 and standard deviation 1, then
computed the singular value decomposition (also known as Karhunen-Loeve
transform) and kept only the contributions to the three dominant scales x, y,
and z. (The degree of explanation of the three scales, defined as the quotient
of the corresponding squared singular value and the sum of squares of all
singular values, was 75.7, 7.2, and 5.5 percent, accounting for all but 11.6
percent of the information in all 40 scales.)
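The procedure can be sketched in a few lines of NumPy. The data below are random stand-ins, not the 40 published scales, and the layout (amino acids as rows, scales as columns) is an assumption; the sign of each component returned by the SVD is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 20 amino acids (rows) x 40 scales (columns).
# These numbers are random and only mimic scales that share one common trend.
trend = rng.normal(size=(20, 1))
scales = trend @ rng.normal(size=(1, 40)) + 0.5 * rng.normal(size=(20, 40))

# Step 1: normalize every scale to mean 0 and standard deviation 1.
normalized = (scales - scales.mean(axis=0)) / scales.std(axis=0)

# Step 2: singular value decomposition of the normalized data matrix.
U, s, Vt = np.linalg.svd(normalized, full_matrices=False)

# Step 3: degree of explanation = squared singular value / sum of all squares.
explained = s ** 2 / np.sum(s ** 2)

# Step 4: keep the three dominant components as amino acid coordinates.
coords = U[:, :3] * s[:3]

print("explained (percent):", np.round(100 * explained[:3], 1))
print("unexplained (percent):", np.round(100 * (1 - explained[:3].sum()), 1))
```

With the real scales, the first printed line would correspond to the figures 75.7, 7.2 and 5.5 quoted above, and the second to the remaining 11.6 percent.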
By another linear
transformation we shifted the scales such that their range was symmetric about zero,
and multiplied all coordinates by the same constant such that x ranges between
-1 and 1. We then rounded the values to two decimal places. (The large
disagreement between the various scales implies that there is no point in
representing the scales with higher accuracy; probably only the first figure is
significant.)
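A minimal sketch of this rescaling step, under the assumption that "symmetric about zero" means subtracting the midpoint of the smallest and largest value of each coordinate. The function name rescale and the small demo matrix are hypothetical and serve only to make the arithmetic visible.

```python
import numpy as np


def rescale(coords):
    """Center each column on its mid-range, scale so column 0 spans [-1, 1]."""
    coords = np.asarray(coords, dtype=float)
    # Shift: afterwards the minimum and maximum of each column are opposite.
    midpoint = (coords.max(axis=0) + coords.min(axis=0)) / 2.0
    centered = coords - midpoint
    # One common factor for all columns, fixed by the first coordinate (x).
    factor = 1.0 / centered[:, 0].max()
    # Round to two decimal places, as in the text.
    return np.round(centered * factor, 2)


# Hypothetical raw coordinates for three points (columns: x, y, z).
demo = np.array([
    [2.0, 1.0, 0.3],
    [6.0, 0.0, 0.1],
    [3.0, 3.0, -0.5],
])
scaled = rescale(demo)
print(scaled)
```

Because all three coordinates share one factor, distances between points are only rescaled, not distorted.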
Thus we regard x as the
best compromise for a linear hydrophobicity scale. Polar amino acids have
x>>0, hydrophobic ones have x<<0. Since there is a large gap in the values of x
between -.35 (TYR) and .06 (ALA), at least the
classification into more or less polar amino acids and more or less hydrophobic
ones is unambiguous.
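The gap can be checked directly from the x coordinates listed on this page. The short Python sketch below sorts the values, finds the widest gap between neighbours, and splits the amino acids there; the variable names are chosen for this example only.

```python
# x coordinates copied from the coordinate table on this page.
X = {
    "ALA": 0.06, "ARG": 0.80, "ASN": 0.70, "ASP": 0.97, "CYS": -0.56,
    "GLN": 0.71, "GLU": 0.85, "GLY": 0.32, "HIS": 0.15, "ILE": -1.00,
    "LEU": -0.83, "LYS": 1.00, "MET": -0.68, "PHE": -0.99, "PRO": 0.45,
    "SER": 0.48, "THR": 0.38, "TRP": -0.57, "TYR": -0.35, "VAL": -0.75,
}

# Sort the amino acids by x and measure the gap between each pair of neighbours.
ordered = sorted(X.items(), key=lambda item: item[1])
gaps = [
    (ordered[k + 1][1] - ordered[k][1], ordered[k][0], ordered[k + 1][0])
    for k in range(len(ordered) - 1)
]

# The widest gap separates the two classes.
width, below, above = max(gaps)
hydrophobic = sorted(name for name, value in X.items() if value <= X[below])
polar = sorted(name for name, value in X.items() if value >= X[above])

print(f"widest gap: {width:.2f} between {below} and {above}")
print("hydrophobic side:", hydrophobic)
print("polar side:", polar)
```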
label  type  Cornette code   correlation with
                             x       y       z
  1    EXP   ZIMMR          0.60   -0.42    0.13
  2    EXP   N TAN          0.81   -0.84    0.10
  3    EXP   NTANR          0.74   -0.71    0.04
  4    EXP   JONES          0.69   -0.57   -0.10
  5    X/S   LEVIT          0.84   -0.12   -0.37
  6    X/S   HOPPW          0.87   -0.08   -0.30
  7    EXP   YUNGD          0.86   -0.28   -0.30
  8    EXP   FAUPL          0.94   -0.08   -0.20
  9    EXP   ZASLZ          0.56   -0.58   -0.18
 10    EXP   WOLF           0.72    0.40   -0.48
 11    EXP   KUNTZ          0.69    0.18   -0.22
 12    EXP   ABODR          0.92   -0.22   -0.15
 13    EXP   MEEK           0.66   -0.47   -0.35
 14    EXP   BULDG          0.81   -0.36    0.06
 15    AVE   EISEN          0.86    0.18   -0.43
 16    AVE   KYTDO          0.89    0.30   -0.12
 17    STA   CHOTH          0.86    0.42   -0.11
 18    STA   WERSC          0.92   -0.04    0.21
 19    STA   JANIN          0.83    0.39    0.09
 20    STA   OLSEN          0.82    0.40   -0.17
 21    STA   MEIRO          0.95   -0.03    0.10
 22    X/S   PONNU          0.93    0.17    0.24
 23    STA   NNEIG          0.92    0.21    0.15
 24    STA   ROBOS          0.88   -0.24    0.12
 25    STA   CHDLG          0.78    0.40   -0.36
 26    STA   WSDLG          0.88   -0.01    0.20
 27    STA   JADLG          0.83    0.46   -0.22
 28    STA   GUY            0.93    0.17   -0.04
 29    AVE   GUY M          0.970   0.08    0.10
 30    X/S   KRIDG          0.78   -0.44   -0.20
 31    X/S   KRIGK          0.91    0.15    0.08
 32    STA   NIOII          0.91    0.16    0.22
 33    STA   MIJER          0.973   0.02    0.10
 34    STA   ROSEF          0.96    0.18    0.09
 35    STA   SWEET          0.91   -0.30    0.05
 36    STA   SWEIG          0.91   -0.31    0.05
 37    X/S   REKKR          0.82   -0.42    0.05
 38    X/S   VHEBL          0.79    0.09   -0.52
 39    X/S   FROMM          0.79   -0.55   -0.03
 40    X/S   EIMCL          0.87   -0.21   -0.36
 41    STA   PRIFT          0.91    0.04    0.34
 42    STA   PRILS          0.91   -0.01    0.32
 43    STA   ALFT           0.89   -0.06    0.27
 44    STA   ALTLS          0.90   -0.03    0.29
 45    STA   TOTFT          0.93    0.01    0.31
 46    STA   TOTLS          0.92    0.00    0.32
 47          TANG ET AL.    0.982  -0.08    0.05
Plotting the correlations for the various scales reveals a marked
difference between experimental (EXP) and statistical (STA) scales. (Scales
marked X/S are based on a mixture of experiment and statistics, and scales
marked AVE are averages of other scales. The Tang et al. scale is marked STA.)
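One simple way to put numbers on this difference is to average the tabulated correlations within each group. The sketch below does this for the EXP and STA rows of the list on this page (with the Tang et al. scale counted as STA, as stated above); comparing group means is an assumption of this example and not necessarily the comparison the plot was meant to show.

```python
from statistics import mean

# (correlation with x, correlation with y), copied from the list on this page.
EXP = [
    (0.60, -0.42), (0.81, -0.84), (0.74, -0.71), (0.69, -0.57),
    (0.86, -0.28), (0.94, -0.08), (0.56, -0.58), (0.72, 0.40),
    (0.69, 0.18), (0.92, -0.22), (0.66, -0.47), (0.81, -0.36),
]
STA = [
    (0.86, 0.42), (0.92, -0.04), (0.83, 0.39), (0.82, 0.40),
    (0.95, -0.03), (0.92, 0.21), (0.88, -0.24), (0.78, 0.40),
    (0.88, -0.01), (0.83, 0.46), (0.93, 0.17), (0.91, 0.16),
    (0.973, 0.02), (0.96, 0.18), (0.91, -0.30), (0.91, -0.31),
    (0.91, 0.04), (0.91, -0.01), (0.89, -0.06), (0.90, -0.03),
    (0.93, 0.01), (0.92, 0.00),
    (0.982, -0.08),  # scale 47 (Tang et al.), marked STA in the text
]


def group_means(rows):
    """Mean correlation with x and with y over one group of scales."""
    return mean(r[0] for r in rows), mean(r[1] for r in rows)


exp_x, exp_y = group_means(EXP)
sta_x, sta_y = group_means(STA)
print(f"EXP: mean corr with x = {exp_x:.2f}, with y = {exp_y:.2f}")
print(f"STA: mean corr with x = {sta_x:.2f}, with y = {sta_y:.2f}")
```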
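Dominant scale x compared with the Tang et al. scale after transformation (i) (x') and (ii) (x''):
         x      x'     x''
ALA     0.06   0.16   0.25
ARG     0.80   0.59   0.65
ASN     0.70   0.58   0.64
ASP     0.97   0.79   0.84
CYS    -0.56  -0.42  -0.30
GLN     0.71   0.72   0.78
GLU     0.85   0.84   0.89
GLY     0.32   0.48   0.55
HIS     0.15   0.23   0.32
ILE    -1.00  -0.93  -0.79
LEU    -0.83  -1.16  -1.00
LYS     1.00   0.96   1.00
MET    -0.68  -0.68  -0.55
PHE    -0.99  -1.13  -0.98
PRO     0.45   0.45   0.52
SER     0.48   0.63   0.69
THR     0.38   0.44   0.51
TRP    -0.57  -0.55  -0.43
TYR    -0.35  -0.26  -0.15
VAL    -0.75  -0.62  -0.50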
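The two linear transformations and the correlation can be reproduced from the three columns of this comparison. The Python sketch below uses the rounded values as listed, so results agree with the text only up to rounding; the helper names match_moments and to_unit_range are made up for this example.

```python
import numpy as np

NAMES = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
         "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]

# Columns x, x' and x'' copied from the comparison on this page.
x = np.array([0.06, 0.80, 0.70, 0.97, -0.56, 0.71, 0.85, 0.32, 0.15, -1.00,
              -0.83, 1.00, -0.68, -0.99, 0.45, 0.48, 0.38, -0.57, -0.35, -0.75])
xp = np.array([0.16, 0.59, 0.58, 0.79, -0.42, 0.72, 0.84, 0.48, 0.23, -0.93,
               -1.16, 0.96, -0.68, -1.13, 0.45, 0.63, 0.44, -0.55, -0.26, -0.62])
xpp = np.array([0.25, 0.65, 0.64, 0.84, -0.30, 0.78, 0.89, 0.55, 0.32, -0.79,
                -1.00, 1.00, -0.55, -0.98, 0.52, 0.69, 0.51, -0.43, -0.15, -0.50])


def match_moments(scale, target):
    """Transformation (i): give `scale` the mean and std of `target`."""
    return (scale - scale.mean()) / scale.std() * target.std() + target.mean()


def to_unit_range(scale):
    """Transformation (ii): send the smallest value to -1, the largest to +1."""
    lo, hi = scale.min(), scale.max()
    return 2.0 * (scale - lo) / (hi - lo) - 1.0


# Correlation with x is the same for both versions, up to rounding.
r_p = np.corrcoef(x, xp)[0, 1]
r_pp = np.corrcoef(x, xpp)[0, 1]

# Pointwise agreement is much less impressive than the correlation suggests.
worst = int(np.argmax(np.abs(x - xp)))

# Applying the helpers to one column should approximately recover the other.
refit = match_moments(xpp, x)
rerange = to_unit_range(xp)

print(f"corr(x, x') = {r_p:.3f}, corr(x, x'') = {r_pp:.3f}")
print(f"largest |x - x'| = {abs(x[worst] - xp[worst]):.2f} at {NAMES[worst]}")
print(f"max |to_unit_range(x') - x''| = {np.max(np.abs(rerange - xpp)):.3f}")
```

Both transformations are linear with positive slope, so they cannot change the correlation with x; only the pointwise agreement differs between them.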
Molecular Modeling of Proteins
Arnold Neumaier (Arnold.Neumaier@univie.ac.at)