Patrice Koehl (University of California, Davis) and Michael Levitt (Stanford University)
We describe one application of our sequence design procedure: defining the subset of sequence space compatible with a protein structure. For a complete list of applications, see [1-5].
Application: Defining the subset of sequence space compatible with a protein structure [5]
The sequences of naturally occurring proteins are shaped by evolutionary selective pressure, which is controlled by a fine balance of function, stability and kinetics. While most random mutations are unlikely to enhance stability or function, they can be accepted by natural selection as long as they are neutral (or nearly neutral). As a consequence, the sequence space compatible with a given protein fold is very large, although it is small compared to the full space a protein sequence can explore, whose size is 20^N, where N is the number of residues in the protein.
The 32,000 protein domains contained in the PDB as of March 2001 can be clustered into 564 different structural families of folds, and the sizes of these families vary greatly [6]. A large number of these folds have a single representative, whereas other folds, such as the TIM fold or the Ig fold, have hundreds of representatives in the PDB [7]. The question arises whether these differences are a consequence of variations in function, in stability, in evolution, or in all three. Our approach to answering this question is based on computational protein design: we propose to measure the size of the sequence space compatible with a protein structure.
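To give a sense of scale for the 20^N upper bound mentioned above, the short sketch below computes the order of magnitude of the full sequence space for a chain of a given length. The function name and the example length of 70 residues are illustrative choices, not values taken from the study.

```python
import math

NUM_AMINO_ACIDS = 20  # the 20 standard amino acids


def log10_sequence_space(n_residues: int) -> float:
    """Return log10 of 20^N, the number of possible sequences of length N."""
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return n_residues * math.log10(NUM_AMINO_ACIDS)


if __name__ == "__main__":
    n = 70  # illustrative chain length (assumption)
    print(f"20^{n} is roughly 10^{log10_sequence_space(n):.1f}")
```

Working in log space avoids printing very large integers and makes sequence spaces for proteins of different lengths easy to compare.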
The size of the sequence space trivially depends on the size (i.e. the number of residues) of the protein considered. The size of the protein, however, is not the main determinant of the sequence space. To assess the roles of other factors, we designed, in computer experiments, families of sequences for two proteins of similar size, 1ctf (68 residues) and 2hsp (71 residues). The protein 1ctf is a small, highly stable protein whose fold is nearly unique. In contrast, 2hsp adopts the SH3 fold, which is observed in many unrelated proteins in the PDB. The diversity of the families of sequences compatible with 1ctf and 2hsp is quantified by an entropy measure [12]. For each protein, the calculated design entropy is compared with the observed structural entropy, computed from the family of proteins in the PDB whose structures are similar to that of the protein of interest.
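The exact entropy definition used in [12] is not reproduced on this page. As a minimal sketch, assuming a standard Shannon entropy computed per position from amino acid frequencies and averaged over all positions, the diversity of a family of equal-length sequences could be estimated as follows. The function names and the toy sequences are hypothetical, and gaps or sequence weighting are not handled.

```python
import math
from collections import Counter
from typing import Sequence


def column_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (natural log) of the residue types at one position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def family_entropy(sequences: Sequence[str]) -> float:
    """Average per-position entropy over a family of aligned sequences.

    Assumes all sequences have the same length (one residue per position).
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    # zip(*sequences) iterates over alignment columns
    return sum(column_entropy(col) for col in zip(*sequences)) / length


if __name__ == "__main__":
    toy_family = ["ACDE", "ACDF", "ACEG"]  # hypothetical designed sequences
    print(f"average entropy per position: {family_entropy(toy_family):.3f}")
```

Under this assumed definition, a fully conserved family has zero entropy, and the maximum per-position value is ln 20 (about 3.0) when all 20 amino acids are equally frequent.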
From this experiment, we note that the sequence spaces compatible with two proteins of similar size can have significantly different sizes. It is also noteworthy that the calculated design entropy correlates well on average with the observed structural entropy. The procedure described above was repeated on 11 proteins. These proteins vary in size from 56 residues to 310 residues, and cover all four classes of protein folds. For each protein, three measures of the size of its sequence space were derived. The three measures of entropy are compared in the figure below.
For most proteins, the entropy derived from sequence information compares well with the entropy derived from structural information. Proteins 1tim and 1ede are two major exceptions. In the case of 1tim, for example, PSI-BLAST [8] identifies only triose phosphate isomerases as similar to the native sequence, whereas a large collection of sequences of proteins that adopt the TIM-barrel fold but are unrelated to triose phosphate isomerase is included in the computation of the structural entropy. In comparison, we do find a striking correlation, both qualitative and quantitative, between the entropy derived from our designed sequences (Sdes) and the entropy derived from naturally occurring sequences known to share the same fold (Sstr). The designed sequences are derived from knowledge of the 3D conformation of the protein backbone (its topology or geometry) and from the minimization of the free energy of the sequence for that protein. We therefore conclude that it is the topology of a protein, its length and its stability that predominantly define the size of the sequence space compatible with its structure.
References
1. Koehl P and Levitt M. De novo protein design. I. In search of stability and specificity. J. Mol. Biol., 293, 1161-1181 (1999).
2. Koehl P and Levitt M. De novo protein design. II. Plasticity in sequence space. J. Mol. Biol., 293, 1183-1193 (1999).
3. Koehl P and Levitt M. Structure-based conformational preferences of amino acids. Proc. Natl. Acad. Sci. (USA), 96, 12524-12529 (1999).
4. Koehl P and Levitt M. Improved recognition of native-like protein structures using a family of designed sequences. Proc. Natl. Acad. Sci. (USA), 99, 691-696 (2002).
5. Koehl P and Levitt M. Protein topology and stability define the space of allowed sequences. Proc. Natl. Acad. Sci. (USA), 99, 1280-1285 (2002).
6. Murzin AG, Brenner SE, Hubbard T and Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540 (1995).
7. Brenner SE, Chothia C and Hubbard TJP. Population statistics of protein structures: Lessons from structural classifications. Curr. Opin. Struct. Biol., 7, 369-376 (1997).
8. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W and Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402 (1997).