![]() |
http://www.cs.ucdavis.edu/~koehl/ |
|
Optimization1. Exploring Protein Sequence Space and/or Structure Space 1.1 A review of computational search techniques Simplified representations of protein structures such as lattice models and of protein sequences such as the Hydrophobic/Polar or HP model have been extensively used to study protein sequence design. We limit this brief review to off lattice studies, usually based on full atom representation of the protein structure. We focus on the techniques that have been used to solve the design problem. It should be mentioned that all these techniques have also been used for modeling protein sidechain conformation.
1.2 Systematic searchThe inverse folding problem is defined as finding sequences compatible with a given protein fold. Ponder & Richards [1] developed the first approach to solve this problem, under the assumption that residues in the interior of a protein determine its fold. They tested systematically combinations of side-chains fitting in the cores of small proteins, using criteria of steric overlaps, hydrogen bonding and packing density. The set of successful sequences thus selected defined the "tertiary template" of the structure. The number of residues that can be included in their combinatorial search is limited for computational reasons to less than 10. Consequently, systematic approaches have not been further developed for off-lattice protein design.
1.3 Monte Carlo methodsHellinga and Richards [2] applied a Monte-Carlo/simulated annealing algorithm to the sequence design problem. Their method minimizes a semi-empirical energy function simultaneously in sequence space and structure space. It takes as input the fixed three-dimensional coordinates of a protein backbone, and stochastically generates possible sequences through the introduction of random mutations. The corresponding coordinates are constructed by threading each sequence on the known backbone, and optimization in structure space is performed by random rotations around free torsional angles to generate a stochastic walk. The energies predicted by this method for sequences of a small group of residues in the hydrophobic core of the phage lambda cI repressor correlate well with experimentally determined biological activities [2]. Updated versions of this method are presently used for the design of specific metal binding sites in proteins [3]. Jiang et al [3] also implemented simulated annealing and Monte Carlo sampling to find sequence and rotamer combinations for potentially hyperstable proteins. Their program, CORE was used to design a hyperthermophilic protein. Torda and co-workers [4] have recently tested a biased Monte Carlo (BMC) approach to sequence design. BMC was found to be far more efficient than conventional Monte Carlo, especially on complex system. While this study was based on simple two-dimensional lattice model system, it is expected that BMC will be applied to off-lattice protein design soon.
1.4 The dead end elimination theoremMayo and co-workers [5,6] have developed another approach to protein design based on the dead end elimination (DEE) theorem. The technique of dead-end elimination (DEE) was originally developed as a tool for side-chain placement in homology modeling applications [7]. DEE aims at identifying the global minimum energy conformation (GMEC) of the protein / side-chain system, defined as the lowest possible energy state. It is based on a rotamer library of side-chain conformations. DEE allows the removal of certain rotamers from further consideration as the algorithm progresses, thereby limited the search space of amino acid rotamers. By definition, any state that is not the GMEC must have an energy value greater than or equal to the energy of the GMEC. If hµ and gµ are specific rotamer states of the residue µ, and {f} represents the conformations of all other side chains except µ, then:
where E(gµ) is the internal energy of rotamer state gµ and E(gµ,f'j) is the pairwise interaction energy between gµ and side chain j of {f'}. Otherwise stated, a rotamer gµ cannot be part of the GMEC if there exists another rotamer hµ that has a lower energy for every other possible configuration in the set .
Dahiyat and Mayo [6] were the first to apply this approach to the problem of sequence design. In their implementation, the amino acid identity was folded into the DEE definition of a rotamer, producing an entity containing both identity and configuration information. Through the elimination of dead-ending amino acid conformers, one could reach at the global minimum energy conformation/sequence for the energy function and backbone used in the calculation. Original implementations of DEE were slow, and limited to proteins of small size (usually smaller than 80 residues). Several algorithmic improvements have been published, mainly based on heuristic approximations of equation [8-10]. Recently, Looger and Hellinga [11] published a series of modifications to the pruning methods of the DEE, which allows the total design of proteins up to 300 residues. Sequence design based on the DEE theorem was successful in designing small and stable peptides [12-15]. It is interesting to note that DEE ignores the specificity issues.
The genetic algorithm (GA), is a computational equivalent of adaptive evolution which has become a very powerful technique for stochastic optimization applications. Variables are encoded as strings of number, resembling the strings of bases in DNA. A set of these variables with randomized values is called a genome, which are mutated through successive generations with selective pressure applied via a scoring function. Those genomes whose genetic information codes for a solution with a high fitness score are more likely to pass their information on to the next generation.
The design of novel proteins using a genetic algorithm was first implemented by Jones [16], who used the program EvolSeq to design the proteins Felix (a four helix bundle), "shpilka" (an 8-stranded β-sandwich protein), and "leather" (designed to bind an NAD cofactor). Although none of the designed sequences were ever synthesized, this represented the first attempt to design a complete protein sequence given only backbone coordinate information. Desjarlais and Handel uses a GA in their computer program "ROC" [17]. This program was initially applied to the problem of building stable variants of 434 Cro protein with different hydrophobic core residues [17].
The branch and bound procedure is best visualized as a search along the branches of a tree. In every node of the tree, each position in the protein of interest is associated with a set of rotamers (including the identity of the amino acid) that might be selected for this position. The root of the tree consists of one node containing all rotamer sets of the original problem. Moving one step down the tree means picking one position with several rotamers and generating a daughter node for each of them. At the other end of the tree, all leaf nodes contain sets with one rotamer each. Such nodes correspond to a fully specified rotamer selection. By searching through all leaf nodes one could in principle find the solution to the minimization problem. This simple scheme is however impractical, because of the combinatorial explosion. By ruling out that a particular branch contains an optimal solution, there is in fact no need to actually calculate all the nodes in that branch. This is usually done by deriving a lower bound Ebound on the energy of all the leaves in that branch. If Ebound is larger than the energy of any already known rotamer combination, the branch can be pruned and the search can be continued in another part of the tree.
Wodak and co-workers implemented this procedure in their program DESIGNER [18]. They applied their program to the re-design of the protein cores of c-Crk SH3 domain, the B1 domain of protein G and Ubiquitin. The best scoring sequences for the protein cores were found virtually identical to wild-type.
The goal of the inverse folding problem is to supply a list of sequences compatible with a known protein structure. If two-body interactions are taken into account in energy calculations, an exhaustive exploration of the energy landscape in sequence space cannot be achieved because of the huge number of possible combinations. All methods described above were chosen for their ability to circumvent this problem; most of them however still require long computing time, even for small problems. SCMF is a fast alternative solution that solves this latter problem and was initially applied to the problem of protein sequence design by Kono and Doi [19] and by Delarue and Koehl [20].
In this method multiple copies corresponding to every possible side-chain type are attached to each CA position in the protein. Each copy is given a weight stored in a sequence matrix SM. This matrix is initialized such that each amino-acid type has the same probability, for all residues in the protein (i.e. the system has no memory of the native sequence). Each residue is then considered in turn: the matrix row corresponding to residue i is updated, based on the mean field generated by the multiple side-chains at neighboring residues, and the procedure is repeated till convergence (i.e. self consistency) is reached. The refined matrix does not depend on the starting point; therefore the method succeeds in removing memory effects. This matrix is equivalent to the 3D-1D profile introduced by Eisenberg and co-workers [21]. Using a simple amino acid pair potential of mean force, Delarue and Koehl showed that this self consistent mean field approach retrieves significant wild type sequence information from the backbone of a protein [20]. It was also found to recognize structurally related proteins with low sequence identity, in cases in which the fold is well preserved, such as the globin fold. This success however was not found to be general. Limitations of SM may be related to the constraint of using a too precise structural template and/or to insufficiently discriminative potentials.
Saven and co-workers have worked extensively on the application of mean field theory to protein sequence design, initially based on lattice models [22, 23], and more recently on full-atom, off-lattice studies [24]. As described above, their method yields the likelihood of each amino acids at pre-selected position in a given protein structure. The energy function minimized by mean field includes VdW interactions and a potential for hydrogen bond formation derived from the AMBER force field [24]. It also contains a one-body environment potential. The latter contains no pair interaction terms and is dependent only upon the amino acid and rotamer states at each sequence position. They show how amino acid composition of the probabilistic sequence under design can be kept constant within the framework of mean field minimization. They have applied their method to calculate the identity probabilities of selected positions of the immunoglobulin light chain binding of protein L [24]. Their calculations compared favorably with the experimentally observed identity probabilities.
1. Ponder, JW and Richards, FM. Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. J. Mol. Biol., 193, 775-791 (1987). 2. Hellinga, HW and Richards, FM. Optimal sequence selection in proteins of known structure by simulated evolution. Proc. Natl. Acad. Sci. (USA), 91, 5803-5807 (1994). 3. Jiang, X, Farid, H, Pistor, E and Farid, R. A new approach to the design of uniquely folded thermally stable proteins. Protein Sci., 9, 403-416 (2000). 4. Cootes, A, Curmi, P and Torda, A. Biased Monte Carlo optimization of protein sequences. J. Chem. Phys., 113, 2489-2496 (2000). 5. Dahiyat, B and Mayo, S. Protein design automation. Protein Sci., 5, 895-903 (1996). 6. Dahiyat, BI and Mayo, SL. De-Novo Protein Design : Fully Automated Sequence Selection. Science, 278, 82-87 (1997). 7. Desmet, J, Maeyer, MD, Hazes, B and Lasters, I. The dead end elimination theorem and its use in protein side-chain positioning. Nature, 356, 539-542 (1992). 8. Goldstein, RF. Efficient rotamer elimination applied to protein side-chains and related spin glasses. Biophys. J., 66, 1335-1340 (1994). 9. Lasters, I, Maeyer, MD and Desmet, J. Enhanced dead-end elimination in the search for the global minimum conformation of a collection of protein side chains. Prot. Eng., 8, 815-822 (1995). 10. Gordon, DB and Mayo, SL. Radical Performance Enhancements For Combinatorial Optimization Algorithms Based On the Dead-End Elimination Theorem. Journal Of Computational Chemistry, 19, 1505-1514 (1998). 11. Looger, L and Hellinga, H. Generalized dead-end elimination algorithms make large-scale protein side-chain structure prediction tractable: Implications for protein design and structural genomics. J. Mol. Biol., 307, 429-445 (2001). 12. Strop, P and Mayo, SL. Rubredoxin variant folds without iron. Journal of the American Chemical Society, 121, 2341-2345 (1999). 13. Shimaoka, M, Shifman, JM, Jing, H, Takagi, L, Mayo, SL and Springer, TA. Computational design of an integrin I domain stabilized in the open high affinity conformation. Nature Structural Biology, 7, 674-678 (2000). 14. Sarisky, CA and Mayo, SL. The beta beta alpha fold: Explorations in sequence space. Journal of Molecular Biology, 307, 1411-1418 (2001). 15. Ross, SA, Sarisky, CA, Su, A and Mayo, SL. Designed protein G core variants fold to native-like structures: Sequence selection by ORBIT tolerates variation in backbone specification. Protein Science, 10, 450-454 (2001).
16. Jones, DT. De novo protein design using pairwise potentials and a genetic algorithm. Prot. Sci., 3, 567-574 (1994). 17. Desjarlais, J and Handel, T. De novo design of the hydrophobic cores of proteins. Protein Sci., 4, 2006-2018 (1995). 18. Wernisch, L, Hery, S and Wodak, S. Automatic protein design with all atom force field by exact and heuristic optimization. J. Mol. Biol., 301, 713-736 (2000). 19. Kono, H and Doi, J. Energy minimization method using automata network for sequence and side-chain conformation prediction from given backbone geometry. Proteins: Struct. Funct. Genet., 19, 244-255 (1994). 20. Delarue, M and Koehl, P. The inverse protein folding problem: self consistent mean field optimization of a structure specific mutation matrix. in Proceedings of the Pacific Symposium on Biocomputing, eds Altman, RB, Dunker, AK, Hunter, L and Klein, T. World Scientific, Singapore (1997). 21. Bowie, JU, Luthy, R and Eisenberg, D. A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164-170 (1991). 22. Saven, J and Wolynes, P. Statistical mechanics of the combinatorial synthesis and analysis of folding macromolecules. J. Phys. Chem. B, 101, 8375-8389 (1997). 23. Zou, J and Saven, J. Statistical theory of combinatorial libraries of folding proteins: energetic discrimination of a target structure. J. Mol. Biol., 296, 281-294 (2000). 24. Kono, H and Saven, J. Statistical theory for protein combinatorial libraries. Packing interactions, backbone flexibility, and the sequence variability of a main chain structure. J. Mol. Biol., 306, 607-628 (2001). |