http://www.cs.ucdavis.edu/~koehl/
Protein Structure Classification
2. Basic principles of protein structure
While all bio-molecules play an important part in life, there is something special about proteins, which are the products of the information contained in the genes. A perhaps surprising finding that has crystallized over the last few decades is that geometric reasoning plays a major role in our attempts to understand the activities of these molecules. In this section, the basic principles that govern the shapes of protein structures are briefly reviewed. More information on protein structures can be found in protein biochemistry textbooks, such as those of Schulz and Schirmer [6], Cantor and Schimmel [7], Branden and Tooze [8], and Creighton [9]. I also refer the reader to the excellent review of Taylor and collaborators [10].
2.1 Visualization
The need for visualizing bio-molecules is based on the early understanding that their shape determines their function. Early crystallographers who studied proteins could not rely (as is common nowadays) on computers and computer graphics programs for representation and analysis. They developed a large array of finely crafted physical models that allowed them to get a feeling for these molecules. These models, usually made out of painted wood, plastic, rubber and/or metal, were designed to highlight different properties of the molecule under study. In space-filling models, such as those of Corey-Pauling-Koltun (CPK) [11, 12], atoms are represented as spheres whose radii are the atoms' van der Waals radii. They provide a volumetric representation of the bio-molecules, and are useful for detecting cavities and pockets that are potential active sites. In skeletal models, chemical bonds are represented by rods, whose junctions define the positions of the atoms. These models were used, for example, by Kendrew and colleagues in their studies of myoglobin [13]. They are useful to chemists because they highlight the chemical reactivity of the bio-molecules and, consequently, their potential activity.
With the introduction of computer graphics to structural biology, the principles of these models have been translated into software, so that molecules can be visualized on the computer screen. Figure 1 shows examples of computer visualizations of myoglobin, including space-filling and skeletal representations. Many computer programs are now available that visualize bio-molecules. I only cite here MOLSCRIPT [14] and VMD [15], which have been used to generate most of the figures of this paper.
2.2 Protein Building Blocks
Proteins are heteropolymer chains of amino acids, often referred to as residues.
This term comes from chemistry and describes the material found at the bottom of a reaction tube once a protein has been cut into pieces in order to determine its composition. There are twenty naturally occurring amino acids that make up proteins. With the exception of proline, amino acids have a common structure, shown in figure 2A. Naturally occurring amino acids that are incorporated into proteins are, for the most part, the L stereoisomer. Substituents on the alpha carbon, i.e. side-chains, range in size from a single hydrogen atom to large aromatic rings, and can be charged or include only non-polar saturated hydrocarbons [16]; see table 1 and figure 2B.
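The range of side-chain properties described above can be made concrete with a small composition calculator. The following Python sketch is only illustrative: the five groups and their membership are an assumed convention (textbooks place histidine, cysteine, glycine and proline differently), and the names `SIDE_CHAIN_CLASS`, `CLASS_OF` and `side_chain_composition` are hypothetical, not taken from the paper or from table 1.

```python
from collections import Counter

# Illustrative grouping of the twenty standard amino acids (one-letter codes)
# by side-chain type. The group boundaries are an assumption of this sketch.
SIDE_CHAIN_CLASS = {
    "nonpolar": "GAVLIMP",
    "aromatic": "FWY",
    "polar": "STCNQ",
    "positive": "KRH",
    "negative": "DE",
}

# Invert the table: one-letter code -> class name.
CLASS_OF = {aa: name for name, letters in SIDE_CHAIN_CLASS.items() for aa in letters}


def side_chain_composition(sequence):
    """Return the fraction of residues that fall in each side-chain class."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(CLASS_OF)
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    counts = Counter(CLASS_OF[aa] for aa in sequence)
    return {name: counts[name] / len(sequence) for name in SIDE_CHAIN_CLASS}
```

Calling `side_chain_composition` on a one-letter sequence string returns a dictionary with one fraction per group; the fractions sum to one because every standard residue belongs to exactly one group in this table.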
2.3 Protein Structure Hierarchy
Condensation between the -NH3+ and the -COO- groups of two amino acids generates a peptide bond and results in the formation of a dipeptide. Protein chains correspond to an extension of this chemistry, resulting in long chains of many amino acids bonded together. The order in which amino acids appear defines the primary sequence or primary structure of the protein. In its native environment, the polypeptide chain adopts a unique three-dimensional shape, referred to as the tertiary or native structure of the protein [17]. The amino acid backbones are connected in sequence, forming the protein main-chain, which frequently adopts canonical local shapes or secondary structures, mostly α-helices and β-strands (see figure 3). The former is a right-handed helix with 3.6 amino acids per turn, while the latter is an approximately planar layout of the backbone. Helices often pack together to form a hydrophobic core, while β-strands pair together to form parallel or antiparallel β-sheets. Note that in addition to these two types of secondary structures, there is a wide variety of other commonly occurring sub-structures, referred to as super-secondary structure. More information on these sub-structures can be found in the work of Efimov [18-21].
2.4 Three types of proteins
Protein structures come in a large range of sizes and shapes. They can be divided into three major groups, corresponding to fibrous proteins, membrane proteins, and globular proteins.
Fibrous proteins are elongated molecules in which the secondary structure forms the dominant structure. They are insoluble, play a structural or supportive role in the body, and are also involved in movement (such as in muscle and ciliary proteins). Fibrous proteins often have regular repeating structures. Keratin, for example, which is found in hair and nails, is a helix of helices, and has a seven-residue repeating structure.
Silk, on the other hand, is composed only of β-sheets, with alternating layers of glycines and of alanines and serines. In collagen, the major protein component of connective tissue, every third residue is a glycine, and many of the others are prolines.
Membrane proteins are restricted to the phospholipid bilayer membrane that surrounds the cell and many of its organelles. These proteins cover a large range, from globular proteins anchored in the membrane by means of a tail, to proteins that are fully embedded in the membrane. Their function is usually to ensure transport through the membrane of molecules ranging from simple ions to nutrients. The structures of fully embedded membrane proteins can be classified into two major categories: the all-helical structures, such as bacteriorhodopsin, and the all-beta structures, such as porins (see figure 4). Note that as of October 2004, there are 158 structures of membrane proteins in the PDB, out of which 86 are unique (see http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html).
Globular proteins have a unique structure derived from a non-repetitive sequence. They range in size from one hundred to several hundred residues, and adopt a compact structure. In globular proteins, non-polar amino acids have a tendency to group together and form the core of the protein, while polar amino acids remain accessible to the solvent. In the tertiary structure, β-strands are usually paired in parallel or anti-parallel arrangements to form β-sheets. On average, the protein main-chain consists of about 25% of residues in α-helix formation and 25% of residues in β-strands, with the rest of the residues adopting less regular structural arrangements [22].
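The helix parameters quoted in sections 2.3 and 2.4 lend themselves to a quick numerical check. The sketch below places C-alpha atoms on an ideal right-handed helix with 3.6 residues per turn. Only that figure comes from the text; the rise of 1.5 angstroms per residue and the C-alpha radius of 2.3 angstroms are standard textbook values assumed here, and the function name `ideal_helix_ca` is hypothetical.

```python
import math

RESIDUES_PER_TURN = 3.6  # from the text
RISE_PER_RESIDUE = 1.5   # angstroms along the helix axis (assumed textbook value)
CA_RADIUS = 2.3          # angstroms, approximate radius of the C-alpha trace (assumed)


def ideal_helix_ca(n_residues):
    """Return (x, y, z) C-alpha coordinates along an ideal right-handed helix."""
    step = 2.0 * math.pi / RESIDUES_PER_TURN  # rotation per residue, in radians
    return [
        (
            CA_RADIUS * math.cos(i * step),
            CA_RADIUS * math.sin(i * step),
            i * RISE_PER_RESIDUE,
        )
        for i in range(n_residues)
    ]


# Rotation about the helix axis contributed by each residue, in degrees.
rotation_per_residue = 360.0 / RESIDUES_PER_TURN
# Distance travelled along the axis per full turn, in angstroms.
pitch = RESIDUES_PER_TURN * RISE_PER_RESIDUE
# Angular position of residue i+7 relative to residue i, after two full turns.
heptad_offset = 7 * rotation_per_residue - 2 * 360.0
```

With these values each residue rotates by about 100 degrees and the pitch is about 5.4 angstroms. Seven residues fall about 20 degrees short of two full turns, which is consistent with the seven-residue repeat mentioned for keratin: in a helix of helices, a slight supercoiling is generally understood to absorb that shortfall so that every seventh side-chain faces the partner helix.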
2.5 Geometry of globular proteins
From the seminal work of Anfinsen [23] we know that the sequence fully determines the three-dimensional structure of the protein, which itself defines its function. While the key to decoding the information contained in genes (the genetic code) was found more than fifty years ago, we have not yet found the rules that relate a protein sequence to its structure [24, 25]. Our knowledge of protein structure therefore comes from years of experimental studies, using either X-ray crystallography or NMR spectroscopy. The first protein structures to be solved were those of myoglobin and hemoglobin [13, 26]. Currently (October 2004), there are nearly 27,700 protein structures in the PDB database [3, 4] of bio-molecular structures; see http://www.rcsb.org. Note that this number overestimates the number of different structures available, as the PDB is redundant: it contains several copies of the same proteins, with minor mutations in the sequence and no changes in the structure. Table 2 lists the web addresses of protein structure databases and the resources available for analyzing these structures.
As there are only two main types of secondary structure (α and β), proteins can be divided into three main structural classes [27]: mainly α proteins [28], mainly β proteins [29-31], and mixed α-β proteins [32]. A fourth class includes proteins with little or no secondary structure at all, which are stabilized by metal ions and/or disulphide bridges. Significant effort has been put into automatically classifying protein structures into their main folding class: these efforts will be reviewed in the next section. In parallel, there has been significant work on predicting a protein's folding class from its sequence. More details can be found in [33-40].
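As a toy illustration of the four classes just listed, the sketch below assigns a class from a per-residue secondary-structure string. It is not one of the automatic classification methods reviewed in the next section. The three-letter alphabet (H for helix, E for strand, anything else for other), the 10% cutoff, and the function names are all assumptions made for this example.

```python
def secondary_structure_fractions(ss):
    """Return (helix fraction, strand fraction) of a per-residue string.

    Assumed alphabet: 'H' = helix, 'E' = strand, any other character = other.
    """
    if not ss:
        raise ValueError("empty secondary-structure string")
    return ss.count("H") / len(ss), ss.count("E") / len(ss)


def structural_class(ss, minor=0.10):
    """Assign one of the four classes named in the text.

    A secondary-structure type present in fewer than `minor` of the residues
    is treated as absent; this cutoff is illustrative only.
    """
    helix, strand = secondary_structure_fractions(ss)
    if helix < minor and strand < minor:
        return "little or no secondary structure"
    if strand < minor:
        return "mainly alpha"
    if helix < minor:
        return "mainly beta"
    return "mixed alpha-beta"
```

A content-based rule like this ignores how the helices and strands are arranged in space, which is exactly the information that the structural descriptions of each class rely on, so it should be read as a first-pass filter at best.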
The mainly α class, the smallest of all three major classes, is dominated by small proteins, many of which form a simple bundle of α helices packed together to form a hydrophobic core. A common motif is the four-helix bundle structure (see figure 5). The most studied α structure is the globin fold, which has been found in a large group of related proteins, including myoglobin and hemoglobin. This structure includes eight helices that wrap around the core to form a pocket where a heme group is bound [13].
The mainly β class contains the parallel and antiparallel β structures. In these, the β strands are usually arranged in two β sheets that pack against each other and form a distorted barrel structure. There are three major types of β barrels: the up-and-down barrels, the Greek key barrels [41], and the jelly roll barrels (see figure 6). Most of the known antiparallel β structures, including the immunoglobulins, have barrels that include at least one Greek key motif. The two other motifs are observed in proteins of quite diverse function, where functional diversity is obtained by differences in the loop regions that connect the β strands. β structures are often characterized by the number of β-sheets in the structure, and the number and direction of the strands in each sheet. This leads to a fairly rigid classification scheme [42], which is quite sensitive to the definition of hydrogen bonds and β-strands.
The α-β protein class is the largest of all three classes. It can be subdivided into proteins that have a mainly alternating arrangement of α helices and β strands along the sequence, and those that have more segregated secondary structures. The former class can itself be divided into two groups: one with a central core of often eight parallel β strands arranged together into a barrel surrounded by α helices, and a second group that comprises an open, twisted parallel or mixed β sheet, with α helices on both sides (see figure 7). A particularly striking example of an α-β barrel is seen in the eight-fold β-α barrel, (βα)8, which was originally found in the triose phosphate isomerase of chicken [43], and is consequently often referred to as the TIM-barrel (for a complete analysis, see [44-51]). Many of the proteins adopting a TIM barrel structure have completely different amino acid sequences and different functions. The open α/β-sheet structures vary considerably in size, number of β strands, and strand order.
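The alternating arrangement along the sequence that characterizes the eight-fold β-α barrel can be expressed as a simple pattern over a per-residue secondary-structure string. The Python sketch below assumes the same illustrative alphabet as before (E for strand, H for helix, any other character for loop); the pattern and the function name `has_eightfold_beta_alpha` are hypothetical. Eight consecutive strand-helix units are a necessary feature of the topology, not proof of a closed barrel, which can only be established from the three-dimensional structure.

```python
import re

# One beta-alpha unit: a strand, an optional loop, a helix, an optional loop.
# Any character other than E or H counts as loop in this sketch.
BETA_ALPHA_UNIT = r"E+[^EH]*H+[^EH]*"

# Eight such units in a row, anywhere in the string.
EIGHTFOLD = re.compile(rf"(?:{BETA_ALPHA_UNIT}){{8}}")


def has_eightfold_beta_alpha(ss):
    """Return True if the string contains eight consecutive strand-helix units."""
    return EIGHTFOLD.search(ss) is not None
```

Because the pattern only looks at the order of secondary-structure elements, it would also match a protein whose eight strands form an open sheet rather than a barrel; it is meant to show the sequential signature described above, nothing more.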