http://www.cs.ucdavis.edu/~koehl/


Protein Structure Classification


2. Basic principles of protein structure


While all bio-molecules play an important part in life, there is something special about proteins, which are the products of the information contained in the genes. A perhaps surprising finding that crystallized over the last few decades is that geometric reasoning plays a major role in our attempt to understand the activities of these molecules. In this section, the basic principles that govern the shapes of protein structures are briefly reviewed. More information on protein structures can be found in protein biochemistry textbooks, such as those of Schulz and Schirmer [6], Cantor and Schimmel [7], Branden and Tooze [8], and Creighton [9]. I also refer the reader to the excellent review of Taylor and collaborators [10].


2.1 Visualization

Figure 1: Visualizing protein structures. Myoglobin is a small protein very common in muscle cells, where it stores oxygen. Its structure was determined by X-ray crystallography as early as 1960 by John Kendrew and his collaborators [13]. It was in fact the first protein structure available. Here I show the structure of sperm whale myoglobin using three different types of visualization. For simplicity, I do not show the heme. The coordinates are taken from the PDB file 1mbd. (A) Cartoon. This representation provides a high-level view of the local organization of the protein in secondary structures, shown as idealized helices. (B) Skeletal model. This representation uses lines to represent bonds; atoms are located at the endpoints where the lines meet. It emphasizes the chemical nature of the molecule. (C) Space-filling diagram. Atoms are represented as balls centered at the atoms, with radii equal to the van der Waals radii of the atoms. This representation shows the tight packing of the protein structure. The three representations complement one another. Figure drawn using MOLSCRIPT [14].

The need for visualizing bio-molecules is based on the early understanding that their shape determines their function. Early crystallographers who studied proteins could not rely (as is common nowadays) on computers and computer graphics programs for representation and analysis. They developed a large array of finely crafted physical models that gave them a feeling for these molecules. These models, usually made out of painted wood, plastic, rubber and/or metal, were designed to highlight different properties of the molecule under study. In the space-filling models, such as those of Corey-Pauling-Koltun (CPK) [11, 12], atoms are represented as spheres whose radii are the atoms' van der Waals radii. They provide a volumetric representation of the bio-molecules, and are useful to detect cavities and pockets that are potential active sites. In the skeletal models, chemical bonds are represented by rods, whose junctions define the positions of the atoms. These models were used for example by Kendrew and colleagues in their studies of myoglobin [13]. They are useful to chemists because they highlight the chemical reactivity of the bio-molecules and, consequently, their potential activity. With the introduction of computer graphics to structural biology, the principles of these models have been translated into software, so that molecules can be visualized on the computer screen. Figure 1 shows examples of computer visualizations of myoglobin, including space-filling and skeletal representations. Many computer programs are now available that visualize bio-molecules. I only cite here MOLSCRIPT [14] and VMD [15], which have been used to generate most of the figures of this paper.
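All of the representations above start from the same input: the atomic coordinates stored in a PDB file. As a minimal sketch, assuming the standard fixed-width layout of ATOM records in the PDB file format, the following Python code extracts the fields that a viewer would need. The names Atom, parse_atom_line and read_atoms are illustrative only and are not taken from MOLSCRIPT or VMD.

```python
from typing import Iterable, List, NamedTuple


class Atom(NamedTuple):
    name: str       # atom name, e.g. "CA"
    res_name: str   # three-letter residue name, e.g. "VAL"
    chain: str      # chain identifier
    res_seq: int    # residue sequence number
    x: float        # orthogonal coordinates, in angstroms
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM record (column numbers below are 1-based)."""
    return Atom(
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def read_atoms(lines: Iterable[str]) -> List[Atom]:
    """Keep only ATOM records; HETATM records (such as a heme) are skipped."""
    return [parse_atom_line(line) for line in lines if line.startswith("ATOM")]
```

With a local copy of the coordinates (the file name 1mbd.pdb is an assumption), read_atoms(open("1mbd.pdb")) would return the protein atoms; a space-filling view then draws one sphere per atom, and a skeletal view draws rods between bonded atoms.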


2.2 Protein Building Blocks

Proteins are heteropolymer chains of amino acids, often referred to as residues. This term comes from chemistry and describes the material found at the bottom of a reaction tube once a protein has been cut into pieces in order to determine its composition. There are twenty naturally occurring amino acids that make up proteins. With the exception of proline, amino acids have a common structure, shown in figure 2A. Naturally occurring amino acids that are incorporated into proteins are, for the most part, the L stereoisomer. Substituents on the alpha carbon, i.e. side-chains, range in size from a single hydrogen atom to large aromatic rings, and can be charged or include only non-polar saturated hydrocarbons [16]; see table 1 and figure 2B.

Classification | Amino acids
Non-polar | Glycine (G), Alanine (A), Valine (V), Leucine (L), Isoleucine (I), Proline (P), Methionine (M), Phenylalanine (F), Tryptophan (W)
Polar | Serine (S), Threonine (T), Asparagine (N), Glutamine (Q), Cysteine (C), Tyrosine (Y)
Acidic (polar) | Aspartic acid (D), Glutamic acid (E)
Basic (polar) | Lysine (K), Arginine (R), Histidine (H)


Table 1: Classification of the 20 amino acids based on their interaction with water [16]. The one-letter code of each amino acid is given in parentheses. Non-polar amino acids carry no local concentration of electric charge and are usually not soluble in water. Polar amino acids carry local concentrations of charge, and are either globally neutral, negatively charged (acidic), or positively charged (basic). Acidic and basic amino acids are classically referred to as electron acceptors and electron donors, respectively, which can associate to form salt bridges in proteins. Amino acids in solution are mainly dipolar ions: the amino group NH2 accepts a proton to become NH3+ and the carboxyl group COOH donates a proton and becomes COO-.
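The classification of Table 1 translates directly into a lookup table. The following Python sketch encodes it and computes the fraction of residues of a sequence that fall in each class. The names AMINO_ACID_CLASS and class_composition are illustrative, and the code assumes a sequence written with the twenty standard one-letter codes only.

```python
from collections import Counter
from typing import Dict

# One-letter codes grouped as in Table 1.
CLASSES = {
    "non-polar": "GAVLIPMFW",
    "polar": "STNQCY",
    "acidic": "DE",
    "basic": "KRH",
}

# Map each one-letter code to its class name.
AMINO_ACID_CLASS = {
    letter: name for name, letters in CLASSES.items() for letter in letters
}


def class_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of residues of `sequence` in each class of Table 1.

    Raises ValueError for an empty sequence and KeyError for a letter
    that is not one of the twenty standard codes.
    """
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(AMINO_ACID_CLASS[aa] for aa in sequence.upper())
    return {name: counts[name] / len(sequence) for name in CLASSES}
```

For a globular protein, a large non-polar fraction is expected, since these residues tend to form the core of the structure.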


Figure 2: The twenty natural amino acids that make up proteins. (A) Each amino acid has a main-chain (N, Cα, C and O) to which is attached a side-chain, schematically represented as R. Amino acids in proteins are attached through planar peptide bonds, connecting atom C of the current residue to atom N of the following residue. For the sake of simplicity, I omit the hydrogens. (B) Classification of the amino acid side-chains R according to their chemical properties. Glycine (Gly) is omitted, as its side-chain is a single H atom. Figure drawn using MOLSCRIPT [14].



2.3 Protein Structure Hierarchy


Figure 3: The three main secondary structure elements (SSE) found in proteins. For simplicity, side-chains and non-polar hydrogens are ignored. The protein backbone is shown with balls and sticks, and hydrogen bonds are shown as discontinuous lines. (A) The regular α-helix is a right-handed helix, in which all residues adopt similar conformations, with the backbone torsion angles φ and ψ close to -60° and -40°, respectively. The α-helix is characterized by hydrogen bonds between the oxygen O of residue i and the polar backbone hydrogen HN (bound to N) of residue i+4. Note that all bonds C=O and N-HN are parallel to the main axis of the helix. (B) An anti-parallel β-sheet. Two strands (stretches of extended backbone segments, with φ and ψ close to -120° and 120°, respectively) run in an anti-parallel geometry. The atoms HN and O of residue i in the first strand are involved in hydrogen bonds with the atoms O and HN of residue j in the opposite strand, respectively, while residues i+1 and j+1 face outwards. (C) A parallel β-sheet. The two strands are parallel, and the atoms HN and O of residue i in the first strand are involved in hydrogen bonds with the O of residue j and the HN of residue j+2, respectively. The same alternating pattern of residues involved in hydrogen bonds with the opposite strand and residues facing outwards is observed in parallel and anti-parallel β-sheets. A strand can therefore be involved in two different sheets. Figure drawn using MOLSCRIPT [14].

Condensation between the -NH3+ and the -COO- groups of two amino acids generates a peptide bond and results in the formation of a dipeptide. Protein chains correspond to an extension of this chemistry, resulting in long chains of many amino acids bonded together. The order in which amino acids appear defines the primary sequence or primary structure of the protein. In its native environment, the polypeptide chain adopts a unique three-dimensional shape, referred to as the tertiary or native structure of the protein [17]. The amino acid backbones are connected in sequence, forming the protein main-chain, which frequently adopts canonical local shapes or secondary structures, mostly α-helices and β-strands (see figure 3). The former is a right-handed helix with 3.6 amino acids per turn, while the latter is an approximately planar layout of the backbone. Helices often pack together to form a hydrophobic core, while β-strands pair together to form parallel or antiparallel β-sheets. Note that in addition to these two types of secondary structures, there is a wide variety of other commonly occurring sub-structures, referred to as super-secondary structures. More information on these sub-structures can be found in the work of Efimov [18-21].
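Two numbers quoted above for the α-helix, 3.6 residues per turn and hydrogen bonds between residues i and i+4, lend themselves to a short worked example. The Python sketch below derives the rotation about the helix axis contributed by each residue and lists the backbone hydrogen-bond pairs of an ideal helix. The function names are illustrative, residues are numbered from 1, and end effects in real helices are ignored.

```python
from typing import List, Tuple

RESIDUES_PER_TURN = 3.6  # value quoted in the text for the alpha-helix


def rotation_per_residue(residues_per_turn: float = RESIDUES_PER_TURN) -> float:
    """Rotation about the helix axis per residue, in degrees (about 100)."""
    return 360.0 / residues_per_turn


def helix_hbond_pairs(n_residues: int) -> List[Tuple[int, int]]:
    """Pairs (i, i+4): O of residue i is bonded to HN of residue i+4.

    Residues are numbered 1..n_residues; helices shorter than five
    residues contain no such pair.
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]
```

A helix of n residues therefore contains n - 4 backbone hydrogen bonds of this type, which leaves the first four HN groups and the last four O atoms of the helix without a partner inside the helix.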


2.4 Three types of proteins

Protein structures come in a large range of sizes and shapes. They can be divided into three major groups, corresponding to fibrous proteins, membrane proteins, and globular proteins.

Fibrous proteins are elongated molecules in which the secondary structure forms the dominant structure. They are insoluble, play a structural or supportive role in the body, and are also involved in movement (such as in muscle and ciliary proteins). Fibrous proteins often have regular repeating structures. Keratin, for example, which is found in hair and nails, is a helix of helices, and has a seven-residue repeating structure. Silk, on the other hand, is composed only of β-sheets, with alternating layers of glycines, and of alanines and serines. In collagen, the major protein component of connective tissue, every third residue is a glycine, and many of the others are prolines.


Membrane proteins are restricted to the phospholipid bilayer membrane that surrounds the cell and many of its organelles. These proteins cover a large range, from globular proteins anchored in the membrane by means of a tail, to proteins that are fully embedded in the membrane. Their function is usually to transport molecules, ranging from simple ions to nutrients, through the membrane. The structures of fully embedded membrane proteins can be classified into two major categories: the all-helical structures, such as bacteriorhodopsin, and the all-beta structures, such as porins (see figure 4). Note that as of October 2004, there are 158 structures of membrane proteins in the PDB, out of which 86 are unique (see http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html).



Figure 4: Two examples of membrane proteins. (a) Bacteriorhodopsin (PDB code 1C3W) is a mainly α protein, containing seven helices. It is a membrane protein serving as an ion pump, and is found in bacteria that can survive in high salt concentrations. (b) Porin (PDB code 2por) is a β-barrel. Porins work as channels in cell membranes, which let small metabolites such as ions and amino acids in and out of the cell. Figure drawn using MOLSCRIPT [14].



Globular proteins have a unique structure derived from a non-repetitive sequence. They range in size from about one hundred to several hundred residues, and adopt a compact structure. In globular proteins, non-polar amino acids have a tendency to group together and form the core of the protein, while polar amino acids remain accessible to the solvent. In the tertiary structure, β-strands are usually paired in parallel or anti-parallel arrangements to form β-sheets. On average, the protein main-chain consists of about 25% of residues in α-helices and 25% of residues in β-strands, with the rest of the residues adopting less regular structural arrangements [22].



Scheme | Description | Web address
PDB | Repository of protein structures | http://www.rcsb.org/
PDB at a Glance | Interface to PDB | http://cmm.info.nih.gov/modeling/pdb_at_a_glance.html
Molecules to Go | Interactive interface to the PDB | http://molbio.info.nih.gov/cgi-bin/pdb/
MSD | EBI interface to the PDB, with integration to EBI resources | http://www.ebi.ac.uk/msd/
PDBSum | Summaries and structural analyses of PDB files | http://www.ebi.ac.uk/thornton-srv/databases/pdbsum
Biotech Validation Suite | Suite of programs that performs quality control on protein structures | http://biotech.ebi.ac.uk:8400/
NRL_3D | Sequence-structure databases | http://laguerre.psc.edu/general/software/packages/nrl_3d/
Entrez | NCBI databases | http://www.ncbi.nlm.nih.gov/Database/index.html
SRS | Sequence Retrieval Services (includes structural information) | http://srs.embl-heidelberg.de:800/srs5/
DSSP | Database of secondary structures of proteins (available through SRS) | http://srs.embl-heidelberg.de:800/srs5/
TOPS | Generates a cartoon of the topology of a protein | http://www.tops.leeds.ac.uk/
PISCES | Protein sequence culling server: generates subsets of PDB based on users' criteria | http://dunbrack.fccc.edu/PISCES.php/
Astral | Databases and tools for analyzing protein structure; derived from SCOP | http://astral.berkeley.edu/


Table 2: Resources on protein structures



2.5 Geometry of globular proteins

From the seminal work of Anfinsen [23] we know that the sequence fully determines the three-dimensional structure of the protein, which itself defines its function. While the key to the decoding of the information contained in genes was found more than fifty years ago (the genetic code), we have not yet found the rules that relate a protein sequence to its structure [24, 25]. Our knowledge of protein structure therefore comes from years of experimental studies, using either X-ray crystallography or NMR spectroscopy. The first protein structures to be solved were those of myoglobin and hemoglobin [13, 26]. Currently (October 2004), there are nearly 27,700 protein structures in the PDB database [3, 4] of bio-molecular structures; see http://www.rcsb.org. (Note that this number overestimates the number of different structures available, as the PDB is redundant, i.e. it contains several copies of the same proteins, with minor mutations in the sequence and no changes in the structure.) Table 2 lists the web addresses of protein structure databases and the resources available for analyzing these structures.

As there are only two types of secondary structures (α and β), proteins can be divided into three main structural classes [27]: mainly α proteins [28], mainly β proteins [29-31], and mixed α-β proteins [32]. A fourth class includes proteins with little or no secondary structure at all, which are stabilized by metal ions and/or disulphide bridges. There has been significant effort put into automatically classifying protein structures into their main folding class: these efforts will be reviewed in the next section. In parallel, there has been significant work on predicting the folding class of a protein from its sequence. More details can be found in [33-40].
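To make the four classes concrete, here is a deliberately naive Python sketch that assigns a class from a per-residue secondary structure string, using the DSSP one-letter codes H (α-helix) and E (extended β-strand). The 15% threshold and the function names are assumptions chosen for illustration only; this is not one of the published classification methods cited above, which use more elaborate criteria.

```python
from typing import Tuple


def secondary_structure_fractions(ss: str) -> Tuple[float, float]:
    """Return the fractions of residues labeled H (helix) and E (strand).

    Every other character is counted as neither helix nor strand.
    """
    if not ss:
        raise ValueError("empty secondary structure string")
    return ss.count("H") / len(ss), ss.count("E") / len(ss)


def structural_class(ss: str, min_fraction: float = 0.15) -> str:
    """Assign a structural class; `min_fraction` is a hypothetical cutoff."""
    helix, strand = secondary_structure_fractions(ss)
    has_alpha = helix >= min_fraction
    has_beta = strand >= min_fraction
    if has_alpha and has_beta:
        return "mixed alpha-beta"
    if has_alpha:
        return "mainly alpha"
    if has_beta:
        return "mainly beta"
    return "little or no secondary structure"
```

A single global cutoff of this kind cannot distinguish alternating from segregated α and β elements, which is one reason why classification schemes also look at the order and spatial arrangement of the secondary structure elements.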

The mainly α class, the smallest of the three major classes, is dominated by small proteins, many of which form a simple bundle of α-helices packed together to form a hydrophobic core. A common motif is the four-helix bundle structure (see figure 5). The most studied α structure is the globin fold, which has been found in a large group of related proteins, including myoglobin and hemoglobin. This structure includes eight helices that wrap around the core to form a pocket where a heme group is bound [13].




Figure 5: Two different topologies of four-helix bundles. A bundle is an array of α-helices, each oriented roughly along the same (bundle) axis. A and C show a four-helix, up-and-down bundle with a left-handed twist, observed in hemerythrin from a sipunculid worm (PDB code 2hmz). B and D show a four-helix bundle with a right-handed twist, observed in a fragment of the dimerization domain of a liver transcription factor (PDB code 1g2y). A and B are cartoon representations of the proteins obtained with MOLSCRIPT [14], while C and D show the schematic topologies produced by TOPS (http://www.tops.leeds.ac.uk/).


The mainly β class contains the parallel and antiparallel β structures. In these, the β-strands are usually arranged in two β-sheets that pack against each other and form a distorted barrel structure. There are three major types of β-barrels: the up-and-down barrels, the Greek key barrels [41], and the jelly roll barrels (see figure 6). Most of the known antiparallel β structures, including the immunoglobulins, have barrels that include at least one Greek key motif. The two other motifs are observed in proteins of quite diverse function, where functional diversity is obtained by differences in the loop regions that connect the β-strands. β structures are often characterized by the number of β-sheets in the structure, and the number and direction of the strands in each sheet. This leads to a fairly rigid classification scheme [42], which is quite sensitive to the definition of hydrogen bonds and β-strands.




Figure 6: Three common sandwich topologies of β proteins: a meander (A and D) observed in a glycoprotein from chicken (PDB code 2cam), a Greek key (B and E) observed in an α-amylase (PDB code 1bli), and a jelly roll (C and F) observed in a gene activator protein from E. coli (PDB code 1g6n). A meander (or up-and-down) is a simple topology in which any two consecutive strands are adjacent and antiparallel. A Greek key motif is a topology of a small number of β-sheet strands in which some inter-strand connections exist between β-sheets. The jelly roll topology is a variant of the Greek key topology with both ends crossed by two inter-strand connections. A, B, and C are cartoon representations of the proteins obtained with MOLSCRIPT [14], while D, E and F show the schematic topologies produced by TOPS (http://www.tops.leeds.ac.uk/).
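The definition of a meander given in this caption is simple enough to be checked mechanically. In the Python sketch below, a sheet is described by the spatial order of its strands (each strand identified by its rank along the sequence) and by the direction of each strand (+1 or -1). This encoding and the function name is_meander are assumptions made for the example; they are not the representation used by TOPS.

```python
from typing import Dict, List


def is_meander(spatial_order: List[int], directions: Dict[int, int]) -> bool:
    """Check that strands consecutive in sequence are adjacent and antiparallel.

    spatial_order: strand identifiers as they appear side by side in the sheet.
    directions:    +1 or -1 for each strand identifier.
    The sheet is treated as open: no closure between its first and last strand.
    """
    position = {strand: p for p, strand in enumerate(spatial_order)}
    by_sequence = sorted(spatial_order)
    for s, t in zip(by_sequence, by_sequence[1:]):
        adjacent = abs(position[s] - position[t]) == 1
        antiparallel = directions[s] != directions[t]
        if not (adjacent and antiparallel):
            return False
    return True
```

For instance, four strands placed in the order 1, 2, 3, 4 with alternating directions form a meander, whereas the order 4, 1, 2, 3 does not, because strands 3 and 4 are consecutive in sequence but not adjacent in the sheet.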



The α-β protein class is the largest of the three classes. It can be subdivided into proteins that have a mainly alternating arrangement of α-helices and β-strands along the sequence, and those that have more segregated secondary structures. The former class can itself be divided into two groups: one with a central core of often eight parallel β-strands arranged together into a barrel surrounded by α-helices, and a second group that comprises an open, twisted parallel or mixed β-sheet, with α-helices on both sides (see figure 7). A particularly striking example of an α-β barrel is seen in the eight-fold β-α barrel (βα)8, which was found originally in the triose phosphate isomerase of chicken [43], and is consequently often referred to as the TIM barrel (for a complete analysis, see [44-51]). Many of the proteins adopting a TIM barrel structure have completely different amino acid sequences and different functions. The open α/β-sheet structures vary considerably in size, number of β-strands, and strand order.




Figure 7: Topology (A) and cartoon representation (B) of the TIM barrel. The protein chain alternates between β and α secondary structure types, giving rise to a barrel β-sheet in the center surrounded by a large ring of α-helices on the outside. This structure, first seen in the triose phosphate isomerase of chicken (PDB code 1tim, after which it is often named TIM barrel), has been observed in many unrelated proteins since then. The topology is drawn using TOPS (http://www.tops.leeds.ac.uk/), and the cartoon is generated using MOLSCRIPT [14].



2.6 Protein Domains

Large proteins do not contain a single large hydrophobic core, probably because of limitations in folding kinetics and stability. Single compact units of more than 500 amino acids are rare. Large proteins are in fact organized into "units" with sizes around 200-300 residues, referred to as domains [52-54]. For a detailed analysis of domains in proteins, see [55]. Domains are defined simultaneously as: (a) regions that display a significant level of sequence similarity; (b) the minimal part of a gene that is capable of performing a function; (c) a region of a protein with an experimentally assigned function; (d) a region of a structure that recurs in different contexts in different proteins; and (e) compact, spatially distinct units of protein structure. As more structures of proteins are solved, contradictions between these definitions appear. Some domains are compact while others are clearly not globular. Some are too small to form a stable domain, and lack a hydrophobic core. Currently, we are in the awkward situation in which the concept of structural domain is well accepted, yet its definition remains ambiguous [56]. This will be discussed in detail in the next section.



2.7 Resources on protein structures

All experimental protein structures available today are stored in the Protein Data Bank (PDB) [3], maintained through the RCSB consortium [4], and available on the web at http://www.rcsb.org/. Many services have been developed to supplement the PDB in order to ease access to the information it contains. For example, the services "PDB at a Glance" and "Molecules to Go" were designed as easy-to-use interfaces to the PDB with simple search engines. The MSD search relational database is derived from the PDB, and aims to provide a knowledge discovery and data mining environment for biological structure data. PDBSum [57, 58] and the Biotech Validation Suite are services from which quality control programs can be run to check the quality of a protein structure. NRL_3D, Entrez and SRS are integrated services that group the PDB with other databases on proteins. For example, SRS includes DSSP [59], a database of secondary structures of proteins. PISCES [60] and ASTRAL [61-63] can generate subsets of the PDB database, based on the user's criteria. Table 2 lists the web addresses of all these services.




  Page last modified 3 January 2005 http://www.cs.ucdavis.edu/~koehl/BioEbook/