http://www.cs.ucdavis.edu/~koehl/

Protein Structure Classification

1. Introduction

The molecular basis of life rests on the activity of large biological macro-molecules, including nucleic acids (DNA and RNA), carbohydrates, lipids and proteins. While each plays an essential role, there is something special about proteins, as they are the principal actors of cellular functions. In this paper, I describe the growing interest in unraveling the mysteries behind their functions, focusing on the effort to organize the information obtained from structural studies of proteins. First, I briefly relate this effort to the continuing development of scientific classification in biology.

Classification and biology

Classification is a very broad term that simply means putting things in classes. Any organizational scheme is a classification: objects can be sorted by size, color, origin, and so on. Classification is one of the most basic activities in any science, probably because it is easier to think about a few groups than about a whole population.

Scientific classification in biology probably started with Aristotle, in the 4th century B.C. He divided all living things into two groups, animals and plants. Animals were themselves divided into two groups, those with blood and those without (or at least without red blood), while plants were divided into three groups based on their shapes. Aristotle was the first in a long line of biologists who classified organisms in an arbitrary, though logical, way that made it easy to convey scientific information. Among these biologists, it is worth citing the 18th-century Swedish naturalist Carolus Linnaeus, who set formal rules for a two-name system called the binomial system of nomenclature, which is still used today. However, with the publication of "On the Origin of Species" by Darwin, the purpose of classification changed.
Darwin argued that classification should reflect the history of life, that is, that species should be related on the basis of a shared history. Systematic classifications were introduced accordingly; their aim is to reveal the phylogeny, i.e. the hierarchical structure by which every life form is related to every other life form. Recent advances in genetics and biochemistry, the wealth of information coming from the genome sequencing projects, and the tools of bio-informatics are playing an essential role in the development of these new classification schemes, by supplying classifiers and taxonomists with more and more data on the evolutionary relationships between species. Note that the genetic information used for classification is not limited to the sequences of the genes, but also takes into account the products of these genes and their contributions to the mechanisms of life. As function is related to shape, this is where protein structure classification will play a significant role in our understanding of the organization of life. To paraphrase Jacques Monod, it is in the protein that the secret of life lies [1].

The biomolecular revolution

All living organisms can be described as arrangements of cells, the smallest units capable of carrying out functions important for life. Cells can be divided into organelles, which are themselves assemblies of bio-molecules. These bio-molecules are usually polymers of smaller subunits, whose atomic structures are known from standard chemistry. There are many remarkable aspects to this hierarchy, one of them being that it is common to all life forms, from unicellular organisms to complex multi-cellular species like us. Unraveling the secrets behind this hierarchy has become one of the major challenges of the twentieth and now twenty-first centuries.
While physics and chemistry have provided significant insight into the structure of atoms and their arrangements in small chemical structures, the focus is now set on understanding the structure and function of bio-molecules. These usually large molecules serve as storage for the genetic information (the nucleic acids, DNA and RNA) and as key actors of cellular functions (the proteins). Biochemistry, the field that studies these bio-molecules, is currently experiencing a major revolution. In the hope of deciphering the rules that define cellular functions, large-scale experimental projects are carried out as collaborative efforts involving many laboratories in many countries. The main aims of these projects are to provide maps of the genetic information of different organisms (the genome projects), to derive as much structural information as possible on the products of the corresponding genes (the structural genomics projects), and to relate these genes to the function of their products, usually deduced from their structure (the functional genomics projects).

The success of these projects is completely changing the landscape of research in biology. As of October 2004, more than 220 whole genomes have been fully sequenced and published, corresponding to a database of over a million gene sequences (see http://www.genomesonline.org/ [2]), and more than a thousand other genomes are currently being sequenced. The need to store this data efficiently and to analyze its contents has led to the emergence of a collaborative effort between computer science and biology, referred to as bio-informatics. In parallel, the repository of bio-molecular structures [3,4] contains more than 27,600 structures of proteins and nucleic acids. The similar need to organize and analyze the structural information contained in this database is leading to the emergence of another partnership between computer science and biology, namely bio-geometry.
The combined efforts of bio-informatics and bio-geometry are expected to provide a comprehensive picture of the protein sequence and structure spaces, and of their connection to cellular functions. Note that the emergence of these two disciplines is often seen as a consequence of a paradigm shift in molecular biology [5], as the classical approach of hypothesis-driven research in biochemistry is being replaced with a data-driven discovery approach. I believe that the two approaches in fact co-exist, and that both benefit from these computer-based disciplines.

Outline

The next section describes proteins and surveys their different levels of organization, from their primary sequence to their quaternary structure in cells. The following section surveys automatic methods for comparing protein structures, and their application to classification. I then describe the existing protein structure classifications, focusing on SCOP, CATH, and the DALI domain classification. Finally, I conclude the paper with a discussion of the future of protein structure classifications.