Why sequence genomes? The European view in 1988

When thinking about primitive evolving systems, that what can start most easily is unlikely to be at all similar to what will be selected in the long run. [...] Our lungs are not improved gills, nor is our way of walking related to the locomotion of the amoeba.

Genetic takeover and the mineral origin of life
Graham CAIRNS-SMITH

Genomics: The early biotechnology research programme of the European Communities (BRIDGE)

In the mid-1980s, it was widely believed that the knowledge accumulated about genes and their associated functions, which had been collected in data libraries, was sufficient to fully characterise the genome of living organisms. For this reason, most researchers were reluctant to organise genome sequencing. This was considered a waste of valuable resources that could be allocated to other research programmes. Nevertheless, in the spring of 1987, I proposed sequencing the genome of the model bacterium, Bacillus subtilis, at a time when André Goffeau and Piotr Slonimski were just setting up a project on the yeast genome. This led us to get together and write a research program for the Biotechnology Action Programme of the European Union's predecessor organisation. It took several years for microbiologists to accept the necessity of this undertaking. As is often the case, most of the current players in the field, who had been hostile or reluctant at the outset, rushed to participate in the victory of the programme once its importance was recognized.

The text of this white paper for the European Commission represents the vision we could have had in 1988-1989. Many ideas that were later followed by work around the world are explicit here. It was also the prelude to a new area devoted to research in biology via the use of computers, biology in silico, following the expression we proposed at the end of 1989 at a meeting on the yeast sequencing program in Tutzing, Germany. However, the first discovery of our genome projects, that a large proportion of genes were completely unknown, both in terms of their function and their origin, had not been foreseen. This discovery, presented at the 1991 Elounda meeting in Crete and resulting from work on the yeast chromosome III and the Bacillus subtilis genome, marked the beginning of the interest of the entire scientific community in these projects.

BIOTECHNOLOGY ACTION PROGRAMME
BAP: 1988 - 1989

COMPLETE GENOME SEQUENCING FUTURE AND PROSPECTS

Antoine Danchin

SEQUENCING THE YEAST GENOME
A DETAILED ASSESSMENT

A possible area for the future biotechnology research programme of the European Communities (BRIDGE)

In the first chapter of this white paper, edited by André Goffeau, I was asked to explain why it would be important to sequence genomes, and to describe some likely fallouts of these programmes. The text that can be read here is the original text of the white paper, with some typographical, language and spelling errors corrected. It represents the view that we could have had back in 1988-1989. Many ideas that were later followed by work worldwide are explicit here. This is the case, in particular, of the idea of looking for a "minimal genome", made fashionable by Synthetic Biology. However, the first discovery of genome projects, that a large fraction of genes were completely unknown, both in function and in descent, was not predicted (this is to be expected for a discovery, of course!). That discovery, presented at the Elounda meeting in 1991, marked the onset of the interest of the whole community of scientists in these projects.

1. Introduction

The autonomy of living systems is the consequence of the internal self-consistency of their genetic program. Taken together, the rules specified by the DNA sequence determine the survival and reproduction of all organisms. This corresponds to a program which is finite in length. All « rewriting » rules that interpret the genetic program (transcription and translation) and fix the actual structure of all effectors of metabolism are entirely contained in the corresponding nucleotide and amino-acid sequences. Up to now the self-consistency of genomes has remained inaccessible, and it has not been possible to explore exhaustively the very nature of the signals that constrain either the diachronic or structural building or the actual functioning of the cell.

New techniques in DNA sequencing, as well as improvements in information handling, already allow us to revisit this situation. It appears that total sequencing of bacterial genomes is certainly accessible to experimentation, although DNA sequencing of whole genomes will require a vast amount of work, especially if one hopes to investigate the structure of the genomes of higher eucaryotes. It therefore seems of the utmost importance to organize the work and make proper choices for a productive strategy. It is clear that, as such, the sequence of the genome of a single entity will be of very high basic value; it seems also clear (and this will be detailed later) that determining the sequence of several genomes will increase the potential interest of sequence determination by a very large factor. Indeed, at the simplest level, sequencing will make available all signals that are important for the core machinery responsible for gene expression and replication. A further important result will be that we shall have access to information corresponding to the ecological niche of the organisms considered. Indeed, the DNA sequence information that remains after the core sequences have been subtracted¹ represents information specific to each type of organism. This will make it possible to understand how typical instances of various organisms cope with their environment.

The next domain which will be fertilized by knowledge of total genome sequences is the study of the evolution of species. This will result both from comparison of sequences within a given organism, and from comparison of homologous sequences in different organisms. A major consequence will be that we shall have pathways relating known structures (at the three-dimensional level) to unknown structures. This, most probably, will give major support to the general problem of predicting protein tertiary structure from primary structure alone.

Finally, when we are in a position to start sequencing genomes of multicellular eucaryotes, we shall start obtaining specific information on genes involved in cell differentiation.

In general it can be hoped that knowledge of the genes will help us to understand a variety of biological phenomena in a way more directly accessible than the usual converse way (i.e. going from the biological phenomenon to the gene).

2. The meaning of genomic self-consistency

Among the major features of living matter, autonomy is prominent. Since the number of metabolites, as well as the number of genes, is finite, it seems worth trying to identify some of the classes that interact to constitute the minimal set required to allow functioning of the smallest organisms.

Protein synthesis is, more than replication or intermediary metabolism, the central process involved in life. Indeed replication, as such, although thought to be the obvious feature of life, can be specific to organisms like viruses, which cannot live in the absence of preexisting life². This suggests that the cell's life is organised to cope with the environment around the protein biosynthesis machinery. It therefore seems likely that the ribosome will be a major memory of evolution, and that it will reflect pathways of evolution in the structure of its RNAs and proteins. This is indeed the reason why Carl Woese and his colleagues have chosen to study ribosomal RNA structure as a specific tool to identify phylogenetic trees. The polypeptide structure of ribosomal proteins, as well as codon usage in the corresponding genes, will most probably bring insights into the relationships between species, as well as help identify other ORFs³ in a given genome.

Replication is a second major function of the cell, and analysis of the genes involved in DNA handling should allow classification of related proteins. This should also make it possible to deal with the transcription machinery.

How much DNA sequence is required for these central items? From what is known at present it is possible to propose a rough estimate. In general one needs a set of ribosomal RNAs per 400-500 kb of DNA, and a ribosome corresponds to 50 to 60 proteins, plus initiation, elongation and termination factors. This yields about 60 kb of DNA. Translation also requires tRNAs and tRNA synthetases. Translation accuracy and coupling to the cell metabolism also require that the tRNAs have modified bases. A minimal set of tRNAs amounts to 25 genes, but a more reasonable estimate would be 40 different genes, permitting modulation according to the codon usage. One can estimate at 80 kb the DNA length necessary to accommodate this machinery^(N). From what is known about transcription it is also possible to evaluate the required amount of DNA, knowing in particular that RNA polymerase must have a complex structure (i.e. it must differ in complexity from polymerases such as bacteriophage T7 RNA polymerase) in order to permit accurate transcription of the genetic message. Control elements for initiation, termination and mRNA turnover are also needed. A minimum length of 25 kb seems necessary. In addition one probably needs 5 kb more in order to allow transcription/translation coupling.

Replication and DNA handling require a polymerase system, and many components that allow semi-conservative replication in its fundamentally asymmetric dynamics. Here too, proteins allowing accurate transmission of the genomic information are of fundamental importance, as well as a system for correcting mutational events induced by gene-damaging agents, which are always present in the environment. Finally, there is a specific metabolic control that has to be introduced, avoiding incorporation of uracil residues in the chromosome. All this amounts to at least 40 kb (including gyrase, topoisomerase, membrane DNA binding protein, etc.).

The central gene expression and replication machinery therefore requires at least 210 kb of DNA. This is a very significant figure, and it amounts to an appreciable fraction of the total DNA content of the chromosome of bacteria, for instance. This could justify sequencing of the corresponding regions as a start for a complete genome sequencing project. Two additional main features of a cell's organisation must now be taken into account: its membrane structure, and its intermediary metabolism. We shall try to arrive at a strict minimum evaluation of the gene information required, but this will certainly be less precise than what has been evaluated in the preceding paragraphs. Moreover one has to choose between making the building blocks from very simple molecules of the environment, and having specific permeation systems.

The minimal requirement might be that everything important is built outside, and permeated into the cell. This calls for a specific permease for every metabolite, and coupling of the permease to the energetics of the cell. How many metabolites are we to consider? At least 20 amino acids, 10 coenzymes, 5 nucleotides, glycerol and lipids. In addition, energy has to be continuously produced, or pumped from the medium: this corresponds to high energy organic phosphates. Also, because metabolites and macromolecules are inevitably altered, several scavenging and excretion systems must be present. It is difficult to evaluate the genetic information that must be present in order to provide all these functions. A minimum of 200 kb seems reasonable. It must be emphasized however that this does not allow synthesis or interconversion of the metabolites involved, nor does it take into account the carbohydrate metabolism of the cell. Indeed, it is not clear, at present, whether a metabolism involving carbohydrates is absolutely necessary for the cell to survive and multiply. However it seems that all living cells do transform carbohydrates not only for the purpose of making and storing energy, but also for maintaining the integrity of their envelope, be it only as protective or adhesive tools. If carbohydrates are to be a necessary requirement it seems that at least another 100 kb of DNA is a prerequisite in order to accommodate the appropriate genes.

Thus, 400 to 500 kb of DNA are necessary to allow the existence of the simplest cellular organism⁴. It seems therefore that all living systems will have a larger genome, as indeed they do. The extra information which is present in the vast majority of organisms includes fine tuning of gene expression (control processes) as well as strategic means to occupy a specific niche in the environment. It seems likely that this core genetic system will have features common to all organisms. Since this amounts to 10% of the genome of bacteria such as E. coli or B. subtilis, looking for appropriate homologies should give us a "nucleation center" for programs starting the investigation of unassigned DNA sequences. A genome is self-consistent, and it is likely that coevolution of most genes (i.e. all genes except those that might have been acquired recently, such as insertion sequences, transposable elements or proviruses) will have a strong relationship to structural features present in the core genes. This is an argument for trying to sequence, among other genomes, the genome of an organism having the smallest possible chromosome, simply as a tool for easy identification of the core genes⁵.
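The estimates given in this section can be added up explicitly. The short sketch below only restates the figures quoted in the text; the dictionary labels and function name are illustrative, and the 4,700 kb figure for E. coli is the one given later in section 3.1.

```python
# Rough DNA budget (in kb) for a minimal cell, using the estimates quoted in the text.
# Labels and names are illustrative; only the numbers come from the white paper.
CORE_MACHINERY_KB = {
    "ribosome (rRNA, proteins, factors)": 60,
    "tRNAs, synthetases, base modification": 80,
    "transcription": 25,
    "transcription/translation coupling": 5,
    "replication and DNA maintenance": 40,
}
TRANSPORT_AND_ENERGY_KB = 200     # permeases, energy supply, scavenging and excretion
CARBOHYDRATE_METABOLISM_KB = 100  # only if carbohydrates prove to be necessary
E_COLI_GENOME_KB = 4_700          # figure quoted in section 3.1


def minimal_genome_kb(include_carbohydrates: bool) -> int:
    """Sum the core machinery, transport and (optionally) carbohydrate budgets."""
    total = sum(CORE_MACHINERY_KB.values()) + TRANSPORT_AND_ENERGY_KB
    if include_carbohydrates:
        total += CARBOHYDRATE_METABOLISM_KB
    return total


core = sum(CORE_MACHINERY_KB.values())
low = minimal_genome_kb(include_carbohydrates=False)
high = minimal_genome_kb(include_carbohydrates=True)

print(f"core expression and replication machinery: {core} kb")
print(f"minimal cell: {low} to {high} kb")
print(f"share of an E. coli-sized genome: {low / E_COLI_GENOME_KB:.0%} to {high / E_COLI_GENOME_KB:.0%}")
```

The totals come out at 210 kb for the core machinery and 410 to 510 kb for the minimal cell, which is where the rounded "400 to 500 kb" and "10%" figures come from.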

Self-consistency will be reflected in all processes involving a global behaviour of the cell. This includes global transcriptional controls (such as catabolite repression, heat or oxygen shock, osmoregulation, etc.), as well as processes involving an integral membrane structure, such as protein secretion. This means that significant features must be present that are recognized as specific by the appropriate machineries: transcriptional control elements must be recognized by pleiotropic activators or repressors; alternatively repressors or activators may be cleaved by specific proteases, etc. Knowledge of a whole set of genes expressed in a coordinate fashion in a given organism will allow identification and appropriate description of their pertinent features, a feat that can only be tentatively approached in the absence of integral knowledge of all signals involved. It therefore seems apparent that determination of the total DNA sequence of an organism will make a major contribution to the understanding of hierarchically high-level controls. Obvious consequences, apart from basic knowledge of living systems, include the possibility of appropriate gene construction for industry.

3. Which organisms' DNA should be sequenced?

3.1. Procaryotes

If self-consistency is one of the major properties of the cell's genome, it seems important to identify which genomes would contribute most to the understanding of living systems. As stated above, it is of interest to get some insight into the core genes that should be present in all organisms. It is however difficult at present to know whether knowledge from bacteria could be extended to eucaryotes, and to what extent.

Mycoplasma, Chlamydiae and Rickettsiae are the organisms having the smallest known genomes. It might therefore be interesting to undertake sequencing of the genome of one such species. Among them Chlamydia trachomatis is a major agent of a very frequent sexually transmitted disease, and for this reason might be of interest. Its drawback, however, is that it is an obligate intracellular parasite, perhaps because it lacks many pathways of intermediary metabolism, and must feed at the expense of metabolites synthesized by its host. Yet this very drawback might be a further reason to undertake the sequencing of its genome, because this might be a major, if not the only, way to gain access to the function of its genes⁶.

Archaebacteria seem to be the next organisms having representatives with small genomes. From the point of view of research dealing with extreme environments and phylogeny they are probably organisms of choice. A problem, however, is to select appropriate candidates among them. Indeed they appear to represent a quite heterogeneous set with highly distinctive properties, such as methanogenesis, halophily, thermophily, etc. It will certainly be important in the future to undertake sequencing of some of their genomes⁷.

With other bacteria we are faced with a large number of organisms of major interest. Escherichia coli is obviously an organism of choice, and its genome (4,700 kb) will certainly be sequenced in the near future. The sequence of more than 300 of its genes is already known, and several groups have produced physical maps using restriction enzymes with recognition sites of 8 base pairs (Blattner and his colleagues, and Cantor and his colleagues) and even using 8 enzymes with recognition sites of 6 base pairs (Isono and his colleagues). Several discrepancies between fragments of this latter map and known maps cast, however, some doubt on its general validity. Nevertheless it seems appropriate to predict that most of the E. coli genome sequence will be known well within 5 years⁸.
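To see why enzymes with 8 base pair recognition sites give a much coarser map than enzymes with 6 base pair sites, one can estimate how often a site of a given length is expected to occur. The sketch below assumes, purely for illustration, that the four bases are equally frequent and independent along the sequence; real genomes deviate from this, so the numbers are orders of magnitude only. The function names are hypothetical.

```python
# Expected spacing of restriction sites under a simplifying assumption:
# the four bases are equally frequent and independent (not true of real genomes).
E_COLI_GENOME_BP = 4_700_000  # 4,700 kb, the figure quoted in the text


def expected_site_spacing(site_length: int) -> int:
    """Average distance (bp) between occurrences of one specific site."""
    return 4 ** site_length


def expected_site_count(genome_bp: int, site_length: int) -> float:
    """Expected number of occurrences of one specific site in the genome."""
    return genome_bp / expected_site_spacing(site_length)


for length in (6, 8):
    spacing = expected_site_spacing(length)
    count = expected_site_count(E_COLI_GENOME_BP, length)
    print(f"{length} bp site: one every {spacing} bp, about {count:.0f} sites in the genome")
```

Under this assumption an 8 base pair site is expected a few tens of times in a genome of this size, against more than a thousand times for a 6 base pair site.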

Fragmentary data from other enterobacteria (mainly work on Salmonella typhimurium and also on Klebsiella pneumoniae, which are close relatives of E. coli, and on more distant organisms from the genera Erwinia and Yersinia) indicate that what is known for E. coli could be extended, except for very specific genes, to other enterobacteria. Accordingly, genomic sequence determination should be undertaken for other classes of eubacteria.

Among Gram-negative bacteria, Pseudomonads are a highly disperse group of organisms. It seems clear that their biotope corresponds to an extremely powerful ability to catabolize all sorts of molecules. As they appear to differ widely from Enterobacteria it seems interesting to undertake genomic sequence determination for at least one such organism. Indeed this might result in obtaining a phylogenetic tree for families of catabolic enzymes, which might be of importance in projects dealing with protein tertiary structure prediction as well as for industrial purposes. A drawback of such a study is that most pseudomonads are relatively GC-rich, which raises technical sequencing problems and, above all, yields non-significant ORFs. Two organisms seem good candidates for the project: Pseudomonas aeruginosa, because it is an opportunistic pathogen and because some of its gene organisation is already well known⁹, and Xanthomonas campestris¹⁰, because it is a plant pathogen of world-wide importance related to another pathogen (Pseudomonas solanacearum¹¹), and because it is used on a large scale in industry for production of a food stabiliser (xanthan gum¹²).

Among the Gram-positive organisms, sporulating bacteria seem of major interest, because of this specific way of coping with hostile environments. Bacillus subtilis is a paragon of such organisms, and its genetics is well developed. A project for genomic sequencing of this organism therefore seems of high priority¹³. Indeed, in addition to the obvious basic and industrial interest of this organism, its high AT content should be of considerable help both for sequencing and for assigning significant ORFs. Many other bacteria from this group are of interest, but they still represent very disperse families, and there do not yet appear to be significant reasons to prefer one rather than another. However, as soon as sequencing techniques have improved it seems clear that lactic acid bacteria, Corynebacteria and Clostridiae should be considered.

This summarizes the obvious projects that should be undertaken in the near future with procaryotic organisms. Several classes remain, however: notably, blue-green algae, for their importance with respect to photosynthesis, and Actinomycetes, because they have a fascinating secondary metabolism. Their high or very high GC content, as well as their usually large genome, precludes, however, considering them for DNA sequencing projects in the near future.

3.2. Eucaryotes

3.2.1. Unicellular organisms

With eucaryotes we deal with organisms having much larger genomes, and the strategy, as well as the choice of organisms, should be considered with even more caution. If one considers the amount of literature dealing with such organisms it clearly appears that they are central to most research projects. The reason is that eucaryotes have led to cell differentiation, which is of obvious interest for one particular eucaryote, man! Ultimately, it seems, our anthropocentric view will certainly consider as the first priority a project which could result in knowledge of the sequence of the whole human genome. Apart from ethical problems, which should certainly be considered with no less high a priority, the scale of the project (1,000 times the scale considered for a bacterial DNA sequencing project) is so large that one has to perform several intermediate studies in order to allow scaling up. This means that eucaryotes with small genomes should be considered first, if only for this reason and not merely for their own interest.

Following the line of reasoning that we have proposed in the preceding section, it seems interesting to start with organisms which could represent the "core" of a eucaryotic cell. Fungi and yeasts, as well as unicellular eucaryotic algae, may be appropriate choices.

Obviously, however, the case of Saccharomyces cerevisiae seems most appropriate, especially because its genetics is extremely well developed. Gene libraries are already available, and 17 chromosomes have been identified¹⁴ and are easily purified by pulsed field gel electrophoresis. The overall DNA content of a haploid strain amounts to ca 15,000 kb¹⁵. Therefore, although significantly larger than the E. coli genome, the yeast genome seems accessible to a project of total DNA sequencing. Obviously it will be of the utmost interest to compare genes corresponding to the "core" set, as well as genes of intermediary metabolism, to homologous genes in bacteria. In addition nucleo-mitochondrial interactions will be put in the limelight once appropriate targeting signals have been identified (some are already identified, which should help clarify this issue).

Several fungi are of basic and industrial interest and might be amenable to a project of total genome sequencing. This is the case, for instance, of Aspergillus nidulans (26,000 kb)¹⁶. In this case, however, it is not clear whether the laboratory strains are isogenic and to what extent they are polymorphic. Preliminary coordination between people involved in research in this field would therefore be a prerequisite for a sequencing project.

Many other unicellular eucaryotes are worth considering, and protozoa might be of great interest. However, not much is known about the actual structure of their genomes, and they should only be considered once appropriate breakthroughs have been made in cloning and sequencing techniques.

3.2.2. Plants and Metazoa

The next step, clearly, concerns multicellular organisms. However the total length of DNA involved now becomes large or very large, and one should certainly launch a detailed investigation of genome length before starting.

In the case of plants the genome of Arabidopsis thaliana seems very compact (ca 70,000 kb)¹⁷, and since this plant exhibits all characters of interest (including nucleo-mitochondrial and nucleo-chloroplastic interactions, as well as availability for gene transfer) plant molecular biologists should consider starting a project of total genome sequencing in the not too distant future. In the meantime it seems worth considering whether small genome variants of this plant exist, and evaluating the number and nature of internal repetitive sequences.

With respect to animal organisms, small genomes exist in several types of insects. Since the genetics of Drosophila melanogaster is so well developed it seems worth investigating the potential for cloning and sequencing its whole genome¹⁸. Estimates of its length usually correspond to values somewhat lower than that of Arabidopsis (ca twice the yeast genome length). There are quite a few repetitive sequences, and rRNA genes account for approximately 1,000 kb.

As the size of genomes increases, the length of repetitive sequences, as well as of sequences behaving as "archives" signalling past evolution, becomes dominant. In fact, when one considers mammals for instance, it appears that at most 10% (and probably significantly less) of the total DNA sequence of the genome has a true (and not mainly historical) significance in gene expression¹⁹. Therefore, there is a very large risk that pertinent information will be swamped by information that could only be interpreted in the light of past evolution. In particular the self-consistency of the genomic information becomes less and less visible, and most internal controls (such as those used for assigning proper reading frames) are lost. In this respect the huge program of human DNA sequencing seems highly premature, and it is not at all clear whether we shall be in a position to undertake a thorough investigation of the corresponding informational content in the near future²⁰.

4. Technical aspects

4.1. Biochemistry

In principle DNA sequencing techniques are at present very efficient. In fact, systems using fluorescent labelling of DNA, coupled to the dideoxy chain termination technique of Sanger, allow sequencing of several thousand nucleotides a day. However there is, at present, a very strong limitation due to repetitive sequencing artefacts and, above all, to cloning procedures. Even artefacts can easily be resolved when the starting points of the sequenced fragments are changed: this corresponds to cloning different fragments encompassing the same region. Therefore the most tedious part of sequencing is the cloning of overlapping fragments²¹.

Random cloning of fragments has often been used, either with the help of restriction enzymes, or after physical fragmentation of the DNA. Unfortunately, however, it frequently appears that some regions are cloned very often, while others are completely absent. This makes it necessary to clone specific regions as such, after appropriate restriction sites have been identified from the sequenced regions. For this reason, many authors now prefer techniques involving cloning of "large" fragments (usually 2 to 4 kb) followed by creation of specific "nested" deletions. This technique allows easy alignment of overlapping fragments, and it can be combined with random cloning of "large" fragments. It does not, however, completely eliminate the difficulty of getting overlapping fragments in specific regions. This can be solved by using a "walking" procedure after synthesis of specific oligonucleotide primers, complementary to sequenced upstream regions. In fact, although such a procedure introduces a delay in obtaining the sequence, it is faster than trying to clone the missing region (or to obtain it from an appropriate deletion).

All considered, it is generally accepted that, with the presently available techniques, it requires one person working for one year to sequence (without gaps or errors!) 30-50 kb of DNA²². This figure is very different from the figures found in much advertising material, but it actually fits with what one finds when appropriate questions are put to sequencing laboratories. Clearly, the cloning steps seem to be the major limitation at present. Sequencing machines will certainly help in collecting data and entering them into computers, but this will not remove the main bottleneck of the technique. It seems that the length of what can be accurately sequenced is a critical factor: roughly speaking, if one multiplies this length by n, the time required to sequence is divided by n². The immediate goal should thus be to invent machines able to sequence accurately, and repetitively, 1 kb of DNA. This would reduce the time for one person to sequence 30-50 kb of DNA (the length of a cosmid) to about 3 months. Finally, all techniques allowing direct sequencing from the original libraries should be developed and improved.
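The rule of thumb above (multiplying the accurately readable length by n divides the sequencing time by n²) can be written as a one-line function. This is only a sketch of the stated scaling, not a model of the laboratory work; the function name is illustrative. Note that the quoted drop from one year to about 3 months corresponds to n = 2, i.e. it implies that 1 kb is roughly twice the length that could then be read accurately. That doubling is an inference from the figures, not something the text states.

```python
def months_to_sequence(baseline_months: float, read_length_factor: float) -> float:
    """Apply the text's rule of thumb: time is divided by n**2 when read length is multiplied by n."""
    if read_length_factor <= 0:
        raise ValueError("read_length_factor must be positive")
    return baseline_months / read_length_factor ** 2


BASELINE_MONTHS = 12.0  # one person-year for 30-50 kb (one cosmid), as quoted in the text

for n in (1, 2, 4):
    months = months_to_sequence(BASELINE_MONTHS, n)
    print(f"read length x{n}: {months:g} months per cosmid")
```

With n = 2 the function returns 3 months, matching the figure in the text.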

Many other improvements could be made, and it even seems worth investigating whether other means of identifying DNA sequences could be proposed (for instance sequential specific fragmentation coupled to mass spectroscopy). One would like to encourage physicists and chemists to investigate the wildest hypotheses²³.

4.2. Computer Sciences

Curiously enough, not much is generally said about information sciences in projects dealing with genome sequencing. The ultimate goal of sequencing, however, is to obtain a large text, written with an alphabet of 4 letters, and having an unknown meaning that one would very much like to decipher. This should involve a vast amount of information processing with the most sophisticated techniques available in computer sciences.

The first important aspect in this domain is the use of information sciences to minimize errors in sequencing. There should be, on sequencing machines, a first step of data processing which would use the internal consistency of the genome as a test for signalling possible errors. An important breakthrough at this level, which would require rather sophisticated computer science techniques, would be for the software to be able to learn from the input data and improve its control procedures. For instance, assignment of ORFs might be improved as the codon usage table becomes more accurate. It seems important that a first on-line processing of the sequence be available during data acquisition: this requires analysis of pertinent signals, restriction sites, ORFs, and physical data (e.g. information coming from "compressions" or stops, due to the physical structure of the DNA)²⁴.
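As a concrete illustration of the two objects mentioned here, ORFs and a codon usage table, the toy sketch below scans the three forward reading frames for ATG...stop stretches and counts the codons they contain. It is an assumption-laden simplification, not the procedure envisaged in the white paper: it uses the standard start and stop codons, ignores the reverse strand, and the `min_codons` threshold and the test sequence are invented for the example.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) of ATG...stop ORFs in the three forward frames (end is exclusive)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG in this frame opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def codon_usage(seq: str, orfs: list[tuple[int, int]]) -> Counter:
    """Count codons inside the given ORFs, excluding the terminal stop codon."""
    seq = seq.upper()
    counts = Counter()
    for start, end in orfs:
        for i in range(start, end - 3, 3):
            counts[seq[i:i + 3]] += 1
    return counts


toy_sequence = "CCATGGCTGCTAAATGACC"  # invented example: ATG GCT GCT AAA TGA in frame 2
orfs = find_orfs(toy_sequence)
usage = codon_usage(toy_sequence, orfs)
print(orfs)
print(usage.most_common())
```

A table accumulated this way over confirmed genes could then be used to score new candidate ORFs, which is the kind of learning from input data that the paragraph above calls for.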

Fast screening of libraries (preferably protein databanks) using stretches translated from putative ORFs might also be of interest at this stage. Indeed, finding homologous sequences would certainly improve checking of frameshifting errors.

The most important work, however, should be performed when significant (i.e. 1 kb) stretches of sequence are accurately known. The study of DNA (RNA) structure, as well as protein structure, is an ultimate goal of sequencing. It is clear that, at present, this is an inaccessible target, but as time elapses more and more is known, which will allow investigators to predict the tertiary structure of a protein from its primary sequence. In this respect identification of homologous sequences is a very important step, and much should be done in order to discover appropriate homologies. Many different approaches, either with standard statistical techniques or with artificial intelligence, are at present being developed in many laboratories. It should be pointed out here, however, that if a project on total genome DNA sequencing is launched, it would be most useful to initiate in parallel a coordinated research program on the different aspects of computer sciences related to sequence analysis. Apart from the aspects that have just been described, it seems clear that other domains of biology will benefit from intelligent analysis of sequences, including the fields of evolution and the origin of life.

5. Likely fallouts

The different projects that have been briefly outlined above involve a significant amount of work: as an example, the smallest projects (1,000 kb) require at least 25 person-years of work, and the bacterial project would require five times that figure. It therefore seems natural to list briefly a few fallouts that should occur during, or at the outcome of, such projects.

The first, direct, result is the availability of a precise, and final, genetic map of an organism. As a consequence, any new gene, isolated on the basis of its biological properties, will be placed on the map and related to adjacent genes and functional units. This is especially interesting in the case of bacteria, where genes are often organized into operons, but for eucaryotes, where introns play a major role, this might also be of extreme importance. A corollary is that probes for any gene would be available: this is clearly of importance in many fields, from the food industry to the medical sciences. In addition, if the organisms are well chosen, they can be used to probe organisms placed in the same lineage.

A second important result will be that genes should be associated in families, according to specific features corresponding to collective regulation. This will allow investigators to uncover important regulatory patterns which would otherwise be overlooked. As a consequence, the hierarchical organization of cell gene expression will be addressed in a synthetic manner. This is of major importance if one is to use organisms for industrial purposes, or if one has to think of appropriate mechanisms to control the spreading of a given organism (in medicine for instance). Another aspect of such whole cell control will be the study of protein export, or membrane integration. Industrial consequences are obvious.

A third result concerns proteins, and their likely folding patterns: classification will permit us to find homologies, and proteins will be grouped according to families of activities (for instance enzyme activity). It seems likely that catalytic activities could be grouped as clusters of related structural shapes. Indeed, in many cases one will find that specialized functions derive from a single ancestor enzyme with broad specificity²⁵. This will be true especially in the case of catabolic enzymes. Thus, evolution will have provided a large set of structure/function couples that will certainly help in designing new activities after localized mutagenesis. In a more basic direction, knowledge of whole genome sequences will help in building phylogenetic trees, and the question of the origin of life will be addressed on more specific points. It seems likely, indeed, that the constraints which underlie DNA replication will be prominent when considering a whole genome. One can even hope to find rules that have been at the root of gene construction and evolution, and to form ideas about the nature and structure of ancestral genes.

Footnote of the original white paper

N An average protein has a molecular mass of ca 30 kDa, corresponding to 1 kb of DNA, but complex structures such as tRNA synthetases, DNA polymerases, etc. require much more. In general one might estimate the gene length required for one site (catalytic, regulatory, etc.) at 1 kb.

Notes (2012)

1 This is what we now name the cenome of a particular species, i.e. the genes that allow the species to live in a particular environment. "Cenome" has been chosen to remind us of the concept of biocenosis (biocenose) proposed by Karl Möbius in 1877.

2 Viruses are not alive, even when they are very large. They need life, and in particular the whole translation machinery (notably ribosomes), to propagate.

3 At the time when this white paper was written it was common to define genes as Open Reading Frames (ORFs, for short). Yet this is a very misleading definition. An ORF is a stretch of nucleotides, a multiple of three in length, between two putative translation termination codons (UAA, UAG, UGA). However, a protein CoDing Sequence (CDS) must start with an AUG (possibly GUG or UUG) codon. Hence, a CDS is always contained within an ORF (there are exceptions: some CDSs span termination codons read as coding for the rare amino acids selenocysteine and pyrrolysine, and in some cases they may also involve a "programmed frameshift"). Lacking this proper definition, many genome sequences were wrongly annotated, with long ORFs automatically taken for CDSs. The first sequence annotations were also full of spurious, fairly short genes.

4 Remarkably, this was exactly the size of the smallest autonomous living organism known six years later, Mycoplasma genitalium, and this figure held for a long time.

5 This argument was known to European and French agencies at the highest political level in 1989, and it is therefore quite revealing (and very unfortunate) that, in the end, it was the group of Craig Venter, who became involved in microbial genomics much later, that succeeded, in 1995, in sequencing the genome of M. genitalium.

6 The sequence of the 1 Mb genome of C. trachomatis was deciphered only in 1998. Lack of political will and reluctance of the scientific community, in Europe and in particular in France, went against genome projects.

7 The 1.74 Mb sequence of the genome of Methano(caldo)coccus jannaschii was deciphered by C. Venter's group in 1996.

8 This prediction was far too optimistic, and based on the hype created by some mass-media oriented investigators. A bitter fight within the USA, and between the USA and Japan, meant that the sequence was known only in 1997, after a worldwide appeal from E. coli specialists (including ourselves) asked the US federal government, in 1995, to support the project.

9 Its 6.26 Mb sequence was deciphered in 2000.

10 It is now established that this organism does not belong to Pseudomonads, but to a distinct clade of gamma-Proteobacteria (Xanthomonadales). The 5 Mb sequence of a first strain was deciphered in 2002 by a Brazilian consortium.

11 This species is no longer accepted as a Pseudomonas sp. Its new name is Ralstonia solanacearum, a member of beta-Proteobacteria, not gamma-Proteobacteria. A first genome sequence was obtained in 2002 at the Genoscope, France.

12 Considerably more applications were derived from xanthan gum.

13 We initiated the B. subtilis genome project, which was completed in 1997. For five years this was the only AT-rich large genome that had been sequenced.

14 Sixteen chromosomes and one set of mitochondrial chromosomes. The sequence was deciphered in 1996 by a consortium, mainly European, led by André Goffeau.

15 In fact 12.16 Mb. The initial strain sequence was a mosaic of sequences from various strains.

16 The sequence (30.21 Mb) was obtained in 2005, alongside those of A. oryzae and the pathogen A. fumigatus.

17 In fact, much longer: 119 Mb spread over 5 chromosomes, plus sets of mitochondrial and chloroplast chromosomes. The genome was sequenced in 2001 by an international consortium.

18 Despite its obvious interest, D. melanogaster was not given priority, but was replaced by the nematode Caenorhabditis elegans, a remarkable organism (it has a fixed number of cells and it is transparent) that is, however, very distant from the most important metazoa.

19 At the time of this white paper I had not explored the idea of logical depth (just proposed by Charles Bennett) and I could not consider that all sequences are in fact meaningful: there is no such thing as "junk" DNA.

20 In fact a first draft sequence of the human genome was available at the turn of the century, driven by the competition between Craig Venter and the international academic community.

21 These drawbacks of the sequencing techniques then available have been entirely resolved since 2005-2007 with techniques that can directly sequence randomly fragmented DNA libraries. The problem of repeated sequences, however, remains a difficult hurdle that cannot be overcome with fast approaches. This explains why, in 2012, most newly sequenced genomes are not, in fact, completely assembled. However, a new technique, pore sequencing, is about to succeed in this domain and will solve the problem.

22 Progress has been so fast that this figure can now reach gigabases per day per person... However, as stated in the preceding note, this results only in unassembled genomes. Solving the problem of repeated sequences, when they are numerous, was not yet possible with a fast approach in early 2012. Now that sequencing of fragments longer than a few kilobases is fast and efficient, the problem is resolved.

23 This is what happened, and in 2012 the amount of sequence data exceeds the capacity for sequence analysis, global storage and data exchange. The new hurdle is therefore the mismatch between the slow pace of Moore's law and the growth of genome sequencing capacity.

24 This approach was never implemented. It would have extracted much biological information from Sanger sequencing machines. It is no longer relevant to the present sequencing techniques.

25 This prediction was substantiated by much work. Protein promiscuity or moonlighting activity is the most recent avatar of the prediction.

PIONEERING GENOME STUDIES (1988-1989)