Genetics of Bacterial Genomes

The 2006-2009 programme


Three major revolutions have marked the development of biology in the past decades. Molecular biology, based on the study of microbes and their viruses, the bacteriophages, uncovered the rules of gene expression and the way expression is controlled in time and in space. The replication of DNA, its transcription into messenger RNA molecules and their translation into proteins using the universal genetic code unified our view of what life is. In parallel, extraordinary technological developments in computer power, together with the sudden access to the very nature of genes through sequencing, triggered the onset of the genomics revolution just two decades ago. While microbes still play a seminal role there, much emphasis was placed on the Human Genome, and on analyses based on the study of nucleated cells. Simple eukaryotes such as budding and fission yeasts, as well as the old Drosophila model, together with the new nematode model, Caenorhabditis elegans, entirely renewed our way of considering the ontogenetic development of multicellular organisms. All the work pertaining to these objects concentrated on concepts of regulation cascades (thousands of studies are devoted to cyclic AMP, protein phosphorylation, central regulators such as NFkappaB...), showing that science is very much subject to fashion, as can be (unfortunately?) expected from a community of some 2.5 million people. In this context a few new concepts emerged, among which apoptosis (the cell suicide behaviour that is at the root of multicellular organization into complicated patterns, and that also controls some protective behaviour against aging and anarchic multiplication) played a central role. An overlooked family of macromolecules, small untranslated RNAs (labelled under a variety of names: sncRNA, ncRNA, suRNA with their avatars, microRNAs, etc.), was suddenly placed in the limelight.
As a matter of fact, the lack of interest in RNA molecules other than ribosomal RNAs and tRNAs as essential components of the cell's biology probably goes back to the very origin of the discovery of the lactose operon. Jacob and Monod, in their seminal paper on the control of lactose operon expression, postulated that a regulator should exist, and that this regulator would best be an RNA molecule. It was rapidly shown that the lactose repressor, LacI, was in fact a protein, and the RNAs that were continuously being discovered were thought of as anecdotal features of gene expression. The discovery of RNA splicing, aptamers, ribozymes, riboswitches, microRNAs and the like placed RNA at the centre of gene expression again a decade ago, and RNA metabolism is now a quite fashionable (and important) research topic.

It is in this context that, a few years ago, an engineering view of biology combining genomics and integrative biology (Symplectic Biology, a part of which is known as "Systems Biology") triggered the start of a new revolution, Synthetic Biology. It is also in this context that the work of the Unit, Genetics of Bacterial Genomes, will develop in the next few years.


Summary of the 2003-2006 activity


The revolution of genomics that has recently transformed biology is steadily producing new spectacular discoveries. While the world of mass media tends to concentrate almost exclusively on the “Human Genome” project, it is more and more obvious that we will not be able to understand much of this genome if we do not possess a thorough knowledge created by the study of powerful models, microbes in particular. This is the basic reason that drives all the major Genome Centres in the world to develop the study of microbial genomes and this is at the core of the effort coordinated by the European Network of Excellence BioSapiens.

As cases in point, the two major discoveries derived from genome studies in the past fifteen years stemmed from the analysis of two microbial genomes: that of Saccharomyces cerevisiae and that of Bacillus subtilis (with a major contribution from the Unit of Regulation of Gene Expression, which preceded the present Unit). The first one demonstrated that a fair number of genes are not fixed, belonging to a given organism, but tend to propagate from organism to organism (“horizontal” gene transfer, a concept for which we provided strong experimental evidence as early as 1991). The second one was totally unexpected: it showed that a very high proportion of the genes present in a genome, whatever the organism, have no known function. This is all the more surprising because we now know more than two thousand genome sequences. In this context, the work in the Unit, initially in collaboration with the activity developed by the first programme of the HKU-Pasteur Research Centre (created in Hong Kong in 2000), consists in exploring these unknown functions. To this aim, experimental work at the bench is combined with work in silico (using computer programmes) to perform conceptual experiments that serve as references and predictions for experiments performed at the bench. The central question explored in the Unit is whether, and if so why, genes are not distributed randomly in the chromosomes. It is obvious that the accidents that occur continuously during reproduction lead genes to be modified, to disappear, or to change place. One would therefore expect that, after some time, a more or less random distribution of the genes in genomes would prevail. However, the very idea that founded conceptual genomics, derived from the idea of the “genetic programme”, is that a cell behaves more or less as a computer does, where the machine is truly separated from the data and programmes it works with.
However, a computer is not able to duplicate itself. What else is therefore needed? At the end of the 1940s John von Neumann made the hypothesis that, if this were to happen, then one should find somewhere an image of the machine. This drives the quest of the Unit: its scientists try to find out whether the cell and its programme are organized structures. In concrete terms, is the order of the genes random in the genome? In parallel, where in the cell are the gene products located: does one find them everywhere? Finally, what are the driving forces that lead to gene organisation in genomes?

An important part of the work in the Unit therefore consists, on the one hand, in organizing the data that make up biological knowledge (Ivan Moszer, with the construction of the GenoList databases, until he left the Unit to set up a new axis at the Genopole of the Institute, the Technology Platform N°4 at the Genopole, and the BIOSUPPORT programme at the HKU-Pasteur Research Centre) and, on the other hand, in analysing genome structure (Eduardo Rocha and scientists from the HKU-Pasteur Research Centre). The most surprising discovery made during 2003 was that the genes that are essential to the life of bacteria are distributed along the replication leading strand of the DNA double helix, and that this is not directly correlated with a high level of expression. This is accounted for by the absence of conflict between transcription and replication for these genes: the collisions that occur when genes are located on the lagging replication strand must often create truncated messenger RNAs, and hence truncated proteins. This discovery further indicates that the products of the genes that are essential under laboratory conditions systematically belong to complexes formed by the association of several proteins, for one could hardly explain the toxicity of a truncated product unless it destroys the complex it is part of (think of a building with truncated beams!).
In parallel, the Unit participated in the deciphering of the complete genome sequence of three bacteria: Leptospira interrogans (in collaboration with the Genome Centre in Shanghai), bacteria that are particularly dangerous and infect peasants working in rice paddies; Staphylococcus epidermidis (in collaboration with the same Centre and Fudan University in Shanghai), bacteria present in the environment and important in nosocomial infections; and Photorhabdus luminescens (sequenced in the Laboratory of Pathogenic Microorganisms Genomics in the Institute), highly virulent against insects, including mosquito larvae (Jean-François Charles, Sylviane Derzelle, Evelyne Krin and their coworkers). Our involvement in deciphering genomes led us to participate in the European Union Network of Excellence BioSapiens. The year 2005 was marked by the deciphering of the genome sequence of a bacterium from Antarctica, Pseudoalteromonas haloplanktis TAC125, in collaboration with the Genoscope and the Universities of Liège (Belgium), Naples (Italy) and Stockholm (Sweden). Many new rules of genome and proteome organization were discovered in the Unit during 2004-2006, and we refined the concept of gene essentiality by extrapolating from the way genes are conserved in sequence and in distribution in large ensembles of genomes. “Persistent” genes define essentiality in natural conditions of life. In the same way, the amino-acid composition of proteins follows rules that tell much about the origin and function of gene products. We propose that aromatic amino acids create a universal bias in some proteins. Expressed orphan proteins are enriched in these residues, suggesting that they might participate in a process of gain of function during evolution. We postulate that the majority of them are proteins (“gluons”) involved in stabilising complexes, thus defining the “self” of the species.
Finally, as a further demonstration that there is indeed a strong organising principle in genomes, we were able to prove, in collaboration with Massimo Vergassola, a conjecture that we had made a long time ago: the very process of translation organises bacterial genomes. Indeed, genomes are a patchwork of long stretches of DNA in which genes are linked by a common codon usage bias.

At this stage, it is of prime importance to understand where the gene products are located inside the cell. The study of uridylate kinase by the group of Anne Marie Gilles is exploring this domain, which is developed in collaboration with the Department of Biochemistry at the University of Hong Kong (Jiandong Huang and his colleagues, within the Procore research program). Another approach, developed in the Unit for several years, is to understand the organisation in the cell of the production of molecules containing sulfur. This is because of the extreme versatility of this atom in terms of oxido-reduction states. The study of sulfur metabolism has therefore been emphasized (Isabelle Martin-Verstraete in Paris and Agnieszka Sekowska in Hong Kong) in particular because the knowledge of this atom was often extremely limited because of the difficulty of the genetic and biochemical studies of sulfur-containing molecules. We have unravelled several new pathways in Bacillus subtilis (one of the two most studied bacteria) and further characterized the methionine salvage pathway that was unfolded in part during the past years.

Finally, the year 2003 witnessed the dangerous development of the outbreak of atypical pneumonia (Severe Acute Respiratory Syndrome, SARS), and we judged it important to participate in the fight against the disease, on the one hand with theoretical studies on the genomes of coronaviruses (at the HKU-Pasteur Research Centre), and on the other hand with an epidemiological model meant to give an idea of the origin of the disease and of its development (in collaboration with INRIA and the Department of Mathematics at the University of Hong Kong). The proposed model, that of a “double epidemic” caused by a first innocuous virus that can mutate in certain patients and lead to SARS, fits well with observations in the field (in particular the large difference between diverse regions in China). This model suggests that the initial virus could persist as an endemic mild pathogen, and might occasionally lead to a resurgence of the disease. It is also interesting in that it suggests that the primary infection caused by the ancestral virus can protect against the disease, and that a vaccine could be developed (at least a vaccine with a significant protective effect, if not a long-lasting one). In collaboration with the Shanghai Genome Centre and the Guangdong SARS Consortium we participated in the analysis of the autumn 2003 SARS episode, showing that the virus is still evolving fast in its human and civet cat hosts. The study was published in 2005. To be more directly involved in studies meant to understand pathogenicity, the Unit is now a member of the European Network of Excellence EuroPathoGenomics.







Summary of the 1998-2003 activity


The Unit of Genetics of Bacterial Genomes focusses on the large-scale approaches derived from the Bacillus subtilis genome program, previously coordinated by the preceding Unit. Created in 1986, the parent structure, named the Regulation of Gene Expression Unit, analysed the nature of heredity, which stores the information required to generate life, and focused on the determination of gene functions in reference genomes, coupling prediction with computers (experiments in silico) to experiments in vivo. The scientists in the Unit investigated how the thousands of genes in the chromosome of a cell co-operate in an organised manner in an ever-changing environment. Their studies have been guided both by the results of experiments in vivo, which allowed them to identify genes placed progressively higher and higher in the hierarchy of genetic controls of the life of the cell, and by the spectacular progress of molecular biology. Two reference micro-organisms were used: Escherichia coli, the most long-standing genetic model, and Bacillus subtilis, a source of numerous enzymes used by industry, often found on the surface of leaves, and abundant in soil. The studies developed in the Unit identify the genes that are critical to the overall adaptation of the bacterium to its environment, and investigate in particular the metabolism of molecules essential for the cell's construction, that of sulfur and polyamines. Large-scale expression profiling experiments, such as two-dimensional gel electrophoresis of all the proteins in the bacteria, were used to describe the co-variations in the concentrations of particular proteins as a function of the growth status of the bacteria, their environment and changes in the genes studied. Combined with the analysis of the whole set of transcripts and of the genome sequences, typical of the new field of research now called "genomics", this approach provided a wealth of information.
It has shown that there are large groups of genes within the cell that are regulated in the same way. A mass of information is generated by sequencing genomes, and many of the newly identified genes are enigmatic in nature. To contribute to their understanding, molecular genetic studies in the Unit are being complemented by research involving the most up-to-date techniques in computer data management, statistics and mathematics.
Specialized databases have been constructed in collaboration with the University Paris 6 (Atelier de Bioinformatique) and the University of Versailles (they are available on the World-Wide Web: SubtiList and Indigo). Biological and naturalist aspects of the work are being emphasised, to identify the major functions of living organisms. In particular, the first analyses of the genomes led to a remarkable observation: the order of genes on the chromosome relates to the cell's architecture. Indeed, the gene order in genomes is not random, and there are experimental hints suggesting that the map of the cell may be directly related to the chromosome structure. The first results of the in vivo, in vitro and in silico investigations aiming at understanding the selection pressure that underlies these architectural constraints suggest the systematic existence of supra-macromolecular complexes. Their components have their genes distributed in a non-uniform way along the chromosome, and they probably constitute structures of 10 to 50 nanometers that form the core of the cell's organization. The work of the Genetics of Bacterial Genomes Unit, backed by work at the HKU-Pasteur Research Centre in Hong Kong, has endeavoured to uncover some of the corresponding rules and constraints. Ongoing work there develops new microbial databases and analyses the global structure of the genome's text.

A genome view of the coordination of gene expression

1. The Bacillus subtilis genome sequence (P. Glaser, MF Hullo, with several students for variable periods of time)

In 1995, the very short genomes of two bacteria had been published by TIGR, while the yeast genome sequence was about to be completed. Started almost ten years earlier, the B. subtilis genome program was well under way, and a BIOTECH grant from the European Union was supporting a consortium of European laboratories for completing the sequence, expected to be finished by the end of 1998. The Japanese consortium was also well on the way. However, it appeared important to speed up our efforts to be present on the international scene at a moment when many laboratories began to be interested in the outcome of genome programs. Together with Frank Kunst, the European coordinator of the program, we decided to speed up the procedure by involving in the sequencing effort laboratories that had been part of the yeast genome program. This was made somewhat difficult because many regions of B. subtilis DNA, as with all A+T rich Gram positives, are impossible to clone in standard E. coli hosts. We therefore combined standard cloning procedures in a special E. coli strain constructed for this purpose (TP611), cloning into B. subtilis itself, and Long Range PCR (without cloning) for the most difficult regions. This gave us the complete genome sequence in April 1997, well before the time expected, and allowed us to distribute it to the members of the consortium. In addition, some regions where we suspected the presence of errors were distributed to the yeast teams so that they would be sequenced again, resulting in excellent accuracy (of course, this was not disclosed to the relevant sequencing groups, to avoid useless conflicts inside an exemplary collaborative effort). The complete sequence was presented at the International Bacillus Meeting in Lausanne in mid-July 1997, and the sequence was made public in parallel with its publication in November of that same year.
At this point a new effort by the same European-Japanese consortium, but under the leadership of the Japanese teams, endeavoured to inactivate all the B. subtilis genes one by one. This effort (at present the only one of its kind in bacteria) was completed and published in 2003. It is the basis of further analysis that provided the international community with extremely important results, as will be presented below.

2. Data bases and genome annotation platforms (Maude Klaerr-Blanchard, Claudine Médigue, Ivan Moszer, Eduardo Rocha, in collaboration with Louis Jones at the Service Informatique Scientifique, and several laboratories external to the Institut Pasteur de Paris)

The sequence and its annotation are displayed in the relational database SubtiList, derived from its prototype for the E. coli genome, Colibri. SubtiList still answers several thousand queries per day, more than two years after the sequence was published.
Annotating a genome is a never-ending process. Indeed, SubtiList is regularly updated, and the last update, just after the Genome 2000 International Meeting at the Institut Pasteur de Paris in April 2000, provided identification for several hundred new genes. To prevent misannotation and the propagation of errors, we have assigned a special code name to all genes that have not been explicitly identified by their function (i.e. experimentally, in vivo or in vitro). In agreement with Amos Bairoch (SwissProt), we chose to have all these gene names begin with the letter "y". The code we have used follows Demerec's rule for gene nomenclature as closely as possible, despite much discussion from the community of B. subtilis scientists, who often stick to old names for purely anecdotal reasons. We think that harmonizing nomenclature is very important for the future of the genetics of genomes.
Careful annotation called for an elaborate approach in terms of computer science. In collaboration with Alain Hénaut and Jean-Loup Risler from the Université de Versailles Saint-Quentin, François Rechenmann from INRIA Rhône-Alpes and Alain Viari from the Atelier de BioInformatique at the University Paris 6, the Unit created an original strategy allowing genome annotation in silico. In this strategy the concept of "neighborhood" has been favoured as a way to help discovery.
This strategy developed a succession of three relatively independent levels. Each level comprised a generic component and a specific component. The goal in creating the process was conceptual. It aimed at the prediction of essential biological functions using the genomic text, together with the associated biological knowledge distributed in scientific publications and data libraries. The precise goal was to identify crucial experiments (to be performed in "wet" laboratories) that could support or falsify the predictions. The approach was initially illustrated (see below) in the case of polyamine metabolism.

The three levels of the process were:
1. sequence data and annotation management: SubtiList and Colibri
2. a platform for sequence annotation: Imagene
3. a platform as a help to discovery (technique of "neighborhoods"): Indigo

Each level was made of a generic computer software engine, together with specific data. The aim of the process was to define a set of three coupled software engines. Each specific application of the process yielded valuable results in the form of the sequences, annotations or predictions it created.
1. SubtiList is constructed from an engine for the management of genomic databases. It is composed of three parts:
1.1. A data scheme structure, GenoList;
1.2. A data base management system (4th Dimension for stand alone applications and Sybase for the WWW database);
1.3. A user interface, possibly with specific procedures for data exploitation (e.g. Blast, Fasta, and other rapid methods for sequence analysis). The interface can be reconstructed knowing simply the World-Wide Web access to it, but this can only be done properly knowing 1.1. This explains why, to our knowledge, there are as yet no equivalent specialized bacterial databases.

To construct other specialized databases (SubtiList, Colibri, TubercuList, PyloriGene, etc.) it was necessary to introduce sequence data and their annotations in the GenoList engine. This input required a generic procedure. The value of the specialized databases comes from two sides: on the one hand from the genericity of the GenoList engine and of its user-friendly and biologically-oriented construction, and on the other hand from the quality of the sequences (above all, of their annotation). This value diminishes as time elapses if annotations are not curated, and increases if they are. Curation of a set of annotations thus adds considerable value, in parallel with the creation of a know-how that is extremely difficult to reproduce. Our Unit is curating the B. subtilis annotations. At the HKU-Pasteur Research Centre several other databases were also developed, in particular LeptoList, which corresponds to the Leptospira interrogans genome programme in which we participated.

2. Imagene is a generic engine which allows management and strategic organisation of both biological objects (sequences, annotations, images, …) and methods for analysis or management, within the same platform. It is meant to make an in-depth analysis of genomes locally, for a fine description of their properties. It has been validated on the specific example of B. subtilis, by permitting identification of all its coding sequences and regions of transcription termination. It was used to predict regions that carried errors due to the sequencing process. These regions were amplified from the chromosome by PCR and resequenced by an independent team.
2.1. The platform was constructed in such a way as to allow one to plug in easily any method for genome analysis (even methods for which the source code is not available, or methods located far away but available through the Internet);
2.2. It allows the chaining of methods, the definition of strategies, and, if needed, the ability to step back during the chaining of methods;
2.3. It possesses a generic visual interface (APIC) permitting one to start and control the progress of methods and of their results. APIC allows one to superimpose the results of entirely independent methods on the same screen. It permits direct access to their results.
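The three properties listed above (pluggable methods, chaining into strategies, and the ability to step back) can be illustrated with a minimal Python sketch. It is not Imagene's actual design or interface: the class `Platform`, its methods `plug_in` and `run_strategy`, and the two toy analysis functions are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical illustration only: these names are NOT the Imagene API.
Method = Callable[[Any], Any]


class Platform:
    """Toy registry in which analysis methods are plugged in by name."""

    def __init__(self) -> None:
        self.methods: Dict[str, Method] = {}

    def plug_in(self, name: str, method: Method) -> None:
        # Any callable can be registered, whatever its origin.
        self.methods[name] = method

    def run_strategy(self, names: List[str], data: Any) -> List[Any]:
        # Chain the named methods; keep every intermediate result so that
        # one can step back to an earlier stage of the chain.
        history = [data]
        for name in names:
            history.append(self.methods[name](history[-1]))
        return history


def clean(seq: str) -> str:
    """Toy method: upper-case the sequence and drop ambiguous 'N' bases."""
    return seq.upper().replace("N", "")


def start_positions(seq: str) -> List[int]:
    """Toy method: 0-based positions of every ATG triplet."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


platform = Platform()
platform.plug_in("clean", clean)
platform.plug_in("starts", start_positions)
history = platform.run_strategy(["clean", "starts"], "ccatgnaatgg")
print(history[-1])   # result of the last method in the chain
print(history[-2])   # stepping back one stage: the cleaned sequence
```

Keeping the full history is one simple way to model "going back" in a chain; a real platform would presumably also record which method produced each result.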

Two special features give added value to Imagene as time elapses. On the one hand, the data can be organised in such a way as to construct efficient specialised strategies for genome annotation. On the other hand, the number of methods for analysis that are plugged in can increase without limitation. If they are accessed through the Internet the engine will know how to start them and recover their results. It will also know how to integrate them into its strategies. As a consequence, the rational integration of old and new methods into new strategies will increase with time and use. Among the methods, data management is a special case: it is quite possible to think about plugging specialised databases into Imagene. This will be seen by Imagene as a special task, data management. One can also think about creating relationships between sequence and annotation data (neighbourhoods, see Indigo, next paragraph). All these features have been included in a totally recreated structure, Geno* (GenoStar), in collaboration with INRIA and two companies, GenomExpress and Hybrigenics.

3. Indigo is a prototype platform used as a help for discovery, meant to find difficult-to-predict neighbours between the various functions related to genes (P. Nitschké, C. Hénaut, in collaboration with P. Guerdoux-Jamet and A. Hénaut).
Indigo is organized in a simple way around a hierarchy of flat files, all centered around gene names, and corresponding to homogeneous classes of features (such as codon usage bias, proximity in the chromosome, in metabolism, in isoelectric point of the gene products, in functional class, in literature articles, etc.). It is clear that many other types of neighbourhoods should be considered as well, including quite elaborate ones. An immediate goal for the improvement of Indigo is to create a data structure for the neighbourhood relationships. The published prototype is only meant to demonstrate the feasibility of the approach. It illustrates the possibility of making interesting discoveries, even with the limited means allocated at present. Indigo is superficially organised like GenoList. It possesses an engine (written in Java) that overlaps with the user interface. It is applied to specific data (at present E. coli, B. subtilis and Arabidopsis thaliana). One must therefore note that, even more than in the case of specialised databases, the value of a specialised Indigo is directly linked to the quality of the data included. The corresponding information results from annotation steps (statistical analysis for example, which could of course be produced by a strategy included in Imagene), but also from the extraction, at this time manual, of literature neighbours. The creation of appropriate files of this type could rapidly acquire great value (if they are not publicly available). Finally, because Indigo is a method, it can in principle be plugged into Imagene.
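The idea of neighbourhoods keyed by gene name can be sketched as follows. This is a hedged illustration in Python, not the Indigo implementation (which, as stated above, is written in Java and reads flat files): the function `shared_neighbours`, the class labels and the placeholder gene names `geneA` to `geneD` are all assumptions made for the example, and the relationships are invented toy data.

```python
from collections import Counter
from typing import Dict, Set

# One table per class of neighbourhood; each maps a gene to its neighbours.
# Toy data with placeholder gene names, not real annotations.
Neighbourhoods = Dict[str, Dict[str, Set[str]]]

toy_data: Neighbourhoods = {
    "chromosome_proximity": {"geneA": {"geneB", "geneC"}},
    "codon_usage_bias": {"geneA": {"geneC", "geneD"}},
    "literature": {"geneA": {"geneB", "geneC"}},
}


def shared_neighbours(data: Neighbourhoods, gene: str,
                      min_classes: int = 2) -> Dict[str, int]:
    """Return the genes that are neighbours of `gene` in at least
    `min_classes` independent classes, with the number of classes."""
    counts: Counter = Counter()
    for table in data.values():
        for other in table.get(gene, set()):
            counts[other] += 1
    return {g: n for g, n in counts.items() if n >= min_classes}


result = shared_neighbours(toy_data, "geneA")
print(sorted(result.items()))
```

Genes that recur as neighbours across unrelated classes are the kind of non-obvious association such a tool is meant to bring to the attention of the biologist; the threshold of two classes is an arbitrary choice for the sketch.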
This set of coordinated approaches has been used to set up an international network financed by the European Science Foundation. This led to the creation of the Genostar consortium.

3. The map of the cell is in the chromosome (I. Moszer, E. Rocha, in collaboration with A. Hénaut and A. Viari)

Knowledge of whole genome sequences is a unique opportunity to study the relationships between genes and gene products at the global level of the cell's architecture. Part of the difficulty of this study comes from the fact that, contrary to a generally accepted intuitive idea, there is often no predictable link between structure and function in biological objects. However, as the outcome of natural selection pressure, there must exist some fit between genes, gene products and the survival of the organism. This indicates that a bias observed in features that one would expect to be unbiased is the hallmark of some selection pressure.
This prompted us to study global properties of complete genomes. A first analysis of the word content of genome texts suggested that they are not all managed in the same way. We therefore concentrated on long exact repeats, and discovered that, in contrast to what could be expected, the shortest genomes (the Mycoplasmas) had the highest repeat frequency. Also, genomes of comparable sizes such as those of E. coli and B. subtilis manage repeats in entirely different ways. Repeats are present everywhere in the former genome, while they are very rare, and in close proximity (ca 10 kb), in the latter. When we then studied the distribution of words, bases or codons in the leading strand as compared to the lagging strand, we made an extremely surprising discovery. There is such a strong bias in one strand as compared to the other (the leading strand is G+T-rich, while the lagging strand is A+C-rich) that the bias is reflected in the amino acid composition of the proteins encoded by each strand (valine-rich for the leading strand, isoleucine+threonine-rich for the lagging strand)! This bias is not present in all genomes (it seems to be absent from genomes of bacteria having an important proportion of membranes, such as the methanogens or the cyanobacteria), but, when present, it is universally the same.
Among other consequences, all these observations tell us that genes do not move as frequently, or as easily, as is often implicitly assumed. There must exist, therefore, constraints on the gene organisation of a chromosome.
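The strand composition bias described above can be made concrete with a small Python sketch. It only shows how a G+T fraction is counted and why a G+T-rich strand necessarily faces an A+C-rich complementary strand; it is not the statistical analysis used in the Unit, and the function names and the eight-base toy sequence are assumptions made for illustration.

```python
def keto_fraction(seq: str) -> float:
    """Fraction of G+T among the unambiguous bases of a DNA sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "GT" for b in bases) / len(bases)


def reverse_complement(seq: str) -> str:
    """Sequence of the complementary strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy sequence standing in for a stretch of a leading strand.
leading = "GGTTGTAC"
lagging = reverse_complement(leading)

print(keto_fraction(leading))   # G+T fraction on the given strand
print(keto_fraction(lagging))   # G+T fraction on the complementary strand
```

Because G pairs with C and T with A, the G+T fraction of one strand equals the A+C fraction of the other, so the two printed values always sum to 1. On a real chromosome one would compute such fractions separately for the regions on each side of the replication origin, which requires knowing where the origin and terminus lie.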
Because the genetic code is redundant, coding sequences can be studied by analysing their codon usage. If there were no bias, all codons for a given amino acid should be used more or less equally. In contrast, it has long been observed in E. coli that genes could be split into three classes according to the way they use codons. The same was true for B. subtilis. Yet, random mutations should somehow smooth out differences. This is not the case: indeed, for leucine, where six codons are used, we find that the CUG codon is used more than 70% of the cases in genes that are expressed at a high level during exponential growth conditions, while CUA is expressed in less than 2% of the cases. What is the source of such biases? There might exist a systematic effect of context, some DNA sequences being favoured or selected against. While this could be true for some codons, this cannot be generalized. We know that translation of mRNA into proteins requires the action of transfer RNA adaptor molecules. Because there is less tRNAs specific for a given amino acid than the number of codons, some tRNAs must read several codons. A bias in the concentration of tRNAs might thus result in a bias in codon usage. Therefore we must analyse selection pressure occuring at the level of tRNA synthesis. This is the generally accepted reason to account for the codon usage biases. Unfortunately, two reasons go against this interpretation. Firstly, in much the same way as that there would be all reasons to smooth out biases in codon usage, similar constraints would smooth out biases in tRNA synthesis. For example if a tRNA gene had a strong promoter, spontaneous mutations would tend to lower its efficiency, making transcription of this particular tRNA similar to its other counterparts. This is true, unless there is selection pressure for the converse. 
The second reason is that, while the strong bias in a given class of genes could be explained in this way, the same explanation cannot hold for a strong bias in another class of genes. However, we know from the study of both the E. coli and B. subtilis genomes that two classes of genes display extremely strong, but different, biases. And the same tRNA molecule cannot be both expressed at a high level and not expressed at a high level…
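The codon usage bias discussed above can be quantified for any coding sequence by counting how often each synonymous codon is used. The following is a minimal sketch in Python, given here only as an illustration: the function name codon_fractions and the toy sequence are hypothetical, and the sequence is not taken from any real gene.

```python
from collections import Counter

# The six leucine codons of the standard genetic code (DNA alphabet).
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def codon_fractions(cds, codons=LEUCINE_CODONS):
    """Return the relative usage of each synonymous codon in a coding sequence.

    The sequence is read in frame from its first nucleotide; RNA input
    (with U) is converted to the DNA alphabet first.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    triplets = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(t for t in triplets if t in codons)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


# Hypothetical toy sequence (not a real gene): ATG CTG CTG CTG CTA CTG TTA AAA
toy_cds = "ATGCTGCTGCTGCTACTGTTAAAA"
fractions = codon_fractions(toy_cds)
for codon, fraction in fractions.items():
    print(codon, round(fraction, 3))
```

Applied to sets of genes rather than to a single sequence, the same counting would show whether one codon (such as CUG in highly expressed E. coli genes) dominates the others.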
This requires looking for another explanation. The cytoplasm of a cell is not a tiny test tube. One of the most puzzling features of the organisation of the cell cytoplasm is that it must accommodate the presence of a very long thread-like molecule, DNA, and that this molecule must be transcribed into a multitude of RNA threads that usually have a length of the same order of magnitude as the length of the whole cell. This calls for some organisation of transcription, translation and replication, so that mRNA molecules and DNA are not mixed up together all the time. The volume occupied by a ribosome is roughly that of a cube with a 200 Å edge. In an E. coli cell growing exponentially in a rich medium there are at least 15,000 ribosomes. Thus, the fraction of the cell volume occupied by ribosomes is at least 12%. The free volume of the cell is in fact significantly smaller still if one takes into account the volume occupied by the chromosome and by the transcription and replication machineries. If one also considers that the translation machinery requires an appropriate pool of elongation factors, tRNA synthetases and tRNAs, it becomes clear that the cytoplasm behaves like a gel. In addition, simply counting the number of tRNA molecules sitting around a ribosome shows that one cannot speak of the concentration of such molecules, but only of a small, finite number. Compartmentalisation has been demonstrated to be important even for small molecules, despite the fact that they can diffuse quickly. As a consequence, a translating ribosome acts as an attractor of a certain pool of tRNA molecules. In such a case diffusion should only be considered locally. The cytoplasm therefore becomes a ribosome lattice, displaying relatively slow movements with respect to the local diffusion of small molecules as well as macromolecules.
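The 12% figure quoted above can be checked with a short calculation. The sketch below is only a worked illustration: the ribosome count and the 200 Å cube edge come from the text, whereas the cell volume of 1 µm3 is an assumption introduced here (it is the value for which the stated numbers give exactly 12%), and the function name is hypothetical.

```python
ANGSTROM_IN_UM = 1e-4  # 1 Å = 1e-10 m = 1e-4 µm


def ribosome_volume_fraction(n_ribosomes=15_000,
                             edge_angstrom=200.0,
                             cell_volume_um3=1.0):
    """Fraction of the cell volume filled by ribosomes modelled as cubes.

    n_ribosomes and edge_angstrom are the values given in the text;
    cell_volume_um3 = 1.0 is an assumed cell volume, not a value from the text.
    """
    edge_um = edge_angstrom * ANGSTROM_IN_UM      # 200 Å = 0.02 µm
    ribosome_volume_um3 = edge_um ** 3            # 8e-6 µm3 per ribosome
    return n_ribosomes * ribosome_volume_um3 / cell_volume_um3


fraction = ribosome_volume_fraction()
print(f"Ribosomes occupy about {fraction:.0%} of the cell volume")
```

A smaller assumed cell volume, or a larger number of ribosomes, would make the occupied fraction correspondingly larger, which is consistent with the "at least" wording of the estimate.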
This provides an efficient selection pressure leading to adaptation of the codon usage of the translated message as a function of its position in the cell's cytoplasm. If the codon usage changes from mRNA to mRNA, this indicates that these different molecules do not see the same ribosomes in the usual life cycle of the organism. In particular if two genes have very different codon usage this indicates that the corresponding mRNAs are not made from the same part of the cell (it is indeed difficult to see how ribosomes sitting next to each other could attract different tRNA molecules).
Several models of transcription account for a process where the transcribed regions are present at the surface of the chromoid, so that RNA polymerase does not have to circle the double helix it is unwinding and transcribing. Thus mRNA threads, usually structured at their 5' end, are pulled off DNA by the lattice of ribosomes, going from one ribosome to the next, as does a thread in a wiredrawing machine (this is exactly the opposite of the textbook view of translation, where ribosomes are supposed to travel along fixed mRNA molecules). In this process a nascent protein is synthesized on each ribosome, and is spread throughout the cytoplasm by the linear diffusion of the mRNA molecule from one ribosome to the next, avoiding the requirement for the much slower 3D diffusion of the protein. Polycistronic operons ensure that proteins with related functions are co-expressed locally, permitting channelling of the corresponding substrates and products. It seems likely that the structure of mRNA molecules is coupled to their fate in the cell, and to their function in compartmentalisation. The fate of mRNA is therefore an important feature of gene regulation. We have therefore investigated the degradation process of mRNAs, comparing data extracted from the genomes of B. subtilis and E. coli. This led us to identify a main function of the elusive enzyme polynucleotide phosphorylase, namely producing the CDP needed for DNA synthesis, thus coupling translation, transcription and replication together. If we consider genes translated sequentially in operons as physiologically and structurally relevant, we should also analyse mRNAs that are translated parallel to each other. Indeed, if there is correlation of function and/or localisation in one dimension, there should also exist a similar constraint in the orthogonal directions. How would this be seen? This is where codon usage comes in again.
Indeed if ribosomes act as attractors of tRNA molecules, this implies a local coupling between these molecules and the codons they can use in the message they read. Obviously, this requires that the same ribosome mostly translates mRNAs having similar codon usage. This has the consequence that as one goes away from a strongly biased ribosome, there is less and less availability of the most biased tRNAs. In turn, there would be selection pressure for a gradient of codon usage bias as one goes away from the most biased messages and ribosomes. Transcripts are nested around central core(s), formed of transcripts for highly biased genes. This fits with what is seen of the general organisation of genes in the chromosome. In particular this agrees with the observation that the distance between E. coli genes oriented in the same direction on the chromosome is positively correlated to the expression level of the downstream gene.
Finally, the chromosomes must separate from each other and migrate into each of the daughter cells. There must exist some kind of repulsive force that pushes DNA strands away from each other. While there are probably gene products involved in this process, ribosome synthesis, in particular from regions near the origin of replication, performs exactly what is needed, by continuously creating new ribosomes. Continuous synthesis of ribosomes in between the replication forks would also provide a mechanical stress on the bacterial wall in the middle of the cell. Koch has convincingly argued that the bacterial wall is indeed a stress-bearing fabric. If ribosome sources are organisers of the cell, mRNAs for genes highly expressed under exponential growth conditions should be located near the center of these organisers, while other mRNAs should be translated in nested layers, all the way to the ribosomes that are located near the cytoplasmic membrane, and that would be involved in cotranslational membrane protein localisation. Organisation of the genes in the chromosome should therefore show regularities that are linked to this architecture, as we have indeed observed. This gives us strong reasons to propose that genes along the chromosome specify the map of the cell, a kind of celluloculus.

A geneticist's view: master genes and intermediary metabolism

1. Cyclic AMP and adenylate cyclases: the discovery of a fourth cyclase class (M.-P. Coudart-Cavalli, P. Trotot, P. Biville, O. Sismeiro)

Cyclic AMP is a mediator of catabolite repression in bacteria. Curiously, despite the interest in this important process, not much was known about the rather elusive enzymes, adenylate cyclases, which make this molecule from ATP. By 1996, work in the Unit had already uncovered three main classes of these enzymes, which were apparently unrelated phylogenetically. Very remarkably, this work demonstrated that Gram-negative bacteria could differ in the nature of the adenylate cyclase they harboured: enterobacteria had one type, while myxobacteria or rhizobia had another type (presumably a more ancestral form, since it is phylogenetically similar to the enzymes found in Eukarya). In the course of a screen for adenylate cyclases in bacteria related to, but distinct from, enterobacteria, we made the surprising discovery that A. hydrophila harboured a fourth adenylate cyclase type, an enzyme closely related to proteins found in Archaea. This protein was found in all strains of A. hydrophila investigated, but not in other Aeromonas sp. The counterpart of the gene was found in the Y. pestis genome, and shown to express adenylate cyclase activity (unpublished). The reason for this extraordinary variety of adenylate cyclases is not known.

2. Global analysis of the H-NS protein function (P. Bertin, F. Hommais, O. Soutourina, C. Tendeng and several trainees)

To study the global regulation of bacterial metabolism, in particular in pathogenic microorganisms, we used the hns mutation in Escherichia coli as a reference system. Indeed, the H-NS protein is known to be involved in numerous functions in the cell and to affect the expression of genes regulated by environmental factors (temperature, osmolarity, ...). Three main topics have been developed since 1996.
Motility and/or flagellum biosynthesis have frequently been associated with virulence in various microorganisms. In enterobacteria, this process requires the expression of numerous genes scattered on the chromosome and organised in an ordered cascade. The fliC mRNA coding for flagellin and the FliC protein itself are absent in an hns mutant, which results in a loss of motility. Moreover, using transcriptional fusions, we showed that an hns mutation results in a 3-fold decrease in the expression of flhDC, the master operon which controls all other flagellar genes. This was the first example of positive control by H-NS so far described. Similar observations were made in a crp mutant, providing evidence that, like H-NS, the cAMP/CAP complex acts as an activator of flagellar gene expression. To determine whether these regulators could affect flhDC expression by interacting with its promoter, we performed gel shift experiments using purified proteins. The results demonstrated that the flhDC promoter region is preferentially retarded in the presence of H-NS or CAP. Moreover, DNase footprinting experiments allowed us to determine precisely their binding sites on the flhDC regulatory region. In vitro transcription assays were performed in collaboration with S. Rimsky and A. Kolb (Unité de Physico-Chimie des Macromolécules Biologiques). Surprisingly, H-NS seems to repress flhDC transcription while the cAMP/CAP complex activates its expression. Finally, in a crp mutant, motility is restored in the presence of wild-type CAP protein but not in the presence of a protein mutated in region I, which is involved in the interaction with RNA polymerase. This suggests that the cAMP/CAP complex positively regulates flagellum synthesis by a direct interaction with the C-terminal part of the RNA polymerase α subunit. In contrast, the binding of H-NS to the same region cannot explain the positive control it exerts in vivo on flagellum synthesis.
In this respect, the existence of a long non-coding region between the +1 transcriptional start site and the ATG translational start codon seems to play a crucial role in the control of the master operon by H-NS. Finally, to determine whether a similar mechanism of flhDC regulation could be extrapolated to other organisms, we analysed the promoter region of a homologous operon recently identified in Photorhabdus luminescens, using a method allowing direct determination of the nucleotide sequence from genomic DNA. Our results demonstrated the presence of a cAMP/CAP binding site and of a non-translated region (unpublished observations). This suggests that, in this organism, the mechanism of flhDC regulation could be similar to that in E. coli.
The pleiotropic effect of the hns mutation led us to analyse the role of H-NS in bacterial physiology using large-scale technologies. In collaboration with C. Laurent-Winter (Laboratoire de Physico-Chimie des Macromolécules) and J.P. LeCaer (Laboratoire de Neurobiologie et Diversité Cellulaire, ESPCI, Paris), we demonstrated by two-dimensional gel electrophoresis that the synthesis and/or the accumulation of about 60 proteins was specifically altered in an hns mutant. Many of them were identified by microsequencing or by mass spectrometry. They were found to be involved in the bacterial response to various stresses (pH, osmolarity, ...). Moreover, to study the global effect of H-NS on gene expression in E. coli, we analysed, in collaboration with A. Malpertuy (Unité de Génétique Moléculaire des Levures), the transcriptome of an hns strain using DNA arrays. These experiments showed that the expression level of 200 genes was modified in the mutant strain (unpublished). Again, most of them are known to be involved in stress response. In particular, the high expression level of several genes induced by high osmolarity or low pH resulted in a strongly increased resistance to both stresses in the hns strain. Moreover, many H-NS target genes of unknown function were predicted to encode fimbriae, which could play a major role in virulence processes. These observations provide evidence that an hns mutation cannot be simply considered as a loss of function but can provide a selective advantage to the cell under some stressful conditions. Finally, these observations suggest that the main role of hns could be to control proton availability in the periplasm of many Gram-negative bacteria.
Until recently, H-NS had only been characterised in enterobacteria. In collaboration with S. Goyard (Unité de Biochimie des Régulations Cellulaires), an H-NS-like protein was identified in Bordetella pertussis, the aetiological agent of whooping cough. Its structural gene was isolated and sequenced. Its product showed significant similarity with H-NS, in particular in the C-terminal domain. Moreover, the screening of databases allowed us to identify a related protein in Rhodobacter capsulatus. In silico analysis of their amino acid sequences (secondary structure prediction, presence of hydrophobic clusters, ...) in collaboration with R. Brasseur (Centre de Biophysique Moléculaire Numérique, Gembloux, Belgium) suggests that these proteins are structurally related. Moreover, amino acid sequence alignment demonstrated the existence of a consensus in their DNA binding domain. The structural genes of these proteins were cloned after PCR amplification and the proteins were expressed in an hns strain of E. coli. These experiments showed that all proteins are able to complement the phenotypic alterations of such a strain (loss of motility, reduction in growth rate, serine susceptibility, ...). Gel retardation experiments performed with purified proteins revealed a preferential binding to curved DNA similar to that of H-NS. Cross-linking experiments showed that, despite a low amino acid conservation in their N-terminal domain, these proteins are able to dimerise in vitro. These observations are the first demonstration that proteins structurally and functionally related to H-NS are widespread in Gram-negative bacteria. Moreover, by complementation of the serine susceptibility of hns mutants in E. coli, we recently isolated and characterised an hns-like gene in Vibrio cholerae, the agent of cholera. Similarly, in collaboration with P. Glaser (Laboratoire de Génomique des Microorganismes Pathogènes), we identified two H-NS-like proteins in P. luminescens, an entomopathogenic bacterium whose genome sequencing is currently in progress at the Pasteur Institute. These results further support the existence of a large family of H-NS-like proteins in microorganisms.

3. Pyrophosphate effects on Escherichia coli: a link with iron metabolism (F. Biville, E. Turlin, M. Perrotte, C.-K. Wun, and several trainees)

In the course of the study of cAMP synthesis in E. coli, the effect of pyrophosphate, a product of the reaction producing cAMP from ATP, was investigated. A first series of experiments demonstrated that, in a phosphate-rich minimal medium, pyrophosphate had a surprising growth-stimulating effect. This effect resulted in a significant modification of the expressed proteome pattern of the cells. This could not be due to phosphate starvation, and the first hypothesis which came to mind was that energy from the energy-rich bond of the molecule was somehow recovered by the cell. However, all experiments meant to explore this hypothesis were unsuccessful. In particular, the non-hydrolysable analog methylene diphosphate had an effect similar to that of pyrophosphate. Analysis of the metabolic activities which varied upon pyrophosphate addition suggested that the tricarboxylic acid cycle was somehow involved. Further exploration demonstrated that the pyrophosphate effect is mimicked by addition of excess iron to the medium. This demonstrated first that, even in a medium supplemented with 5 mM iron, there is still some iron deficiency in a phosphate-rich minimal medium, and, second, that the pyrophosphate molecule somehow helps the cell to scavenge existing iron from the environment in a way which permits it to thrive on a low iron level (M. Perrotte thesis). Work in progress demonstrates that a phosphorelay system (two-component regulator) of unknown function is involved in this process. Once unravelled, this will add interesting information on a set of genes of unknown function in the genome of E. coli and will contribute to improving its annotation.

4. Functional analysis of the B. subtilis genome: polyamines and sulfur metabolism (JY Coppée, P. Glaser, M.-F. Hullo, I. Martin-Verstraete, E. Presecan, A. Sekowska, C.-K. Wun)

Among the aims of functional genome analysis is the possibility of rapidly reconstructing entire metabolic pathways. This cannot be done using in silico analysis alone, because many proteins have a common descent. As a result, related activities often share similar sequences (e.g. a decarboxylase specific for a given amino acid must be similar to its counterpart specific for another amino acid). We have therefore constructed relatively rapid tests on plates with molecules or ions that could help us to trace as efficiently as possible genes involved in integrated metabolic pathways. Amino acid metabolism is not well described in B. subtilis, and although quite a few gene similarities point to expected enzyme activities, it is necessary to validate the hypotheses derived from these similarities. We used amino acid analogs or certain types of antibiotics as a way to achieve this goal. In addition, we set up several growth condition tests (in particular for swarming or gliding on plates) to test for more subtle phenotypes (A. Sekowska, thesis dissertation).
In the course of this systematic analysis, we noted the importance of intermediary metabolism activities. In particular, polyamines, although dispensable under routinely used laboratory growth conditions, are extremely important for the cell. They are involved in macromolecular syntheses, and in particular in modulating the accuracy of translation, at steps which may be essential for survival of the cell populations. Their importance is reflected by the fact that their biosynthesis is energetically costly. This is especially true for the larger molecules, such as spermidine, spermine and their analogues. In particular, spermidine synthesis requires S-adenosylmethionine (AdoMet) as a precursor. Surprisingly, AdoMet is not used as such in the reaction but is first decarboxylated to 3-aminopropyl-S-adenosine (dAdoMet). The aminopropyl moiety of the substrate is subsequently transferred onto one of the amino-terminal ends of putrescine, to generate spermidine. A further transfer onto spermidine yields spermine in some organisms.
Transamination and decarboxylation are ubiquitous steps in intermediary metabolism. They are generally achieved by enzymes carrying pyridoxal phosphate as a co-enzyme. However, a noteworthy feature of the known AdoMet decarboxylation reaction is that it is achieved by an enzyme carrying not a pyridoxal but a pyruvoyl group as the catalytic residue. Pyruvoyl enzymes perform a limited number of varied decarboxylation reactions, including the decarboxylation of AdoMet in Eukarya and Gram-negative bacteria. Combining gene disruption experiments and biochemical identification of polyamines, we unravelled the main features of polyamine biosynthesis in B. subtilis, showing that the predominant pathway proceeds from arginine via agmatine. We also observed that, in contrast to E. coli, B. subtilis does not maintain a significant intracellular pool of putrescine under conditions where the level of spermidine is similar to that found in E. coli. We further identified the pathway leading to the addition of an N-propylamine group to putrescine, creating spermidine. This reaction yields the sulfur-containing molecule methylthioadenosine (MTA) as a by-product. We identified the nucleosidase encoded by the mtn (yrrU) gene as the first enzyme implicated in its recycling. By gene disruption, in vitro mutagenesis, cell-free protein synthesis and biochemical analysis of polyamines, we showed that the unknown gene ytcF, renamed speD, codes for the decarboxylase. Analysis of the phylogenetic relationships among bacterial enzymes demonstrated that the B. subtilis enzyme is very similar to several predicted proteins of unknown function from Archaea. The MJ0315 gene, which presumably encodes an AdoMet decarboxylase of Methanococcus jannaschii, was used to complement B. subtilis ytcF and E. coli speD mutants and was expressed in a cell-free system. We could thus identify, for the first time, the nature of the corresponding gene and protein in Archaea.
While the number of genome sequences increases exponentially, it remains difficult to identify gene functions explicitly. Automatic annotation procedures rest mostly on sequence comparisons. They are used to build phylogenetic trees, where reference activities are assumed to spread to neighbours by contiguity. The corresponding functions are thus described tentatively as identical to that of the known reference. However, these methods do not address the central question of enzyme recruitment for new activities. Furthermore, genes and proteins are not simply sequences of letters: they are made from chemicals deriving from the cell metabolism, and a single gene alteration may result in a general base or amino acid content bias, changing the "style" of an organism, possibly altering its place in calculated phylogenies, and thus leading to wrong assignments of enzyme activities. Ouzounis and Kyrpides constructed an interesting evolutionary tree of agmatinases, with emphasis on their universal presence. Since this seminal work, many new sequences have been obtained and annotated by their similarity with the known sequences. We undertook a comparative analysis of the corresponding set of sequences. Genes that were deemed important were cloned and attempts were made to identify their functions. We first considered the usual types of phylogenetic trees constructed on the variation of the amino acid sequence in these proteins, without taking into account the presence of gaps in the sequences. Several discrepancies with respect to the expected position of some organisms in the trees were found. In a second approach, we reconstructed trees based only on the presence and evolution of gap-containing regions in the sequences, because gaps would be much less sensitive to genetic drift or amino acid metabolism. The crucial enzyme activities that presumably evolved from ancestral ureohydrolases were validated by cloning, expressing and measuring the activity of the corresponding enzymes.
The emerging picture is consistent with a bacterial origin of hydrolases (ureohydrolases and related activities), which later evolved to those of the Archaea and the Eukarya. Our experiments therefore validate the use of gap-trees in the prediction of gene function.
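To give a concrete idea of what comparing sequences on their gaps alone can mean, the sketch below computes a pairwise distance that depends only on where gaps fall in a multiple alignment. It is an illustrative assumption, not the procedure actually used in this work: the function names, the distance definition (fraction of alignment columns where only one of the two sequences carries a gap) and the toy alignment are all hypothetical.

```python
from itertools import combinations


def gap_profile(aligned_seq, gap_char="-"):
    """Encode an aligned sequence as a tuple of booleans (True where a gap is)."""
    return tuple(ch == gap_char for ch in aligned_seq)


def gap_distance(seq_a, seq_b):
    """Fraction of alignment columns where exactly one sequence has a gap.

    Amino acid identities are deliberately ignored: only the gap pattern counts.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("aligned sequences must not be empty")
    profile_a, profile_b = gap_profile(seq_a), gap_profile(seq_b)
    differing = sum(x != y for x, y in zip(profile_a, profile_b))
    return differing / len(profile_a)


# Hypothetical toy alignment (not real ureohydrolase sequences).
toy_alignment = {
    "seqA": "MKT--LLGA",
    "seqB": "MRT--LIGA",
    "seqC": "MKTAVLL-A",
}

distances = {
    (a, b): gap_distance(toy_alignment[a], toy_alignment[b])
    for a, b in combinations(sorted(toy_alignment), 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

In this toy case seqA and seqB differ in their residues but share the same gap pattern, so their gap-based distance is zero, whereas seqC stands apart. A distance matrix of this kind could then be passed to any standard tree-building method.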
All this work prompted us to analyse the related metabolism of sulfur (A. Sekowska, thesis dissertation and review), still poorly described in most organisms, and this has been a central area of the research in functional genomics developed in the following years both in Paris and at the HKU-Pasteur Research Centre.