Genetics of Bacterial Genomes
Three major revolutions have marked the development of biology in the past decades. Molecular biology, based on the study of microbes and their viruses, the bacteriophages, uncovered the rules of gene expression and the way expression is controlled in time and in space. Replication of DNA, transcription into messenger RNA molecules and translation into proteins using the universal genetic code unified our view of what life is. In parallel, extraordinary technological developments in computer power, combined with the sudden access to the very nature of genes through sequencing, triggered the onset of the genomics revolution just two decades ago. While microbes still play a seminal role there, much emphasis was placed on the Human Genome and on analyses based on the study of nucleated cells. Simple eukaryotes such as budding and fission yeasts, as well as the old Drosophila model and the new nematode model, Caenorhabditis elegans, entirely renewed our way of considering the ontogenetic development of multicellular organisms. All the work pertaining to these organisms concentrated on concepts of regulation cascades (thousands of papers are devoted to cyclic AMP, protein phosphorylation, central regulators such as NF-kappaB...), showing that science is very much subject to fashion, as can be (unfortunately?) expected from a community of some 2.5 million persons. In this context a few new concepts emerged, among which apoptosis (the cell suicide behaviour that is at the root of the organization of multicellular organisms into complicated patterns, and that also controls some protective behaviour against aging and anarchic multiplication) played a central role. An overlooked family of macromolecules, small untranslated RNAs (labelled under a variety of names: sncRNA, ncRNA, suRNA and their avatars, microRNAs, etc.), was suddenly placed in the limelight.
As a matter of fact, the lack of interest in RNA molecules other than ribosomal RNAs and tRNAs as essential components of the cell's biology probably goes back to the very origin of the discovery of the lactose operon. Jacob and Monod, in their seminal paper on the control of lactose operon expression, postulated that a regulator should exist, and that this regulator would best be an RNA molecule. It was rapidly shown that the lactose repressor, LacI, was in fact a protein, and the RNAs that continued to be discovered were thought of as anecdotal features of gene expression. The discovery of RNA splicing, aptamers, ribozymes, riboswitches, microRNAs and the like placed RNA back at the center of gene expression a decade ago, and RNA metabolism is now a quite fashionable (and important) research topic.
It is against this background that, a few years ago, an engineering view of biology combining genomics and integrative biology (Symplectic Biology, a part of which is known as "Systems Biology") triggered the start of a new revolution, Synthetic Biology. The work of the Unit, Genetics of Bacterial Genomes, will develop in this context over the next few years.
Summary of the 2003-2006 activity
The revolution of genomics that has recently transformed biology is steadily producing spectacular new discoveries. While the mass media tend to concentrate almost exclusively on the “Human Genome” project, it is more and more obvious that we will not be able to understand much of this genome unless we possess the thorough knowledge that comes from the study of powerful models, microbes in particular. This is the basic reason that drives all the major Genome Centres in the world to develop the study of microbial genomes, and it is at the core of the effort coordinated by the European Network of Excellence BioSapiens.
As cases in point, the two major discoveries derived from genome studies in the past fifteen years stemmed from the analysis of two microbial genomes: that of Saccharomyces cerevisiae and that of Bacillus subtilis (with a major contribution from the Unit of Regulation of Gene Expression, which preceded the present Unit). The first one demonstrated that a fair number of genes are not fixed, belonging to a given organism, but tend to propagate from organism to organism (“horizontal” gene transfer, a concept for which we provided strong experimental evidence as early as 1991). The second one was totally unexpected: it showed that a very high proportion of the genes present in a genome, whatever the organism, do not have a known function. This is all the more surprising because we now know more than two thousand genome sequences. In this context, the work in the Unit, initially in collaboration with the first program of the HKU-Pasteur Research Centre (created in Hong Kong in 2000), consists in exploring these unknown functions. To this aim, experimental work at the bench is combined with work in silico (using computer programmes), to perform conceptual experiments that serve as references and predictions for experiments performed at the bench. The central question explored in the Unit is whether, and if so why, genes are not distributed randomly in the chromosomes. It is obvious that the accidents that occur continuously during reproduction lead genes to be modified, to disappear, or to change place. One would therefore expect that, after some time, a more or less random distribution of the genes in genomes would prevail. However, the very idea that founded conceptual genomics, derived from the idea of the “genetic programme”, is that a cell behaves more or less as a computer does, where the machine is truly separated from the data and programmes it works with.
However, one knows that a computer is not able to duplicate itself. What else is therefore needed? John von Neumann, in his work on self-reproducing automata in the late 1940s, made the hypothesis that, if this were to happen, then one should find somewhere an image of the machine. This drives the quest of the Unit: its scientists try to find out whether the cell and its programme are organized structures. In concrete terms, is the order of the genes in the genome random? And, in parallel, where in the cell are the gene products located? Does one find them everywhere? Finally, what are the driving forces that lead to gene organisation in genomes?
An important part of the work in the Unit therefore consists, on the one hand, in organizing the data that make up biological knowledge (Ivan Moszer, and construction of the GenoList databases, until the time when he left the Unit to set up a new axis at the Genopole of the Institute, the Technology Platform N°4 at the Genopole, and the BIOSUPPORT programme at the HKU-Pasteur Research Centre), and, on the other hand, in analysing genome structure (Eduardo Rocha and scientists from the HKU-Pasteur Research Centre). The most surprising discovery made during 2003 was that the genes that are essential to the life of bacteria are preferentially located on the replication leading strand of the DNA double helix, and that this is not directly correlated with a high level of expression. This is accounted for by the absence of conflict between transcription and replication for these genes: the collisions that occur when genes are located on the lagging replication strand must often create truncated messenger RNAs, and hence truncated proteins. This discovery further indicates that the products of the genes that are essential under laboratory conditions systematically belong to complexes formed by the association of several proteins, for one could hardly explain the toxicity of a truncated product unless it destroys the complex it is part of (think of a building with truncated beams!).
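The strand bias described above can be checked on any annotated circular chromosome with a few lines of code. The following Python sketch is only an illustration, not the analysis actually performed in the Unit: the `Gene` record, the function names and the toy coordinates are all assumptions introduced here. It assumes a single origin and a single terminus of replication, and counts a gene as being "on the leading strand" when it is transcribed in the direction in which the replication fork travels through its replichore.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record (not a real database schema)."""
    name: str
    position: int    # coordinate of the gene start on the circular chromosome
    strand: int      # +1 or -1, as in standard annotation files
    essential: bool


def on_leading_strand(gene: Gene, origin: int, terminus: int,
                      genome_length: int) -> bool:
    """True if the gene is transcribed in the same direction as the fork."""
    # Distances measured in the direction of increasing coordinates,
    # starting from the origin of replication.
    to_gene = (gene.position - origin) % genome_length
    to_terminus = (terminus - origin) % genome_length
    # Between origin and terminus the fork moves toward increasing
    # coordinates, so +1 genes are co-oriented; on the other replichore
    # the fork moves the opposite way and -1 genes are co-oriented.
    co_oriented_strand = 1 if to_gene < to_terminus else -1
    return gene.strand == co_oriented_strand


def leading_fraction(genes: Iterable[Gene], origin: int, terminus: int,
                     genome_length: int) -> float:
    """Fraction of the given genes that lie on the leading strand."""
    genes = list(genes)
    if not genes:
        raise ValueError("no genes given")
    hits = sum(on_leading_strand(g, origin, terminus, genome_length)
               for g in genes)
    return hits / len(genes)


# Invented toy data: a 1000 bp chromosome, origin at 0, terminus at 500.
TOY_GENES = [
    Gene("a", 100, +1, True),
    Gene("b", 400, -1, True),
    Gene("c", 600, -1, True),
    Gene("d", 900, +1, False),
    Gene("e", 700, -1, False),
]

if __name__ == "__main__":
    essential = [g for g in TOY_GENES if g.essential]
    others = [g for g in TOY_GENES if not g.essential]
    print("essential:", leading_fraction(essential, 0, 500, 1000))
    print("other:    ", leading_fraction(others, 0, 500, 1000))
```

On a real genome one would compare the two fractions for essential and non-essential genes, and repeat the comparison within classes of similar expression level to separate the effect of essentiality from that of expression.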
In parallel, the Unit participated in the deciphering of the complete genome sequence of three bacteria: Leptospira interrogans (in collaboration with the Genome Centre in Shanghai), bacteria that are particularly dangerous and infect peasants working in rice paddies; Staphylococcus epidermidis (collaboration with the same Centre and Fudan University in Shanghai), bacteria present in the environment and important for nosocomial infections; and Photorhabdus luminescens (sequenced in the Laboratory of Pathogenic Microorganisms Genomics in the Institute), highly virulent against insects, including mosquito larvae (Jean-François Charles, Sylviane Derzelle, Evelyne Krin and their coworkers). Our involvement in deciphering genomes led us to participate in the European Union Network of Excellence BioSapiens. Year 2005 was marked by the deciphering of the genome sequence of a bacterium from Antarctica, Pseudoalteromonas haloplanktis TAC125, in collaboration with the Genoscope and the Universities of Liège (Belgium), Naples (Italy) and Stockholm (Sweden). Many new rules of genome and proteome organization were discovered in the Unit during 2004-2006, and we refined the concept of gene essentiality by extrapolating from the way genes are conserved in sequence and in distribution across large sets of genomes. “Persistent” genes define essentiality in natural conditions of life. In the same way, the amino-acid composition of proteins follows rules that tell much about the origin and function of gene products. We propose that aromatic amino acids create a universal bias in some proteins. Expressed orphan proteins are enriched in these residues, suggesting that they might participate in a process of gain of function during evolution. We postulate that the majority of them are proteins (“gluons”) involved in stabilising complexes, thus defining the “self” of the species.
Finally, as a further demonstration that there is indeed a strong organising principle in genomes, we were able to prove, in collaboration with Massimo Vergassola, a conjecture that we had made a long time ago: the very process of translation organises bacterial genomes. Indeed, genomes are a patchwork of long stretches of DNA in which genes are linked by a common codon usage bias.
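To make the notion of a shared codon usage bias concrete, the short Python sketch below compares the codon frequency profiles of two coding sequences. It is a deliberately simple illustration and not the method used in the work with Massimo Vergassola: the function names are invented here, the distance is a plain total variation distance over raw codon frequencies, and no normalisation per amino acid (synonymous codon usage) is attempted.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("sequence must be non-empty and a multiple of 3 long")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds), 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(cds_a: str, cds_b: str) -> float:
    """Total variation distance between two codon frequency profiles.

    0.0 means identical profiles, 1.0 means no codon in common.
    """
    freq_a = codon_frequencies(cds_a)
    freq_b = codon_frequencies(cds_b)
    codons = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0))
                     for c in codons)


if __name__ == "__main__":
    # Toy sequences: two alanine codons, GCT and GCC, used differently.
    print(usage_distance("GCTGCTGCT", "GCTGCTGCT"))
    print(usage_distance("GCTGCC", "GCTGCT"))
    print(usage_distance("GCTGCT", "GCCGCC"))
```

Applied to neighbouring genes along a chromosome, small distances over long stretches would be the kind of signal one expects if genes sharing a codon usage bias are clustered; real analyses need longer sequences and a proper statistical treatment.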
At this stage, it is of prime importance to understand where the gene products are located inside the cell. The study of uridylate kinase by the group of Anne Marie Gilles explores this domain, in collaboration with the Department of Biochemistry at the University of Hong Kong (Jiandong Huang and his colleagues, within the Procore research program). Another approach, developed in the Unit for several years, is to understand how the production of sulfur-containing molecules is organised in the cell. Sulfur is of special interest because of the extreme versatility of this atom in terms of oxido-reduction states. The study of sulfur metabolism has therefore been emphasized (Isabelle Martin-Verstraete in Paris and Agnieszka Sekowska in Hong Kong), in particular because knowledge about this atom has often been extremely limited, owing to the difficulty of genetic and biochemical studies of sulfur-containing molecules. We have unravelled several new pathways in Bacillus subtilis (one of the two most studied bacteria) and further characterized the methionine salvage pathway that was uncovered in part during the past years.
Finally, year 2003 witnessed the dangerous development of the outbreak of atypical pneumonia (Severe Acute Respiratory Syndrome), and we judged it important to participate in the fight against the disease, on the one hand with theoretical studies on the genomes of coronaviruses (at the HKU-Pasteur Research Centre), and on the other hand with an epidemiological model meant to give an idea of the origin of the disease and of its development (in collaboration with INRIA and the Department of Mathematics at the University of Hong Kong). The model proposed, that of a “double epidemic” caused by a first innocuous virus that can mutate in certain patients and lead to SARS, fits well with observations in the field (in particular the large difference between regions in China). This model suggests that the initial virus could persist as an endemic mild pathogen, and might occasionally lead to a resurgence of the disease. It is also interesting in that it suggests that the primary infection caused by the ancestral virus can protect against the disease, and therefore that a vaccine could be developed (at least a vaccine with a significant protective effect, if not a long-lasting one). In collaboration with the Shanghai Genome Centre and the Guangdong SARS Consortium we participated in the analysis of the autumn 2003 SARS episode, showing that the virus is still evolving fast in its human and civet cat hosts. The study was published in 2005. To be more directly involved in studies meant to understand pathogenicity, the Unit is now a member of the European Network of Excellence EuroPathoGenomics.
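The qualitative behaviour of such a “double epidemic” can be explored with a small compartmental simulation. The Python sketch below is a toy version written for illustration only: it is loosely patterned on standard SEIR-type models with an added "protected" class for people who have met the mild virus, and it should not be read as the published model. All names, the explicit Euler integration scheme and every parameter value are assumptions made here, not fitted quantities.

```python
from typing import Dict


def simulate_double_epidemic(protected_fraction: float,
                             days: float = 300.0,
                             dt: float = 0.1,
                             population: float = 1_000_000.0,
                             initial_cases: float = 10.0,
                             beta_sars: float = 0.5,
                             beta_mild: float = 0.3,
                             incubation_rate: float = 0.25,
                             removal_rate: float = 0.2) -> Dict[str, float]:
    """Toy two-virus model integrated with explicit Euler steps.

    Compartments: susceptible, exposed (incubating SARS), infectious (SARS),
    removed (recovered or dead), protected (immunised by the mild virus).
    Protected people spread the mild virus to susceptibles, who then become
    protected themselves. All rates are per day and purely illustrative.
    """
    if not 0.0 <= protected_fraction < 1.0:
        raise ValueError("protected_fraction must be in [0, 1)")
    protected = protected_fraction * population
    infectious = initial_cases
    exposed = 0.0
    removed = 0.0
    susceptible = population - protected - infectious
    if susceptible < 0.0:
        raise ValueError("initial cases exceed the unprotected population")

    for _ in range(int(round(days / dt))):
        new_exposed = beta_sars * susceptible * infectious / population * dt
        new_protected = beta_mild * susceptible * protected / population * dt
        new_infectious = incubation_rate * exposed * dt
        new_removed = removal_rate * infectious * dt
        susceptible -= new_exposed + new_protected
        exposed += new_exposed - new_infectious
        infectious += new_infectious - new_removed
        removed += new_removed
        protected += new_protected

    return {"susceptible": susceptible, "exposed": exposed,
            "infectious": infectious, "removed": removed,
            "protected": protected}


def cumulative_sars_cases(state: Dict[str, float]) -> float:
    """Everyone who has been infected by the SARS virus so far."""
    return state["exposed"] + state["infectious"] + state["removed"]


if __name__ == "__main__":
    for fraction in (0.0, 0.5):
        final = simulate_double_epidemic(fraction)
        print(fraction, round(cumulative_sars_cases(final)))
```

Running the simulation for regions that differ only in the fraction of the population already protected by the mild virus gives very different outbreak sizes, which is the kind of regional contrast the double-epidemic hypothesis was meant to account for.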
Summary of the 1998-2003 activity
The Unit of Genetics of Bacterial Genomes focuses on the large-scale approaches derived from the Bacillus subtilis genome program, previously coordinated by the preceding Unit. Created in 1986, the parent structure, named the Regulation of Gene Expression Unit, analysed the nature of heredity, which stores the information required to generate life, and focused on the determination of gene functions in reference genomes, coupling computer predictions (experiments in silico) with experiments in vivo.
The scientists in the Unit investigated how the thousands of genes in the chromosome of a cell co-operate in an organised manner in an ever-changing environment. Their studies have been guided both by the results of experiments in vivo, which allowed them to identify genes placed progressively higher and higher in the hierarchy of genetic controls of the life of the cell, and by the spectacular progress of molecular biology. Two reference micro-organisms were used: Escherichia coli, the most long-standing genetic model, and Bacillus subtilis, a source of numerous enzymes used by industry, often found on the surface of leaves, and abundant in soil. The studies developed in the Unit identify the genes which are critical to the overall adaptation of the bacterium to its environment, and investigate in particular the metabolism of molecules essential for the cell's construction, that of sulfur and polyamines. Large-scale expression profiling experiments, such as two-dimensional gel electrophoresis of all the proteins in the bacteria, were used to describe the co-variations in the concentrations of particular proteins as a function of the growth status of the bacteria, their environment and changes in the genes studied. Combined with the analysis of the whole set of transcripts and of the genome sequences, typical of the new field of research now called "genomics", this approach provided a wealth of information. It has shown that there are large groups of genes within the cell that are regulated in the same way. A mass of information is generated by sequencing genomes, and many of the newly identified genes are enigmatic in nature. To contribute to their understanding, molecular genetic studies in the Unit are being complemented by research involving the most up-to-date techniques in computer data management, statistics and mathematics.
A genome view of the coordination of gene expression
1. The Bacillus subtilis genome sequence (P. Glaser, MF Hullo, with several students for variable periods of time)
In 1995, the very short genomes of two bacteria had been published by TIGR, while the yeast genome sequence was about to be completed. Started almost ten years earlier, the B. subtilis genome program was well under way, and a BIOTECH grant from the European Union was supporting a consortium of European laboratories for completing the sequence, expected to be finished at the end of 1998. The Japanese consortium was also well on the way. However, it appeared important to speed up our efforts to be present on the international scene at a moment when many laboratories began to be interested in the outcome of genome programs. Together with Frank Kunst, the European coordinator of the program, we decided to speed up the procedure by involving in the sequencing effort laboratories which had been part of the yeast genome program. This was made somewhat difficult because many regions of B. subtilis DNA, as with all A+T rich Gram positives, are impossible to clone in standard E. coli recipient strains. We therefore combined standard cloning procedures in a special E. coli strain constructed for this purpose (TP611), cloning into B. subtilis itself, and Long Range PCR (without cloning) for the most difficult regions. This allowed us to obtain the complete genome sequence in April 1997, well before the time expected, and to distribute it to the members of the consortium. In addition, some regions where we suspected the presence of errors were distributed to the yeast teams, so that they would be sequenced again, resulting in excellent accuracy (of course, this was not said to the relevant sequencing groups, to avoid useless conflicts inside an exemplary collaborative effort). The complete sequence was presented at the International Bacillus Meeting in Lausanne in mid-July 1997, and the sequence was made public in parallel with its publication in November of that same year.
At this point a new effort by the same European-Japanese consortium, but under the leadership of the Japanese teams, endeavoured to inactivate one by one all the B. subtilis genes. This effort, at present the only one of its kind in bacteria, was completed and published in 2003. It is the basis of further analyses that provided the international community with extremely important results, as will be presented below.
2. Databases and genome annotation platforms (Maude Klaerr-Blanchard, Claudine Médigue, Ivan Moszer, Eduardo Rocha, in collaboration with Louis Jones at the Service Informatique Scientifique, and several laboratories external to the Institut Pasteur de Paris)
The sequence and its annotation are displayed in the relational database SubtiList, derived from its prototype for the E. coli genome, Colibri. SubtiList still answers several thousand queries per day, more than two years after the sequence was published.
The three levels of the process were:
Each level was made of a generic computer software engine, together with specific data. The aim of the process was to define a set of three coupled software engines. Each specific application of the process gave as many valuable results as there were sequences, annotations or predictions created.
To construct other specialized databases (SubtiList, Colibri, TubercuList, PyloriGene, etc.) it was necessary to introduce sequence data and their annotations into the GenoList engine. This input required a generic procedure. The value of the specialized databases comes from two sides: on the one hand from the genericity of the GenoList engine and its user-friendly and biologically-oriented construction, and on the other hand from the quality of the sequences (above all, of their annotation). This value diminishes as time elapses if annotations are not curated, and increases if they are. Curation of a set of annotations adds considerable value, in parallel with the creation of a know-how that is extremely difficult to reproduce. Our Unit is curating the B. subtilis annotations. At the HKU-Pasteur Research Centre several other databases were also developed, in particular LeptoList, which corresponds to the Leptospira interrogans genome programme in which we participated.
Imagene is a generic engine which allows management and strategic organisation of both biological objects (sequences, annotations, images) and methods for analysis or management, within the same platform. It is meant to make an in-depth analysis of genomes locally, for a fine description of their properties. It has been validated on the specific example of B. subtilis, by permitting identification of all its coding sequences and regions of transcription termination. It was used to predict regions that carried errors due to the sequencing process.
These regions were PCRed out of the chromosome and resequenced by an independent
Two special features give added value to Imagene as time elapses. On the one hand, the data can be organised in such a way as to construct efficient specialised strategies for genome annotation. On the other hand, the number of analysis methods that are plugged in can increase without limitation. If they are accessed through the Internet, the engine will know how to start them and recover their results. It will also know how to integrate them into its strategies. As a consequence, the rational integration of old and new methods into new strategies will increase with time and use. Among the methods, data management is a special case: it is quite possible to think about plugging specialised databases into Imagene, which will see this as a special task, data management. One can also think about creating relationships between sequence and annotation data (neighbourhoods, see Indigo, next paragraph). All these features have been included in a totally recreated structure, Geno* (GenoStar), in collaboration with INRIA and two companies, GenomExpress and Hybrigenics.
3. Indigo is a prototype platform used as an aid to discovery, meant to find hard-to-predict neighbours between the various functions related to genes (P. Nitschké, C. Hénaut, in collaboration with P. Guerdoux-Jamet and A. Hénaut).
3. The map of the cell is in the chromosome (I. Moszer, E. Rocha, in collaboration with A. Hénaut and A. Viari)
Knowledge of whole genome sequences is a unique opportunity to study the relationships between genes and gene products at the global level of the cell's architecture. Part of the difficulty of this study comes from the fact that, contrary to a generally accepted intuitive idea, there is often no predictable link between structure and function in biological objects. However, as the outcome of natural selection pressure, there must exist some fit between genes, gene products and the survival of the organism. This indicates that observing biases in features which one would expect to be unbiased is the hallmark of some selection pressure.
A geneticist's view: master genes and intermediary metabolism
1. Cyclic AMP and adenylate cyclases: the discovery of a fourth cyclase class (M.-P. Coudart-Cavalli, P. Trotot, P. Biville, O. Sismeiro)
Cyclic AMP is a mediator of catabolite repression in bacteria. Curiously, despite the interest in this important process, not much was known about the rather elusive enzymes, adenylate cyclases, which make this molecule from ATP. By 1996, the work in the Unit had already discovered three main classes of these enzymes, which were apparently unrelated phylogenetically. Very remarkably, this work demonstrated that Gram negative bacteria could differ in the nature of the adenylate cyclase they harboured: enterobacteria had one type, while myxobacteria or rhizobia had another type (presumably a more ancestral form, since it is phylogenetically similar to the enzymes found in Eukarya). In the course of a screening for adenylate cyclases in bacteria related to enterobacteria, but differing from them, we made the surprising discovery that A. hydrophila harboured a fourth adenylate cyclase type, an enzyme closely related to proteins found in Archaea. This protein was found in all strains of A. hydrophila investigated, but not in other Aeromonas sp. The counterpart of the gene was found in the Y. pestis genome, and shown to express adenylate cyclase activity (unpublished). The reason for this extraordinary variety in adenylate cyclases is not known.
2. Global analysis of the H-NS protein function (P. Bertin, F. Hommais, O. Soutourina, C. Tendeng and several trainees)
To study the global regulation of bacterial metabolism, in particular in pathogenic microorganisms, we used the hns mutation in Escherichia coli as a reference system. Indeed, the H-NS protein is known to be involved in numerous functions in the cell and to affect the expression of genes regulated by environmental factors (temperature, osmolarity, ...). Three main topics have been developed since 1996.
3. Pyrophosphate effects on Escherichia coli: a link with iron metabolism (F. Biville, E. Turlin, M. Perrotte, C.-K. Wun, and several trainees)
In the course of the study of cAMP synthesis in E. coli, the effect of pyrophosphate, a product of the reaction producing cAMP from ATP, was investigated. A first series of experiments demonstrated that, in a phosphate-rich minimal medium, pyrophosphate had a surprising growth-stimulating effect. This effect was accompanied by a significant modification of the expressed proteome pattern of the cells. This could not be due to phosphate starvation, and the first hypothesis which came to mind was that energy from the energy-rich bond of the molecule was somehow recovered by the cell. However, all experiments meant to explore this hypothesis were unsuccessful. In particular, the non-hydrolysable analog methylene diphosphate had an effect similar to that of pyrophosphate. Analysis of the metabolic activities which varied upon pyrophosphate addition suggested that the tricarboxylic acid cycle was somehow involved. Further exploration demonstrated that the pyrophosphate effect is mimicked by addition of excess iron to the medium. This demonstrated first that, even in a medium supplemented with 5 mM iron, there is still some iron deficiency in a phosphate-rich minimal medium, and, second, that the pyrophosphate molecule somehow helps the cell to scavenge existing iron in the environment in a way which permits it to thrive on a low iron level (M. Perrotte thesis). Work in progress demonstrates that a phosphorelay system (two-component regulator) of unknown function is involved in this process. When unraveled, this will add interesting information on a set of genes of unknown function in the genome of E. coli and will contribute to improving its annotation.
4. Functional analysis of the B. subtilis genome: polyamines and sulfur metabolism (JY Coppée, P. Glaser, M.-F. Hullo, I. Martin-Verstraete, E. Presecan, A. Sekowska, C.-K. Wun)
Among the aims of the functional analysis of genomes is the possibility of rapidly reconstructing entire metabolic pathways. This cannot be done using in silico analysis alone, because many proteins have a common descent. As a result, related activities often share similar sequences (e.g. a decarboxylase specific for a given amino acid must be similar to its counterpart specific for another amino acid). We have therefore constructed relatively rapid tests on plates with molecules or ions that could help us to trace as efficiently as possible the genes involved in integrated metabolic pathways. Amino acid metabolism is not well described in B. subtilis, and although quite a few gene similarities point to expected enzyme activities, it is necessary to validate the hypotheses derived from these similarities. Using amino acid analogs or certain types of antibiotics is a way to achieve this goal. In addition, we set up several growth condition tests (in particular for swarming or gliding on plates) to test for more subtle phenotypes (A. Sekowska, thesis dissertation).