History of the HKU-Pasteur Research Centre




Research activity 2000-present (discontinued in 2009)

In 2000, the Regulation of Gene Expression Unit ceased its activity after 14 years mostly devoted to the Bacillus subtilis genome sequencing programme. This provided the opportunity to set up the HKU-Pasteur Research Centre in Hong Kong and to create a new Unit at the Institut Pasteur in Paris.

Summary of the 1995-2000 activity at the Regulation of Gene Expression Unit

In April 2000, the Director of the Unit left Paris for Hong Kong (China) to create the HKU-Pasteur Research Centre, a joint venture between the University of Hong Kong and the Institut Pasteur in Paris. At the end of that year, a new Unit was created in Paris, Genetics of Bacterial Genomes, focusing on the functional genomics of bacteria of general interest, with the following core programme.

The conceptual background on which the activity of the new Unit is developed is based on the conjecture that cells behave as computers do: reading and expressing a program, in a purely declarative way. With his Universal Machine, Alan Turing showed that all computations as well as all operations of logic could be performed by a simple machine reading and modifying a tape carrying a linear sequence of symbols. This required only the physical separation between the data forming a string of symbols and the machine itself. Our previous demonstration that horizontal gene transfer accounted for a large fraction of bacterial genome sequences strongly supported the analogy of the cell as a Turing machine, to a point where it could be considered as highly revealing, if not (of course) explaining life in totality. However, machines do not make similar machines. Very simple automata such as crystals do so, but as soon as they are complicated enough, this apparently becomes impossible. There is only one exception: living organisms. They are nevertheless described using a computer-related metaphor, that of the genetic program. What would be the constraints if we had to think of a computer making a computer? In a crucial reflection due to John von Neumann, the answer is that within the computer there should be some type of image of the machine that would also be passed along from generation to generation. This requires both a hereditary abstract component, similar to a text, and a structural component. Because in living organisms the most obvious hereditary component is the chromosome, it is interesting to explore if, and how, some image of the cell could be built into the chromosome organization.

This triggered our research programme: would it be possible to consider the cell as a Turing machine, and if so, what are the implications in terms of the concrete biological objects needed to make it run? The accidents that occur continuously during reproduction lead genes to be modified, to disappear, or to change place. One would therefore expect that after some time a more or less random distribution of the genes should be observed in genomes. In concrete terms, is the order of the genes random in the genome? And, in parallel, where are the gene products located in the cell: does one find them everywhere? The work in the Unit, in collaboration with the first programme of the HKU-Pasteur Research Centre (created in Hong Kong in 2000), consisted in exploring this conjecture, making experiments that might uncover some of the physico-chemical constraints that organise the cell. To this aim, experimental work at the bench was systematically combined with work in silico, performing conceptual investigations that serve as references and predictions for experiments. As a starting point, we first analyzed the way in which DNA and genes are handled by the various machineries in bacteria, explored the diversity of the corresponding processes and then tried to see whether, despite this diversity, some common features emerged. We further explored bacterial genome diversity to find out the nature of the processes that must be embedded in the genetic programmes and that allow one to understand what makes both their universal nature and their diversity.

The core of our approach is centered on exploring the relationships between biological objects (neighbourhoods), trying to relate the architecture of the genome to that of the cell. This view of biology as « symplectic » (from συν, together and πλεκτειν, to weave: this is similar to the Latin "complex", but without the unfortunate connotations that are associated with the latter word) has been expanded in a book, The Delphic Boat: What Genomes Tell Us, Harvard University Press, 2003. The concrete experimental and conceptual setup developed to explore the consequences of this view during the years 2000-2007 is summarized below. Having shown that the codon usage bias, the chromosome strand preference, the expressivity and the essentiality of genes all cooperate to shape genomes, we looked for physico-biochemical selection pressures that may drive that organization. We chose to focus on two such constraints, temperature and the reactivity of the sulfur atom, and we tested our interpretation of the data we collected on pathogenicity as an integrating process.

Our work is therefore divided into: (i) analysis of the integration of cell functions using comparative in silico analysis of genome structures, coordinated via a genetic and phylogenetic analysis, and a biochemical and physico-chemical experimental setup developed into two programmes, (ii a) genomics of bacteria living in extreme environments (cold conditions or toxic compounds); (ii b) deciphering of the sulfur metabolism pathways in several organisms; while (iii) the knowledge thus generated, including integration of the corresponding observations derived from large-scale expression profiling and modelling, is structured into genome annotation and the construction of specialized reference databases.

To this aim we participated in or organised several genome programmes (Leptospira interrogans LAI, Photorhabdus luminescens TT01, Pseudoalteromonas haloplanktis TAC125 and Staphylococcus epidermidis ATCC12228; sequencing and annotation of Herminiimonas arsenicoxydans), and we deciphered the elusive methionine salvage pathway in Bacillus subtilis and Pseudomonas aeruginosa and the major steps of regulation of sulfur transport and assimilation in B. subtilis. In parallel, we established several new rules of bacterial genome organisation, as well as universal rules in the protein composition of whole proteomes.

Summary of the years 2001-2003 activity of the Genetics of Bacterial Genomes Unit

Summary of the years 2003-2005 activity of the Genetics of Bacterial Genomes Unit

Summary of the years 2006-2007 activity of the Genetics of Bacterial Genomes Unit

Summary of the activity of the Genetics of Bacterial Genomes Unit (2006-present)

Programme and activity of the Genetics of Bacterial Genomes Unit (2008-2009)

Not described

Activity of the Genetics of Bacterial Genomes Unit (2006-2007)

Abstract: The three processes needed to create life, compartmentalization, metabolism, and information transfer (memory stored in nucleic acids and manipulation operated by proteins), are embedded in organized genome features. Research in the Unit aims at creating an integrated view of these processes via the study of bacterial genomes. It explores specific features of metabolism, namely sulfur metabolism in Firmicutes and constraints due to the confinement of reactive chemicals in the tiny volume of the cell. The role of unavoidable physical parameters such as temperature in the control of gene expression is also investigated. Metabolism of RNA molecules, which is particularly sensitive to temperature, is shown to provide an integrative link between metabolism and information transfer. In silico analysis of bacterial genomes demonstrates that the core of life puts together genes controlling growth and maintenance (which drives survival), while life in context is controlled by genes which explore and exploit specific niches. Analysis of the persistence of genes in genomes shows that the former class of genes constitutes the paleome, which recapitulates the three phases of the origin of life: metabolism of small molecules on surfaces, substitution of surfaces by an RNA-world where transfer RNA played a central role, and invention of template-mediated information transfer. Colonization of the niche is performed using an unlimited set of genes, forming the cenome. The agreement of the paleome structure with a consistent scenario for the origin of life is such that we may consider extant genomes as providing us with an archive of the origin rather than as a palimpsest where most of our past would be irremediably hidden.

A new revolution is spreading in Biology: our level of understanding of what life is allows us to consider putting together what we understand into models of living cells. The next twenty years will be the time of Synthetic Biology. Research in the Unit aims at identifying constraints that we must take into account in order to propose a successful synthetic biology programme. This work combines in silico analyses, where global properties of the genome or the proteome organisation are uncovered, with genetic and biochemical experiments identifying specific objects or regulatory processes that exert a selective pressure which ultimately results in genomes as we discover them.

Metabolic studies identify a deep link between sulfur and RNA metabolism (G André, AM Gilles, MF Hullo, I Martin-Verstraete, U Mechold, O Soutourina, CH You and a variety of collaborators)

Among the six atoms that are essential to construct biomolecules (carbon, hydrogen, nitrogen, oxygen, phosphorus and sulfur), the latter is the least well understood, despite its enormous importance (all proteins derive from polypeptides which begin with a methionine residue, an amino acid containing a sulfur atom). This atom is also extremely sensitive to electron transfers, and it is one of the most likely elements of the primitive metabolism that led to what we now identify as Life. Research in the Unit identified the major components of the anabolic cycle of sulfur acquisition by the cell via the synthesis of cysteine and methionine, and the reverse transsulfuration pathway that allows the cell to make sulfur compounds from methionine as the source. The first elements of the transport of sulfur metabolites and the central regulator of the synthesis of sulfur compounds have been identified, providing the first elements that can be used to construct models of the corresponding metabolism. Methionine recycling and repair are also under investigation, as possible links with the selection processes that operate on sulfur metabolism. In particular it has been observed that, in the pathway from commensalism to pathogenicity of Staphylococcus epidermidis, methionine repair plays an important protective role against innate immunity processes which use the formation of reactive oxygen species as a means to combat infection. Further work, investigating the fate of the orphan metabolite 3',5'-adenosine bisphosphate (pAp) that is produced during sulfate assimilation, demonstrated the generally unsuspected link between sulfur metabolism and RNA degradation. Analysis of the targets of pAp showed that the molecule inhibits the activity of RNA degradation enzymes in very distant members of the Bacteria.
In parallel we explore the details of the synthesis of nucleotides, in particular via the study of uridylate kinase, an enzyme which has an unusual hexameric structure in Bacteria, differing considerably from the homolog found in Eukarya.

Transcription initiation control is usually taken as the most important regulatory step in gene expression. However, modulation of transcription initiation can only be effective if the concentration of the regulated mRNA can change significantly when initiation rates vary. If the mRNA is stable, this can only occur if the volume of the cell is changing; otherwise the mRNA must be degraded fairly rapidly. The consequence of this very simple observation is that processes leading to RNA degradation play an essential role in the regulation of gene expression. In Bacteria, RNA is degraded by a combination of endonucleolytic attacks and usually processive exonucleolytic degradation of the RNA fragments. Processivity usually leads to very short RNA leftovers, which we named "nanoRNAs" to distinguish them from the microRNAs that play such an important role in nucleated cells. An obvious consequence of this situation is that nanoRNA accumulation is highly toxic for the cell, as nanoRNAs interfere with transcription and replication by entering the transcription or replication "bubbles". They need to be degraded by specific nucleases, which we named nanoRNAses accordingly. In Proteobacteria, to which Escherichia coli belongs, the enzyme Orn degrades oligonucleotides. We discovered that its activity is modulated by pAp, thus creating a strong coupling between sulfate assimilation and RNA turnover. CysQ, the phosphatase that degrades pAp into 5'AMP, therefore plays an interesting regulatory role. Reversing Monod's saying "whatever is true for E. coli is true for the elephant", we discovered that the E. coli enzyme, like the human enzyme, is highly sensitive to lithium. Furthermore we found that the human homolog of Orn, Sfn, is also inhibited by pAp, opening an interesting conjecture about possible effects of lithium used to treat bipolar disorder.
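The argument about mRNA stability can be made concrete with a small numerical sketch. It assumes, as a simplification not stated in the text, that the mRNA concentration m follows first-order kinetics, dm/dt = k - (degradation + growth) * m, where k is the initiation rate, "degradation" is the decay rate constant and "growth" stands for dilution through the increase in cell volume. The function names and parameters are hypothetical and chosen only for illustration.

```python
import math


def mrna_level(t, k_old, k_new, degradation, growth=0.0):
    """mRNA concentration at time t after the initiation rate switches
    from k_old to k_new at t = 0.

    Assumed model (illustrative only): dm/dt = k - (degradation + growth) * m
    """
    loss = degradation + growth
    if loss <= 0:
        # With neither decay nor dilution, no steady state exists.
        raise ValueError("degradation + growth must be positive")
    m_old = k_old / loss  # steady state before the switch
    m_new = k_new / loss  # steady state after the switch
    return m_new + (m_old - m_new) * math.exp(-loss * t)


def response_half_time(degradation, growth=0.0):
    """Time needed to cover half the distance to the new steady state."""
    return math.log(2) / (degradation + growth)
```

In this sketch the response time depends only on the sum of the decay and dilution rates, not on the initiation rate itself: a stable mRNA (degradation close to zero) follows a change in initiation only as fast as the cell volume grows, which is the point made in the paragraph above.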

Bacterial genomes combine a core set of persistent genes, meant to permit life and survival, with highly variable genes permitting life in context (G Fang, EPC Rocha, TZ Wang and collaboration with the Atelier de Génomique Comparative, Genoscope, the Computer Centre, Hong Kong University and the Biosapiens, Europathogenomics and Probactys European consortia)

Summary of the years 2003-2005 activity

Abstract: In collaboration with the Genoscope, the Universities of Liège, Naples, Stockholm and Strasbourg, and the University in Hong Kong, the Unit has completed the sequencing and annotation of the genome of a bacterium from Antarctica, Pseudoalteromonas haloplanktis TAC125. In parallel, it further developed its studies of sulfur metabolism in Bacillus subtilis, solving several problems of transport of sulfur-containing molecules, and of the methionine salvage pathway. Work in silico (with computers) allowed us to characterize further many components of the constraints that operate on the gene distribution in chromosomes as well as new universal motifs: the class A flexible patterns, densely covering most genomes. In parallel, we defined the concept of gene persistence as conservation of orthologs not only in sequence but also in location in the chromosome and in phylogenetic pattern. We have thus extended the concept of gene essentiality in bacteria to those genes that are persistent in the sequence of their product and in their organisation inside the chromosomes. Moreover we uncovered ubiquitous rules in the distribution of amino acids in proteins. Our work aims at seeing post-sequencing biology as symplectic biology, where the links between objects make the core of the discoveries to come.

Rather than considering the hereditary material as a simple collection of genes, the aims of genomics are to provide an understanding of the functional organisation of genes within chromosomes and to explain how this organisation produces life. Bacteria are ideal subjects for such studies because they have existed for a very long time (more than three billion years of evolution) and are highly diverse. Understanding how genes interact makes it possible to evaluate more accurately the adaptive potential of bacteria, both in the environment and in and on our bodies (they are everywhere and our bodies contain at least ten times more bacteria than human cells). Despite the negative connotations associated with bacteria, the fashion for nutraceutics ("medical" foods) is based on the implicit idea that bacteria are most often beneficial, even if, on occasion, they can become highly pathogenic. Surprisingly, there are few differences between commensal bacteria and the bacteria responsible for diseases. One of the aims of comparative genomics is to understand how differences in genome organisation can determine whether a bacterium is innocuous (or beneficial to the host) or virulent.

This requires, of course, reference models in which we understand practically all that can be known about the organism. Two major classes of bacteria can be distinguished by a specific staining method developed by the Dane Christian Gram. Gram-positive bacteria are common in foods (lactobacilli and streptococci are present in yoghurts and cold meats, for example). In some cases, they may be pathogenic (Staphylococcus aureus). The model for these bacteria is Bacillus subtilis, for which the Unit has been a driving force behind genomic studies. Our current research aims to determine how the genes of this organism are organised, both by computer-based (in silico) studies based on the analysis of gene sequences and their products (mRNA and protein), and by the study of sulfur metabolism, which is highly structuring. We began by establishing a number of selective rules forcing genes to prefer one strand of DNA rather than the other. These rules are due to a selection pressure that favours the progression of the replication fork in the same direction as transcription, preventing conflicts leading to the production of truncated mRNA molecules, which in turn generate truncated proteins. Sulfur metabolism genes are grouped in functional islands. Within these islands, we recently characterised mostly genes encoding transport proteins and some of the proteins regulating their expression. We have now begun to extend our studies to pathogenic organisms of the same class. We have also characterised a little-known pathway, the methionine salvage pathway (all the proteins of living organisms begin with this amino acid), which we have shown sometimes leads to the synthesis of an unexpected gas, carbon monoxide, that may act as an intercellular signal by means of currently unknown mechanisms.

Escherichia coli is the model gram-negative bacterium and is today the best understood organism in the world. As part of a transverse research programme, we analysed families of gram-negative bacteria in an attempt to determine what makes some bacteria beneficial and others not (for example, most strains of E. coli are harmless, but certain strains of E. coli cause colibacillosis, a well-known disease). We studied the determinants of pathogenesis in a related bacterium, Photorhabdus luminescens. This bacterium is extremely pathogenic in insects and would be highly dangerous to humans if it were able to grow at our body temperature, which is fortunately not the case. We characterised a series of genetic control systems to identify the keys to the remarkable pathogenicity of this organism. This work will be continued in the next few years, using the silkworm as the host organism. One of the main advantages of this approach is that it enables us to study bacterial virulence without using mammals, whilst generating results that can be extrapolated to these animals.

A. Universal rules of bacterial genomics

Universal motifs covering genomes (E Larsabal)

Considering first the genome as a whole, we were interested in the enigmatic periodic bias of 10-11.5 bp in the distribution of nucleotides found in almost all genomes, from prokaryotes to eukaryotes. This bias is present throughout a given genome, both in coding and non-coding sequences. Using a technique for analysis of auto-correlations based on linear projection, we identified the sequences responsible for the bias. We showed that prokaryotic and lower eukaryotic genomes are literally covered with ubiquitous patterns that we termed "class A flexible patterns". Each pattern is composed of up to ten conserved nucleotides or dinucleotides distributed into a discontinuous motif. Each occurrence spans a region up to 50-70 bp in length. We named them "flexible patterns" because there is some limited fluctuation in the distances between the nucleotides composing each occurrence of a given pattern. Remarkably, when taken together these patterns cover up to half of the genome in the majority of prokaryotes. We proved that they generate the previously recognized 11 bp periodic bias. Judging from the structure of the patterns, we suggest that they may define a dense network of protein interaction sites in chromosomes. In organisms such as Helicobacter pylori these motifs constrain as much as one base in four, suggesting that the amino acid content of proteins might be affected. The reason why they had escaped identification until that work is that, being flexible, they cannot generate a rigid consensus sequence of the type which is universally considered by investigators analysing genome sequences.
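To illustrate what a periodic bias in nucleotide distribution means, here is a minimal sketch that counts how often the same base recurs at a given distance along a sequence. It is a plain same-base autocorrelation, not the linear-projection technique used in the work described above, and the function names are hypothetical. In real genomes the signal is a weak statistical excess spread over millions of positions, not an exact repeat as in a toy example.

```python
def same_base_autocorrelation(seq, max_lag):
    """For each lag from 1 to max_lag, return the fraction of positions i
    for which seq[i] and seq[i + lag] are the same base."""
    seq = seq.upper()
    result = {}
    for lag in range(1, max_lag + 1):
        pairs = len(seq) - lag
        if pairs <= 0:
            break  # sequence too short for this lag and all longer ones
        matches = sum(1 for i in range(pairs) if seq[i] == seq[i + lag])
        result[lag] = matches / pairs
    return result


def strongest_lag(seq, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest same-base fraction."""
    scores = same_base_autocorrelation(seq, max_lag)
    candidates = {lag: s for lag, s in scores.items() if lag >= min_lag}
    if not candidates:
        raise ValueError("sequence too short for the requested lags")
    return max(candidates, key=candidates.get)
```

Applied to a genome, one would look for a modest peak of the same-base fraction around lags of 10 to 12 relative to neighbouring lags. The quadratic cost of this naive loop is acceptable for a sketch but not for whole-genome scans.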

Essential, not highly expressed genes, are located in the leading strand of the chromosome (Japanese-European consortium, G Fang, MF Hullo, E Rocha, A Sekowska)

With the international consortium set up to sequence the genome of B. subtilis we managed to inactivate one by one all genes of the organism. This led us to identify those which were needed for growth under laboratory conditions. In silico analysis further extended to other organisms allowed us to show that those « laboratory essential » genes are systematically encoded in the leading strand of the chromosome. We further showed that this most likely results from selection for avoidance of replication/transcription conflicts. This also showed that most if not all of those genes belong to multicomponent complexes. Furthermore, we showed that, in addition to the well-known operons or pathogenicity islands, genes belonging to common processes are grouped in the chromosome. This is true, in particular of the genes of sulfur metabolism, indicating that processing of this atom in the cell probably provides strong selective pressure on the cell organisation.
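The notion of a gene being "encoded in the leading strand" can be expressed as a short computation. The sketch below assumes a circular chromosome replicated bidirectionally from a single origin to a single terminus, with "+" genes transcribed in the direction of increasing coordinates. The function names, the coordinate convention and the input format are assumptions made for illustration and are not taken from the consortium's analysis.

```python
def is_leading(position, strand, origin, terminus, genome_length):
    """True if a gene is transcribed in the same direction as the
    replication fork that copies it (i.e. lies on the leading strand).

    Assumptions: circular chromosome, bidirectional replication from
    `origin` to `terminus`, '+' strand read with increasing coordinates.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Distances measured clockwise (increasing coordinates) from the origin.
    to_gene = (position - origin) % genome_length
    to_terminus = (terminus - origin) % genome_length
    on_clockwise_replichore = to_gene < to_terminus
    # On the clockwise replichore the fork moves with increasing
    # coordinates, so '+' genes are co-oriented with it; on the other
    # replichore the fork moves the opposite way, so '-' genes are.
    return (strand == "+") == on_clockwise_replichore


def leading_fraction(genes, origin, terminus, genome_length):
    """Fraction of (position, strand) pairs located on the leading strand."""
    if not genes:
        raise ValueError("no genes given")
    leading = sum(
        is_leading(pos, strand, origin, terminus, genome_length)
        for pos, strand in genes
    )
    return leading / len(genes)
```

Comparing this fraction for a list of essential genes with the fraction for all genes of the same genome is one simple way to look for the kind of strand bias described above.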

The concept of “persistent” genes (G Fang, E Rocha)

With this observation of a bias in the function of the genes located in the leading DNA strand, we further extended in silico the concept of gene essentiality in bacteria, focusing on gene persistence. For this we reversed the analysis, considering genes which are preferentially located in the leading strand while not being « laboratory-essential ». Comparing 55 genomes of Firmicutes and Gamma-proteobacteria to identify the genes which, while persistent among genomes, do not lead to a lethal phenotype when inactivated, we showed that the characteristics of persistence, conservation, expression and location are shared between persistent non-essential genes and experimentally essential genes. Persistent non-essential genes are related to maintenance and stress responses. This outlined the limits of current experimental techniques to define gene essentiality and highlighted the essential role of genes implicated in maintenance which, although dispensable for growth, are not dispensable from an evolutionary point of view. This study further allowed us to show that Firmicutes and Gamma-proteobacteria differ mostly in the construction of the cell envelope, DNA replication and proofreading, and RNA degradation. Persistent genes should then be regarded as truly essential genes. They contribute considerably to the organisation of the genome sequence.

Universal biases in the composition of proteins: the “gluon” hypothesis (G Pascal, in collaboration with C Médigue, Génoscope)

The existence of so many universal rules prompted us, as a first step towards further studies, to analyse the overall amino acid composition of the proteomes of model prokaryotes, B. subtilis, E. coli and Methanococcus jannaschii, supposed to have undergone separate evolution for more than one billion years. Using multivariate analyses, we studied the overall amino acid composition of all proteins making a proteome, exploring the correlations existing between the structure and functions of the proteins forming a proteome and their amino acid composition. The electric charge of amino acids measured against hydrophobicity creates a highly homogeneous cluster, made exclusively of proteins that are core components of the cytoplasmic membrane of the cell (integral inner membrane proteins, IIMP) and we are now able to predict that a protein is an IIMP with at least 98% accuracy. A second bias is imposed by the G+C content of the genome, acting at the first codon position, indicating that protein functions are so robust with respect to amino acid changes that they can accommodate a large shift in the nucleotide content of the genome. A remarkable role of aromatic amino acids was uncovered. Expressed orphan proteins are enriched in these residues, suggesting that they might participate in a process of gain of function during evolution. We propose that many of these proteins could be considered as “gluons” stabilising multicomponent complexes, thus labelling the “self” of a species.

All these studies substantiate the relevance of our programme: genomes are not random bags of genes but organized entities. This observation requires that selection has been active in shaping genomes. Among the many possible factors that might play a selective role, we retained two for further experimental studies. The first one is a physical constraint with which bacteria are systematically confronted: temperature. For the second one we chose chemical reactivity, having in mind that two types of ubiquitous metabolites must be dealt with, gases and radicals. The metabolism of sulfur was chosen as this atom is extremely reactive (its oxidation state goes from –2 to +6), while it is an obligatory component of cells and is involved in the synthesis of all proteins at the initiator methionine amino acid.

B. In vivo analysis of global genome properties

Coping with cold: the genome of Pseudoalteromonas haloplanktis TAC125 (Génoscope, Universities of Hong Kong, Liège, Naples, Stockholm and Strasbourg, G Fang, E Krin, G Pascal, E Rocha, A Sekowska)

As protein H-NS is a major cold-shock protein, we used comparative genomics in a phylogenetic study, cloning the counterpart gene from a variety of organisms including aquatic bacteria and a Psychrobacter isolated from Antarctica. These studies substantiated the major role of the protein and its role in the management of protons in the cell. Its main target appears to be RNA, under conditions which need to be better understood. Work with Psychrobacter led us to understand the importance of cold conditions for investigating constraints permitting life to develop.

Philippe Bertin, who was in charge of this project, moved to set up a laboratory in Strasbourg. We then shifted the work on H-NS to a more general study of the structuring role of cold conditions, the genome study of a cold-adapted bacterium, while maintaining an active collaboration with his newly created group.

Besides B. subtilis, our model remains E. coli as it is still the best known organism. To place its study in the light of evolution and within an integrated biological context, we compared it with counterparts belonging to the gamma-proteobacteria. To further explore its role in cold conditions in a global context, we undertook, in collaboration with the Genoscope and several European universities, the sequencing, annotation and physiological analysis of the fast-growing Antarctic bacterium Pseudoalteromonas haloplanktis TAC125. We discovered that these bacteria have developed a remarkable strategy for avoiding the generation of Reactive Oxygen Species, with concerted elimination of the ubiquitous sulfur-related molybdopterin-dependent metabolism, substantiating our emphasis on sulfur metabolism as a preferred target of integrative processes in bacteria. The P. haloplanktis proteome revealed an amino acid usage bias specific to psychrophiles, consistently appearing apt to accommodate asparagine, a residue prone to age in proteins through cyclisation and deamidation. Unexpectedly, P. haloplanktis did not have the hns counterpart. In contrast, we discovered that the hns defect in E. coli could be compensated for by the csrA gene of P. haloplanktis. This remarkable observation (CsrA is a protein that binds non-coding RNA) is the basis of our programme for the study of small non-coding RNAs in the organisation of the genome.

The genome sequence and annotation are available in two databases, one in Hong Kong, of the GenoChore suite (PsychroList), and one at the Genoscope, of the MaGe suite (Psychroscope). We are also exploring the possible biotechnological uses of growth at very low temperature. Since 2005, at the end of the P. haloplanktis genome project, the Unit has belonged to the GDR2909: Arsenic metabolism in prokaryotes; from resistance to detoxication. The Unit contributes its competence in genome annotation and metabolic studies (in particular in relation to sulfur metabolism). This work is an effort to participate in understanding environmental issues associated with the quality of water and bioremediation.

It may be of interest to make a side remark here. We coordinated two genome programmes, that of B. subtilis, back in 1987, and that of P. haloplanktis at the end of 2003: while the genomes are comparable in length, the workforce needed for sequencing and annotating the latter was one hundred times smaller than that needed for the former. This can place genome programmes in the perspective of the next four years.

From physical stresses to virulence in Enterobacteriaceae (JF Charles, S Derzelle, E Krin, E Turlin, collaboration with the Unit of Genomics of Pathogenic Microorganisms, Transversal Research Programme 53 of the Institut Pasteur)

Coping with a variety of stresses allows microbes to occupy their preferred niche. While this contributes to the organisation of genomes and gene products, the special situation encountered by pathogens is worth investigating, as it is usually associated with a considerable amount of horizontal gene transfer, which affects the genome organisation. To study its impact, we chose Photorhabdus luminescens, an insect pathogen. In contrast to E. coli, which lives in the very complex medium of the animal gut, it lives under extremely well defined conditions, interacting with a nematode (with which it forms a symbiosis) and killing insect larvae, in which it grows as pure cultures. Furthermore, P. luminescens can be considered as a model of extreme virulence of Enterobacteriaceae which can be studied safely as its maximum growth temperature is 35-36°C. For this study we recruited an entomologist from the Institut, JF Charles, when P. Bertin left for Strasbourg.

With the Unit of Genomics of Pathogenic Microorganisms, we first sequenced and annotated the genome of P. luminescens TT01, showing that the bacterium synthesises an unexpectedly large panel of toxins and antibiotics. One toxin was particularly interesting as it killed mosquito larvae. Constructing DNA arrays of most of the genome CDSs, we subsequently developed large-scale expression profiling studies of the organism to study the global transcription organisation of the genome under a variety of environmental and genetic conditions. This allowed us to identify several global regulatory processes (mediated by PhoP-PhoQ, AstR-AstS and H-NS), which may explain the remarkable virulence of these bacteria, able to kill their hosts when injected as a very small inoculum. In the insect larva, these bacteria outcompete other microorganisms by synthesizing antibiotics; we unraveled the genetics and physiology of carbapenem, which we chose as it uses 4-phosphopantetheine, a sulfur-containing coenzyme.

Finally, we used the expertise we had set up in a series of proteomic studies of Enterobacteriaceae (within one of the « Programmes Transversaux de Recherche » created by the direction): an analysis of E. coli strains of different origins, from commensals to pathogens. While this is just the beginning of a programme we hope will give promising results (see projects), we uncovered a metabolic pathway (deoxyribose degradation) that seems to be associated with pathogenicity.

Overall, during the past four years we gained an integrated view of gamma-proteobacteria, where transcriptome and proteome analyses, coupled to organisation of sequence and annotation data (the GenoChore suite), allowed us to explore in depth the way regulons involving large sets of genes (typically more than 100) are distributed in the genome. At this point it appears that we need to separate the genome into several classes, the persistent genes, which make its core, genes that are specific to one kind of environment (the « cenome »), and genes that belong to global regulatory networks. In the latter case it appears that RNA-protein complexes play a central role. In contrast to the situation with Gram-positive organisms, we find that relevant genes are distributed both in the leading and in the lagging strand, but probably in a highly non-random fashion. These observations are the leads that will be followed in the next few years, keeping in mind our original programme (uncovering rules of genome and cell organisation) and trying to relate our observation to shifts from free-living conditions to pathogenicity.

Sulfur metabolism: the methionine salvage pathway, methionine recycling and « orphan » metabolites (YJ Chen, AM Gilles, U Mechold, A Sekowska, CH You, collaboration with the Nara Institute of Science and Technology)

Besides temperature as an organizer, gases and radicals are unavoidable reagents that are ubiquitous in cells. We chose to study the sulfur atom, a preferred target of some of these, as it is extremely reactive. In short, having both dioxygen and hydrogen sulfide in a cell in the presence of a metal ion is more or less like having a lit match in a gas station! This tells us that sulfur metabolism must be compartmentalized. However, this metabolism is still very poorly understood, in particular in Gram-positive organisms, and we undertook a systematic exploration in B. subtilis, looking for global properties (in gene distribution, gene expression profile, etc.). A preliminary study of polyamine metabolism showed us that nitrogen and sulfur metabolism were highly interconnected, and we chose to unravel the downstream part of that « orphan » metabolism (i.e. involving molecules that are generally overlooked in textbooks).

Both in the Unit in Paris and in Hong Kong we unravelled, in collaboration with scientists in Nara (Japan), the complete methionine salvage pathway, which regenerates methionine from the orphan molecule methylthioadenosine (derived from S-adenosyl methionine, AdoMet). Several remarkable features of the pathway, not least that it produces carbon monoxide under particular conditions, give it a major role in the organisms where it is present (from bacteria to plants and humans). The methionine salvage pathway has recruited different proteins for the required activities. In B. subtilis the pathway from methylthioadenosine to methionine involves no fewer than 8 enzymes (nucleosidase MtnN, kinase MtnK, isomerase MtnA, dehydratase MtnB, enolase MtnW, phosphatase MtnX, dioxygenase MtnD, transaminase MtnE). In Bacilli, MtnW is very similar to the most abundant enzyme on earth, ribulose-bisphosphate carboxylase oxygenase (RuBisCO), which links the origin of this carbon dioxide-fixing enzyme to sulfur metabolism by acquisitive evolution. In P. aeruginosa, the couple MtnN-MtnK is replaced by a phosphorylase, MtnP, and an enolase phosphatase, MtnC, replaces the couple MtnW-MtnX. This feature is conserved in plants and animals.

A second ubiquitous methionine salvage pathway, the recycling of the first methionine of proteins, has been studied by C You (in Paris) and HY Lu (in Hong Kong) in a genetic, physiological and biochemical study of methionine aminopeptidases. This work has proven that the second map-like gene in B. subtilis, yflG, does indeed code for a methionine aminopeptidase which, curiously, is considerably more active in vitro than the product of the essential map gene, although yflG itself is dispensable. This suggests that this gene, which is the parent of the cognate map in Gram-positive cocci, has specific targets in the cell, uncovering a level of global regulation (turnover of the N-terminus of proteins) that will need further investigation both in silico and in vitro.

Finally, we began to investigate the fate of another sulfur-related « orphan » metabolite, 3'-5' adenosine diphosphate, which is created both during sulfur assimilation and 4-phosphopantetheine synthesis. Using columns tagged with the metabolite we identified several putative targets. The first one, CysQ in E. coli, is a 3'-phosphatase that we showed to be specifically inhibited by lithium. We are in the process of identifying the regulatory network involving this nucleotide, which has many features in common with 3'-5' cyclic AMP. Remarkably, while the function (degrading short oligonucleotides) is conserved, the structure is not, and the counterpart of the gene in B. subtilis (nrnA, ytqI) codes for both the phosphatase and the "nanoRNAse" activity.

Biosynthesis and transport of sulfur containing compounds (S Auger, P Burguière, S Even, MP Gomez, I Guillouard, MF Hullo, I Martin-Verstraete, transcriptome Platform of the Pasteur Genopole)

We systematically identified the major routes of sulfur assimilation in B. subtilis and uncovered a highly intricate network of regulations, superimposing AdoMet-dependent riboswitches on LysR-type regulators as well as a new enigmatic regulator, CymR (YrzC). In a first work, we showed that a specific activator, CysL (formerly YwfK), a LysR-type transcriptional regulator, activates the transcription of the cysJI operon, encoding sulfite reductase. We demonstrated that a cysL mutant and a cysJI mutant have similar phenotypes. Both are unable to grow using sulfate or sulfite as the sulfur source. The level of expression of the cysJI operon is higher in the presence of sulfate, sulfite, or thiosulfate than in the presence of cysteine. Conversely, the transcription of the cysH and cysK genes is not regulated by these sulfur sources. In the presence of thiosulfate, the expression of the cysJI operon was reduced 11-fold, whereas the expression of the cysH and cysK genes was increased, in a cysL mutant. A cis-acting DNA sequence located upstream of the transcriptional start site of the cysJI operon (positions -76 to -70) was shown to be necessary for sulfur source- and CysL-dependent regulation. CysL also negatively regulates its own transcription, a common characteristic of the LysR-type regulators. Gel mobility shift assays and DNase I footprint experiments showed that the CysL protein specifically binds to the cysJ and cysL promoter regions. This was the first report of a regulator of sulfur anabolism in Bacilli. A second regulator, YtlI, was also studied in depth and shown to regulate an operon that transports and metabolizes sulfur-containing molecules. Bacillus subtilis, in contrast to E. coli, can grow on methionine as sole sulfur source. This was investigated and part of the reverse transsulfuration pathway (the pathway existing in humans) was uncovered.
The work however demonstrated that cysteine and methionine syntheses are coupled through several other pathways, still not completely understood, creating a network of regulation that has no known counterpart yet. We assigned a function to the metI (formerly yjcI) and metC (formerly yjcJ) genes of B. subtilis by complementing E. coli metB and metC mutants, analysing the phenotype of B. subtilis metI and metC mutants, and carrying out enzyme activity assays. Interestingly, the MetI protein has both cystathionine gamma-synthase and O-acetylhomoserine thiolyase activities, whereas the MetC protein is a cystathionine beta-lyase. In B. subtilis, both the transsulfuration and the thiolation pathways are functional in vivo. Due to its dual activity, the MetI protein participates in both pathways.

The transport of many sulfur containing molecules was also investigated and transporters identified. Transcriptome and proteome experiments allowed us to explore the global properties of the sulfur regulatory network, and we uncovered a remarkable but still elusive functional link between sulfur assimilation and arginine metabolism. All these observations are now put together to explore how the corresponding network may have some role in the organisation of the cell.

Nucleotide kinases and their link with the genome architecture (C Evrin, AM Gilles and collaboration with the Laboratoire d'Enzymologie et Biochimie Structurales in Gif sur Yvette)

Sulfur metabolism aside, we have for several years documented analyses suggesting that biochemical structures forming planes (often hexagons) or tubes must play a role in the organisation of the genome structure, and we noticed that uridylate kinase had a hexagonal structure. Furthermore, the sources of nucleotides in the cell are likely to be highly organized, in particular those needed for DNA synthesis. The very fact that nucleoside diphosphates, not triphosphates, are used as precursors of deoxyribonucleotides creates a series of paradoxes in pyrimidine metabolism (UDP is made in de novo pyrimidine biosynthesis, whereas CDP is not, while DNA must avoid U and incorporate C). We were therefore very interested in the structuring role of nucleotide kinases in general, so we took the opportunity to recruit Anne-Marie Gilles, a specialist in the biochemistry of these enzymes, when the Unit to which she belonged closed. Her work resulted in the determination of the structure of hexameric uridylate kinase, revealing a large number of unexpected properties (still not completely understood). Uridylate kinases form an original class in most bacteria, and it was interesting to compare their properties with those of other nucleotide kinases: the structure of a GMP kinase was also recently solved. At this point, as we have preliminary experiments indicating that the protein might be located at the cell's envelope, it becomes interesting to analyse in detail the location of the corresponding proteins in the cell in relation to the organisation of their genes. Remarkably, uridylate kinase is systematically coded by a gene (pyrH) which, in Bacteria as distant as Firmicutes and Proteobacteria, belongs to an operon involved in translation, while uridine-containing nucleotides have not, until now, been implicated in the translation process. GFP fusions have been constructed using a new « recombineering » approach developed with JianDong Huang in Hong Kong.
These fusions already suggest that uridylate kinase is not evenly distributed in the cell, in contrast, for example, to adenylate kinase. This microscopic exploration is the most important task at this point, as our analyses of the E. coli genome organisation suggest that gene expression differentiates the two daughter cells upon division, one daughter being poised to resist stress while the other is poised to multiply rapidly.

C. Miscellanea

Our research develops in a socio-economic context, and we found it important, on a few occasions, to apply our knowledge to domains of some social relevance. Several features of our conceptual approaches can be extended from our specific domain to other research fields. While in Hong Kong we witnessed the origin of the severe acute respiratory syndrome (SARS) outbreak, and this led us to apply some of our analytical techniques to the genome of the virus. We also developed a « double epidemic » model of the disease which explains much of the unusual behaviour of the spread of the epidemic. In 2004 we analysed the new features of the recurring epidemic of atypical pneumonia (SARS) and were involved, with the epidemiological consortium of GuangDong (co-ordinated by Prof. Guoping Zhao from the Shanghai Genomic Centre), in a molecular study of the characteristics of the epidemic. The results obtained, which were very instructive and entirely compatible with the hypothesis of a double epidemic formulated in 2003, were published in 2005. In the same way, expression profiling analyses in bacteria provide an internal control both of the quality of the data and of the relevance of the approach. For example, in transcriptome experiments, while we do not provide the existence of operons (co-transcribed structures) as an input, we ought to find them in the outcome of any analysis. This is what we observed, and this allowed us to extend our studies to cancer diagnostic studies, in a way which may have some interesting outcomes.

Summary of the years 2001-2003 activity of the Genetics of Bacterial Genomes Unit

Abstract: The Genetics of Bacterial Genomes Unit makes use of the determination of the complete sequence of bacterial genomes to explore how the distribution of the genes along the chromosomes is (or is not) coupled to the distribution and/or function of genes in the cell. This study is focused on sulfur metabolism. Because of the Severe Acute Respiratory Syndrome outbreak the Unit participated in a theoretical work meant to explore the SARS-CoV coronavirus, and its spread during the outbreak (the “double epidemic” hypothesis).

The revolution of "genomics" that has recently transformed biology is steadily producing new spectacular discoveries. While the world of mass media tends to concentrate almost exclusively on the “ Human Genome ” project, it is more and more evident that one will not be able to understand much of this genome if we do not possess a thorough knowledge created by the study of powerful models, microbes in particular. This is the basic reason that drives all the major Genome Centres in the world to develop the study of microbial genomes.

Furthermore, the two major discoveries derived from genome studies in the past fifteen years were obtained after analysis of two microbial genomes: that of Saccharomyces cerevisiae and that of Bacillus subtilis (with a major contribution from the Unit of Regulation of Gene Expression, that predated the present Unit). The first one demonstrated that a fair number of genes are not fixed, belonging to a given organism, but tend to propagate from organism to organism (“horizontal” gene transfer, a concept for which we provided strong experimental evidence as early as 1991). The second one was totally unexpected: it showed that a very high proportion of the genes present in a genome, whatever the organism, does not have a known function. This is the more surprising because we now know more than one thousand genome sequences. In this context, the work in the Unit, in collaboration with the activity developed by the first program of the HKU-Pasteur Research Centre (created in Hong Kong in 2000), consists in exploring these unknown functions. To this aim, experimental work at the bench is combined with work in silico (using computer programmes), to perform conceptual experiments that serve as references and predictions for experiments performed at the bench. The central conjecture explored in the Unit is to know whether, and in the affirmative why, genes are not distributed randomly in the chromosomes. It is obvious that the accidents that occur continuously during reproduction lead the genes either to be modified, to disappear, or to change place. One would therefore expect that, after some time, a more or less random distribution of the genes should be observed in genomes. However the very idea that founded conceptual genomics, derived from the idea of the “ genetic programme ”, is that a cell behaves more or less as does a computer, where the machine is truly separated from the data and programmes it works with. 
However, one knows that a computer is not able to duplicate itself. What else is therefore needed? John von Neumann, in his work on self-reproducing automata in the late 1940s and early 1950s, made the hypothesis that, if this were to happen, then one should find somewhere an image of the machine. This drives the quest of the Unit: its scientists try to establish whether the cell and its programme are organized structures. In concrete terms, is the order of the genes in the genome random? And, in parallel, where in the cell are the gene products located: does one find them everywhere?

An important part of the work in the Unit therefore consists, on the one hand, in organizing the data that make up biological knowledge (Ivan Moszer, with the construction of the GenoList databases, until he left the Unit to set up a new axis at the Genopole of the Institute, the Platform N°4 at the Genopole, and the BIOSUPPORT programme at the HKU-Pasteur Research Centre), and, on the other hand, in analyzing the genome structure (Eduardo Rocha and scientists from the HKU-Pasteur Research Centre, see http://bioinfo.hku.hk/genolist.html). The most surprising discovery made during year 2003 was that the genes that are essential to the life of bacteria are located on the leading replication strand of the DNA double helix, and that this is not directly correlated with a high level of expression. This is accounted for by the absence of conflict between transcription and replication for these genes: the collisions that occur when genes are located on the lagging replication strand must often create truncated messenger RNAs, and hence truncated proteins. Furthermore, this discovery indicates that the products of these essential genes systematically belong to complexes formed by the association of several proteins, for one could hardly explain the toxicity of a truncated product unless it destroys the complex it is part of (think of a building with truncated beams!).
In parallel, the Unit participated in the deciphering of the complete genome sequence of three bacteria: Leptospira interrogans (in collaboration with the Genome Centre in Shanghai), a particularly dangerous bacterium that infects peasants working in rice paddies; Staphylococcus epidermidis (in collaboration with the same Centre and Fudan University in Shanghai), a bacterium present in the environment and important in nosocomial infections; and Photorhabdus luminescens (sequenced in the Laboratory of Pathogenic Microorganisms Genomics in the Institute), which is highly virulent against insects, including mosquito larvae (Jean-François Charles, Sylviane Derzelle and their coworkers). Our involvement in deciphering genomes led us to participate in the European Union Network of Excellence BioSapiens.
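
The leading-strand observation reported above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the analysis actually performed in the Unit: the `Gene` record, the function names and the toy coordinates are all hypothetical, and the chromosome is assumed to be circular with a single origin and a single terminus of replication. A gene is counted as "leading" when it is transcribed in the same direction as the replication fork that copies it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record (not a GenoList structure)."""
    name: str
    start: int   # position of the gene on the chromosome
    strand: int  # +1 for the forward strand, -1 for the reverse strand


def on_leading_strand(gene: Gene, origin: int, terminus: int, genome_length: int) -> bool:
    """True if the gene is transcribed co-directionally with its replication fork."""
    # Distances measured clockwise (increasing coordinates) from the origin.
    to_gene = (gene.start - origin) % genome_length
    to_terminus = (terminus - origin) % genome_length
    # Between origin and terminus (clockwise) the fork moves towards increasing
    # coordinates, so forward-strand genes are co-oriented with it; on the other
    # replichore the fork moves the opposite way and the rule is mirrored.
    clockwise_replichore = to_gene < to_terminus
    return gene.strand == (1 if clockwise_replichore else -1)


def leading_fraction(genes: Iterable[Gene], origin: int, terminus: int,
                     genome_length: int) -> float:
    """Fraction of the given genes that lie on the leading strand."""
    genes = list(genes)
    if not genes:
        raise ValueError("at least one gene is required")
    leading = sum(on_leading_strand(g, origin, terminus, genome_length) for g in genes)
    return leading / len(genes)
```

Comparing `leading_fraction` for a list of essential genes with the value obtained for all genes of the same genome is one simple way to express the bias described in the text; the lists themselves would have to come from experimental data.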

At this stage, it is of prime importance to understand where the gene products are located inside the cell. The study of uridylate kinase by a group that recently joined the Unit (Anne Marie Gilles and Octavian Barzu) will give interesting information in this domain. Another approach, developed in the Unit for several years, is to understand the organisation in the cell of the production of molecules containing sulfur. This is because of the extreme versatility of this atom in terms of oxido-reduction states. The study of sulfur metabolism has therefore been emphasized (Isabelle Martin-Verstraete in Paris and Agnieszka Sekowska in Hong Kong) in particular because the knowledge of this atom was often extremely limited because of the difficulty of the genetic and biochemical studies of sulfur-containing molecules. We have unravelled several new pathways in Bacillus subtilis (one of the two most studied bacteria) and further characterized the methionine salvage pathway that was unfolded in part during the past years.

Finally, year 2003 witnessed the dangerous development of the outbreak of atypical pneumonia (Severe Acute Respiratory Syndrome) and we judged it important to participate in the fight against the disease, on the one hand with theoretical studies on the genomes of coronaviruses (at the HKU-Pasteur Research Centre), and on the other hand with an epidemiological model meant to give an idea of the origin of the disease and of its development (in collaboration with INRIA and the Department of Mathematics at the University of Hong Kong). The model proposed, that of a “double epidemic” caused by a first innocuous virus that can mutate in certain patients and lead to the phenomenon of SARS, fits well with observations in the field (in particular the large difference between diverse regions in China). This model suggests that the initial virus could persist as an endemic mild pathogen, and might occasionally lead to a resurgence of the disease. It is also interesting in that it suggests that the primary infection caused by the ancestral virus can protect against the disease, and that a vaccine could be developed (at least a vaccine with a significant protective effect, if not a long-lasting one). In collaboration with the Shanghai Genome Centre and the Guangdong SARS Consortium we participated in the analysis of the autumn 2003 SARS episode, showing that the virus is still evolving fast in its human and civet cat hosts.
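
The protective effect at the heart of the “double epidemic” idea can be illustrated with a deliberately simplified compartmental sketch. This is not the published model: the compartments, the parameter values and the function name `simulate` are assumptions chosen for illustration, the mutation step from the mild virus to the SARS virus is not represented, and protection after a mild infection is assumed to be complete and permanent. Susceptible people can be infected either by the mild virus (after which they become protected) or by the SARS virus.

```python
def simulate(beta_mild: float, beta_sars: float,
             gamma_mild: float = 0.2, gamma_sars: float = 0.15,
             population: float = 1_000_000.0,
             mild_seed: float = 100.0, sars_seed: float = 1.0,
             days: float = 1000.0, dt: float = 0.1) -> dict:
    """Euler integration of two epidemics competing for the same susceptibles.

    All parameter values are illustrative assumptions, not fitted estimates.
    """
    susceptible = population - mild_seed - sars_seed
    infected_mild, protected = mild_seed, 0.0
    infected_sars, removed = sars_seed, 0.0
    for _ in range(round(days / dt)):
        new_mild = beta_mild * susceptible * infected_mild / population * dt
        new_sars = beta_sars * susceptible * infected_sars / population * dt
        recovered_mild = gamma_mild * infected_mild * dt
        recovered_sars = gamma_sars * infected_sars * dt
        susceptible -= new_mild + new_sars
        infected_mild += new_mild - recovered_mild
        protected += recovered_mild      # mild infection confers protection
        infected_sars += new_sars - recovered_sars
        removed += recovered_sars        # recovered or deceased SARS cases
    return {
        "susceptible": susceptible,
        "infected_mild": infected_mild,
        "protected": protected,
        "infected_sars": infected_sars,
        "removed": removed,
        "sars_total": infected_sars + removed,
    }


# Same SARS virus in two hypothetical regions: one where the mild virus
# circulates, one where it is absent.
with_mild = simulate(beta_mild=0.6, beta_sars=0.3)
without_mild = simulate(beta_mild=0.0, beta_sars=0.3, mild_seed=0.0)
```

Under these assumed parameters the cumulative number of SARS cases (`sars_total`) is far smaller in the region where the mild virus spreads first, which is the qualitative behaviour the text invokes to explain regional differences.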

Summary of the 1995-2000 activity at the Regulation of Gene Expression Unit

Created in 1986, the Regulation of Gene Expression Unit analysed the nature of heredity, which stores the information required to generate life, and focused on the determination of gene functions in reference genomes, coupling computer prediction (experiments in silico) with experiments in vivo. The scientists in the Unit investigated how the thousands of genes in the chromosome of a cell co-operate in an organised manner in an ever-changing environment. Their studies were guided both by the results of experiments in vivo, which allowed the scientists to identify genes placed higher and higher in the hierarchy of genetic controls of cell life, and by the spectacular progress of molecular biology. Two reference micro-organisms were used: Escherichia coli, the most long-standing genetic model; and Bacillus subtilis, a source of numerous enzymes used by industry, often found on the surface of leaves, and abundant in soil. The studies developed in the Unit identified genes that are critical to the overall adaptation of the bacterium to its environment, and investigated in particular the metabolism of molecules essential for the cell's construction, that of sulfur and polyamines. Two-dimensional gel electrophoresis of all the proteins in the bacteria was used to describe the co-variations in the concentrations of particular proteins as a function of the growth status of the bacteria, their environment and changes in the genes studied. Combined with the analysis of the whole set of transcripts (expression profiling) and with analysis of the genome sequences, typical of the new field of research now called "genomics", this approach provided a wealth of information. It showed that there are large groups of genes within the cell that are regulated in the same way. A mass of information was generated by sequencing genomes, while many of the newly identified genes were found to be enigmatic in nature.
To contribute to their understanding, molecular genetic studies in the Unit were being complemented by research involving the most up-to-date techniques in computer data management, statistics and mathematics.

Two specialized databases were constructed in collaboration with the University Paris 6 (Atelier de Bioinformatique) and the University of Versailles (they are available on the World-Wide Web: SubtiList and Indigo). Biological and naturalist aspects of the work were emphasised, to identify the major functions of living organisms. In particular, the first analyses of the genomes led to a remarkable observation: the order of genes on the chromosome is correlated with the cell's architecture. Indeed, the gene order in genomes is not random, and there are experimental hints suggesting that the map of the cell may be directly related to the chromosome structure. The first results of the in vivo, in vitro and in silico investigations aimed at understanding the selection pressure that underlies these architectural constraints suggested the systematic existence of supra-macromolecular complexes. Their components have their genes distributed in a non-uniform way along the chromosome, and they probably constitute structures of 10 to 50 nanometers that form the core of the cell's organization.

A genome-wide view of the coordination of gene expression

1. The Bacillus subtilis genome sequence (P. Glaser, MF Hullo, with several students for variable periods of time)

In 1996, the very short genomes of two bacteria had been published by TIGR, and the yeast genome sequence was about to be completed. Started almost ten years earlier, the B. subtilis genome program was well on its way, and a BIOTECH grant from the European Union was supporting a consortium of European laboratories for completing the sequence, expected to be finished at the end of year 1998. The Japanese consortium was also well on the way to completing its part. However, it appeared important to speed up our efforts in order to be present on the international scene at a moment when many laboratories were beginning to be interested in the outcome of genome programs. Together with Frank Kunst, the European coordinator of the program, we decided to speed up the procedure by involving in the sequencing effort laboratories which had been part of the yeast genome program. This was made somewhat difficult because many regions of B. subtilis DNA, as with all A+T-rich Gram-positives, are impossible to clone in standard E. coli hosts. We therefore combined standard cloning procedures in a special E. coli strain constructed for this purpose (TP611), cloning into B. subtilis itself, and Long Range PCR (without cloning) for the most difficult regions. This allowed us to obtain the complete genome sequence in April 1997, well before the expected time, and to distribute it to the members of the consortium. In addition, we chose to distribute to the yeast teams some regions where we suspected the presence of errors, so that they would be sequenced again, resulting in excellent accuracy (of course, this was not said to the relevant sequencing groups, to avoid useless conflicts inside an exemplary collaborative effort). The complete sequence was presented at the International Bacillus Meeting in Lausanne in mid-July 1997, and the sequence was made public in parallel with its publication in November of that same year.

2. Data bases and genome annotation platforms (Maude Klaerr-Blanchard, Claudine Médigue, Ivan Moszer, Eduardo Rocha, in collaboration with Louis Jones at the Service Informatique Scientifique, and several laboratories external to the Institut Pasteur)

The sequence and its annotation are displayed in the relational database SubtiList, derived from its prototype Colibri (built for the E. coli genome). SubtiList still received several thousand queries per day more than two years after the sequence was published.

Annotating a genome is a never-ending process. Indeed, SubtiList is regularly updated, and the last update, just after the Genome 2000 International Meeting at the Institut Pasteur in April 2000, provided identification for several hundred new genes. To prevent misannotation and the propagation of errors, we have assigned a special code name to all genes which have not been explicitly identified by their function (i.e. experimentally, in vivo or in vitro). In agreement with Amos Bairoch (SwissProt), we chose to make all these gene names begin with the letter "y". The code we have used follows Demerec's rule for gene nomenclature as closely as possible, despite much discussion from the community of B. subtilis scientists, who often stick to old names, often for no reasons other than purely anecdotal ones. We think that harmonizing nomenclature is very important for the future of the genetics of genomes.
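
As a small illustration of the naming convention just described, the sketch below flags provisional gene names. It assumes the usual Demerec-style pattern (three lowercase letters, optionally followed by one capital letter) and treats any such name starting with "y" as provisional; the regular expression and the function names are assumptions made for this example, not code from SubtiList.

```python
import re

# Assumed Demerec-style pattern: three lowercase letters, optional capital letter.
_GENE_NAME = re.compile(r"[a-z]{3}[A-Z]?")


def is_demerec_style(name: str) -> bool:
    """True if the name matches the assumed three-letter(+capital) pattern."""
    return _GENE_NAME.fullmatch(name) is not None


def is_provisional(name: str) -> bool:
    """True for well-formed names beginning with 'y' (function not yet established)."""
    return is_demerec_style(name) and name.startswith("y")


# Gene names taken from this report.
for gene in ("yflG", "ytqI", "metI", "pyrH", "map"):
    print(gene, "provisional" if is_provisional(gene) else "named")
```

Such a check only reflects the spelling of a name; whether a gene's function has actually been established remains a matter of curation.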

Careful annotation required an elaborate approach in terms of computer science. In collaboration with Alain Hénaut and Jean-Loup Risler from the Université de Versailles Saint-Quentin (subsequently, Evry), François Rechenmann from INRIA Rhône-Alpes and Alain Viari from the Atelier de BioInformatique at the University Paris 6 (then INRIA, Grenoble), the Unit created an original strategy allowing genome annotation in silico. In this strategy the concept of "neighbourhood" has been favoured as a way to help discovery.

This strategy developed a succession of three relatively independent levels. Each level comprised a generic and a specific component. The goal behind the creation of the process was conceptual. It aimed at the prediction of essential biological functions using the genomic text, together with the associated biological knowledge distributed in scientific publications and data libraries. The precise goal was to identify crucial experiments (to be performed in "wet" laboratories) that could confirm or falsify the predictions. The approach was illustrated (see below) in the case of polyamine metabolism.

The three levels of the process were:

1. sequence data and annotation management: SubtiList and Colibri
2. a platform for sequence annotation: Imagene
3. a platform as an aid to discovery (technique of "neighbourhoods"): Indigo

Each level was made of a generic software engine, together with specific data. The aim of the process was to define a set of three coupled software engines. Each specific application of the process yielded as many valuable results as there were sequences, annotations or predictions created.

1. SubtiList is constructed from an engine for the management of genomic databases. It is composed of three parts:

1.1. A data scheme structure, GenoList;
1.2. A data base management system (4th Dimension for stand alone applications and Sybase for the WWW database);
1.3. A user interface, where appropriate with specific procedures for data exploitation (e.g. Blast, Fasta, and other rapid methods for sequence analysis). The interface can be reconstructed knowing simply the World-Wide Web access to it, but this can only be done properly knowing 1.1. These user-oriented features explain why there are not yet any equivalent specialized bacterial databases.

To construct a specialized database (SubtiList, Colibri, TubercuList, and recently PyloriGene, and a new set of databases organized along a new structure in Hong Kong, for a large number of bacterial genomes) it was necessary to introduce sequence data and their annotations into the GenoList engine. This input required a generic procedure. The value of the specialized databases has two sources: on the one hand the genericity of the GenoList engine and its user-friendly and biologically oriented construction, and on the other hand the quality of the sequences (above all, of their annotation). This value diminishes (respectively increases) as time elapses if annotations are not (respectively, are) curated. Curation of a set of annotations leads to an important appreciation in value, in parallel with the creation of a know-how that is extremely difficult to reproduce. The Regulation of Gene Expression Unit was curating the B. subtilis annotations.

2. Imagene is a generic engine which allows management and strategic organisation of both biological objects (sequences, annotations, images, …) and methods for analysis or management, within the same platform. It is meant to make an in-depth analysis of genomes locally, for a fine description of their properties. It has been validated on the specific example of B. subtilis, by permitting identification of all its coding sequences and regions of transcription termination. It was used to predict regions that carried errors due to the sequencing process. These regions were PCRed out of the chromosome and resequenced by an independent team.

2.1. The platform was constructed in such a way as to allow one to plug in easily any method for genome analysis (even methods for which the source code is not available, or methods located far away but available through the Internet);
2.2. It allows the chaining of methods, the definition of strategies, and, if needed, the ability to step backwards during the chaining of methods;
2.3. It possesses a generic visual interface (APIC) permitting one to start and control the progress of methods and of their results. APIC allows one to superimpose the results of entirely independent methods on the same screen. It permits direct access to their results.
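
The plug-in and chaining ideas listed in 2.1 to 2.3 can be pictured with a minimal sketch. Everything in it (the `Strategy` class, the two toy methods and the toy sequence) is hypothetical and is not Imagene's actual design; it only illustrates independent methods being plugged in, chained, and stepped back.

```python
from typing import Callable, Dict, List

# A "method" reads the results accumulated so far and returns new results.
Method = Callable[[Dict[str, object]], Dict[str, object]]


class Strategy:
    """Hypothetical chain of analysis methods with the ability to step back."""

    def __init__(self, sequence: str) -> None:
        # history[i] holds the results available after i methods have run
        self.history: List[Dict[str, object]] = [{"sequence": sequence}]

    def run(self, method: Method) -> Dict[str, object]:
        current = dict(self.history[-1])
        current.update(method(current))
        self.history.append(current)
        return current

    def undo(self) -> Dict[str, object]:
        # Go back one step, never past the initial state
        if len(self.history) > 1:
            self.history.pop()
        return self.history[-1]


def gc_content(results: Dict[str, object]) -> Dict[str, object]:
    seq = str(results["sequence"])
    return {"gc": (seq.count("G") + seq.count("C")) / len(seq)}


def find_starts(results: Dict[str, object]) -> Dict[str, object]:
    seq = str(results["sequence"])
    starts = [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]
    return {"atg_positions": starts}


strategy = Strategy("ATGGCCATGTAA")   # toy sequence, for illustration only
strategy.run(gc_content)
strategy.run(find_starts)
```

Each method only sees a dictionary of earlier results, so a method running elsewhere could in principle be wrapped in the same signature; `undo()` gives the "step backwards" behaviour by discarding the most recent layer of results.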

Two special features give added value to Imagene as time elapses. On the one hand the data can be organised in such a way as to construct efficient specialised strategies for genome annotation. On the other hand, the number of methods for analysis that are plugged in can increase without limitation. If they are accessed through the Internet the engine will know how to start them and recover their results. It will also know how to integrate them into its strategies. As a consequence the rational integration into new strategies of old and new methods will increase with time and use. Among the methods, data management is a special case: it is therefore quite possible to think about plugging specialised databases into Imagene. This will be seen by Imagene as a special task, data management. One can also think about creating relationships between sequence and annotation data (neighbourhoods, see Indigo, next paragraph).

3. Indigo is a prototype platform used as an aid to discovery, meant to find neighbours that are difficult to predict among the various functions related to genes (P. Nitschké, C. Hénaut, in collaboration with P. Guerdoux-Jamet and A. Hénaut).

Indigo is organized in a simple way around a hierarchy of flat files, all centered around gene names, and corresponding to homogeneous classes of features (such as codon usage bias, proximity in the chromosome, in metabolism, in isoelectric point of the gene products, in functional class, in literature articles, etc). It is clear that many other types of neighbourhoods should be considered as well, including quite elaborate ones. As an immediate goal for the improvement of Indigo one must create a data structure for the neighbourhood relationships. The published prototype is only meant to demonstrate the feasibility of the approach. It illustrates the possibility of making interesting discoveries, even with the limited means allocated at present. Indigo is superficially organised as GenoList is. It possesses an engine (written in Java) that overlaps with the user interface. It is applied to specific data (at the present time E. coli, B. subtilis and Arabidopsis thaliana). One must therefore note that, even more than in the case of specialised databases, the value of a specialised Indigo is directly linked to the quality of the data included. The corresponding information results from annotation steps (statistical analysis for example, which could of course be produced by a strategy included in Imagene), but also from the extraction, at this time manual, of literature neighbours. The creation of appropriate files of this type could rapidly acquire a great value (if they are not publicly available). Finally, because Indigo is a method, it can, in principle, be plugged into Imagene.
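
The idea of neighbourhoods keyed by gene name can be illustrated with a minimal sketch. It is not Indigo's code (which the text describes as a Java engine over flat files): the tables, the placeholder gene names, the values and the function names are all invented for illustration.

```python
from collections import defaultdict

# Toy "flat files": each maps a gene name to one feature value.
# Gene names and values are placeholders, not real data.
position_kb = {"geneA": 10.0, "geneB": 12.5, "geneC": 900.0, "geneD": 14.0}
isoelectric_point = {"geneA": 5.1, "geneB": 9.8, "geneC": 5.3, "geneD": 5.0}


def numeric_neighbours(table, gene, max_distance):
    """Genes whose value lies within max_distance of the query gene's value."""
    ref = table[gene]
    return {g for g, v in table.items()
            if g != gene and abs(v - ref) <= max_distance}


def shared_neighbours(gene, neighbourhoods):
    """For each other gene, count in how many neighbourhoods it is close to `gene`."""
    counts = defaultdict(int)
    for table, max_distance in neighbourhoods:
        for other in numeric_neighbours(table, gene, max_distance):
            counts[other] += 1
    return dict(counts)


# Neighbours of geneA by chromosomal position (within 5 kb) and by pI (within 0.5)
hits = shared_neighbours("geneA", [(position_kb, 5.0), (isoelectric_point, 0.5)])
```

Genes that turn up as neighbours under several unrelated criteria (here `geneD`) are the ones such an approach would flag for closer inspection.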

This set of coordinated approaches has been used to set up an international network financed by the European Science Foundation.

3. There is a map of the cell in the chromosome (I. Moszer, E. Rocha, in collaboration with A. Hénaut and A. Viari)

Knowledge of whole genome sequences is a unique opportunity to study the relationships between genes and gene products at the global level of the cell's architecture. Part of the difficulty of this study comes from the fact that, contrary to a generally accepted intuitive idea, there is often no predictable link between structure and function in biological objects. However, as the outcome of natural selection pressure, there must exist some fit between genes, gene products and the survival of the organism. This indicates that observing biases in features which would be expected to be unbiased is the hallmark of some selection pressure.

This prompted us to study global properties of complete genomes. A first analysis of the word content of genome texts suggested that they are not all managed in the same way. We therefore concentrated on long exact repeats, and discovered that, contrary to what could be expected, the shortest genomes (the Mycoplasmas) had the highest repeat frequency. Also, genomes of comparable sizes such as those of E. coli and B. subtilis manage repeats in entirely different ways. Repeats are present everywhere in the former genome, while they are very rare, and in close proximity (ca 10 kb), in the latter. In contrast, when we studied the distribution of words, bases or codons in the leading strand as compared to the lagging strand, we made an extremely surprising discovery. There is such a strong bias in one strand as compared to the other (the leading strand is G+T-rich, while the lagging strand is A+C-rich), that the bias is reflected in the amino acid composition of the proteins encoded by each strand (valine-rich for the leading strand, isoleucine+threonine-rich for the lagging strand)! This bias is not present in all genomes (it seems to be absent from genomes of bacteria having an important proportion of membranes, such as the methanogens or the cyanobacteria), but, when present, it is universally the same.
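
The strand asymmetry described above can be made concrete with a small sketch. It is an illustration only, not the analysis used by the Unit: the function names and the toy sequence are assumptions. It measures the excess of G+T over A+C on one strand and shows that the complementary strand necessarily carries the opposite excess.

```python
def gt_excess(seq: str) -> float:
    """(G+T) minus (A+C), as a fraction of counted bases; positive means G+T-rich."""
    seq = seq.upper()
    gt = seq.count("G") + seq.count("T")
    ac = seq.count("A") + seq.count("C")
    total = gt + ac
    return (gt - ac) / total if total else 0.0


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


toy = "GGTTGATTGC"                       # invented sequence, for illustration
leading = gt_excess(toy)
lagging = gt_excess(reverse_complement(toy))
```

Because complementation turns every G into C and every T into A, `lagging` is always the negative of `leading`; the biological observation is that the sign of the excess follows the direction of replication.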

Among other consequences, all these observations tell us that genes do not move as frequently, or as easily, as is often implicitly assumed. There must exist, therefore, constraints in the gene organisation of a chromosome.

Because the genetic code is redundant, coding sequences can be studied by analysing their codon usage. If there were no bias, all codons for a given amino acid should be used more or less equally. In contrast, it has long been observed in E. coli that genes could be split into three classes according to the way they use codons. The same was true for B. subtilis. Yet, random mutations should somehow smooth out differences. This is not the case: indeed, for leucine, where six codons are used, we find that the CUG codon is used in more than 70% of the cases in genes that are expressed at a high level during exponential growth conditions, while CUA is used in less than 2% of the cases. What is the source of such biases? There might exist a systematic effect of context, some DNA sequences being favoured or selected against. While this could be true for some codons, this cannot be generalized. We know that translation of mRNA into proteins requires the action of transfer RNA adaptor molecules. Because there are fewer tRNAs specific for a given amino acid than there are codons, some tRNAs must read several codons. A bias in the concentration of tRNAs might thus result in a bias in codon usage. Therefore we must analyse selection pressure occurring at the level of tRNA synthesis. This is the generally accepted reason to account for the codon usage biases. Unfortunately, two reasons go against this interpretation. Firstly, in much the same way as there would be every reason to smooth out biases in codon usage, similar constraints would smooth out biases in tRNA synthesis. For example if a tRNA gene had a strong promoter, spontaneous mutations would tend to lower its efficiency, making transcription of this particular tRNA similar to that of its counterparts. This is true, unless there is selection pressure for the converse.
The second reason is that, while the strong bias in a given class of genes could be explained in this way, the same explanation cannot hold for a strong bias in another class of genes. However we know, both from the study of the E. coli and B. subtilis genomes, that two classes of genes display extremely strong, but different biases. And the same tRNA molecule cannot be both expressed at a high level, and not expressed at a high level…
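
The leucine figures quoted above are relative frequencies among synonymous codons. A minimal sketch of how such frequencies can be computed for one coding sequence follows; the function name and the toy sequence are assumptions made for illustration, and the codons are written in their DNA spelling (CTG for CUG, and so on).

```python
from collections import Counter

# The six leucine codons (CUU, CUC, CUA, CUG, UUA, UUG) in DNA spelling
LEUCINE_CODONS = ("CTT", "CTC", "CTA", "CTG", "TTA", "TTG")


def leucine_usage(cds: str) -> dict:
    """Fraction of leucine codons of each kind in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in LEUCINE_CODONS)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in LEUCINE_CODONS}


# Invented toy sequence: ATG CTG CTG TTA CTG GCT CTG TAA
usage = leucine_usage("ATGCTGCTGTTACTGGCTCTGTAA")
```

Applied gene by gene, tables of this kind are the raw material from which classes of genes with different codon usage can be distinguished.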

This requires looking for another explanation. The cytoplasm of a cell is not a tiny test tube. One of the most puzzling features of the organisation of the cell cytoplasm is that it must accommodate the presence of a very long thread molecule, DNA, and that this molecule must be transcribed as a multitude of RNA threads that usually have a length of the same order of magnitude as the length of the whole cell. This calls for some organisation of transcription, translation and replication so that mRNA molecules and DNA are not mixed up together all the time. The volume occupied by a ribosome is that of a cube with a 200 Å edge. In an E. coli cell growing exponentially in a rich medium there are at least 15,000 ribosomes. Thus, the fraction of the cell volume occupied by ribosomes is at least 12%. The actual volume of the cell free of ribosomes is in fact significantly smaller if one takes into account the volume occupied by the chromosome and by the transcription and the replication machineries. If one now considers that the translation machinery requires an appropriate pool of elongation factors, tRNA synthetases and tRNAs, it becomes clear that the cytoplasm behaves like a gel. In addition, simply counting the number of tRNA molecules sitting around a ribosome, it appears that one cannot speak about the concentration of such molecules, but only about a small, finite number. Compartmentalisation has been demonstrated to be important even for small molecules, despite the fact that they could diffuse quickly. As a consequence, a translating ribosome acts as an attractor of a certain pool of tRNA molecules. In such a case diffusion should only be considered locally. The cytoplasm therefore becomes a ribosome lattice, displaying relatively slow movements with respect to local diffusion of small molecules as well as macromolecules.
This provides an efficient selection pressure leading to adaptation of the codon usage of the translated message as a function of its position in the cell's cytoplasm. If the codon usage bias changes from mRNA to mRNA, this indicates that these different molecules do not see the same ribosomes in the usual life cycle of the organism. In particular if two genes have very different codon usage bias this indicates that the corresponding mRNAs are not made from the same part of the cell (it is indeed difficult to see how ribosomes sitting next to each other could attract different tRNA molecules).
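
The estimate of the cell volume occupied by ribosomes, on which this argument rests, can be checked with a short calculation. The ribosome size and count are those given in the text; the cell volume of 1 µm³ is an assumption introduced here (the text does not state it), chosen because it reproduces the quoted 12%.

```python
RIBOSOME_EDGE_ANGSTROM = 200.0   # from the text: cube with a 200 Å edge
RIBOSOMES_PER_CELL = 15_000      # from the text: lower bound in rich medium
CELL_VOLUME_UM3 = 1.0            # ASSUMPTION: not stated in the text

ANGSTROM_PER_UM = 1e4            # 1 µm = 10,000 Å

edge_um = RIBOSOME_EDGE_ANGSTROM / ANGSTROM_PER_UM   # 0.02 µm
ribosome_volume_um3 = edge_um ** 3                   # 8e-6 µm³ per ribosome
fraction = RIBOSOMES_PER_CELL * ribosome_volume_um3 / CELL_VOLUME_UM3

print(f"Ribosomes occupy about {fraction:.0%} of the cell volume")
```

A smaller assumed cell volume would give a proportionally larger fraction, which only strengthens the point that the cytoplasm is crowded.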

Several models of transcription account for a process where the transcribed regions are present at the surface of the chromoid, so that RNA polymerase does not have to circle the double helix it is unwinding and transcribing. Thus mRNA threads, usually structured at their 5' end, are pulled off DNA by the lattice of ribosomes, going from one ribosome to the next one, as does a thread in a wiredrawing machine (this is exactly the opposite of the textbook view of translation, where ribosomes are supposed to travel along fixed mRNA molecules). In this process a nascent protein is synthesized on each ribosome, spread throughout the cytoplasm by the linear diffusion of the mRNA molecule from one ribosome to the next one, avoiding the requirement for the much slower 3D diffusion of the protein. Polycistronic operons ensure that proteins with related functions are co-expressed locally, permitting channelling of the corresponding substrates and products. It seems likely that the structure of mRNA molecules is coupled to their fate in the cell, and to their function in compartmentalisation. The fate of mRNA is therefore an important feature of gene regulation. We have therefore investigated the degradation process of mRNAs, comparing data extracted from the genomes of B. subtilis and E. coli. This led us to identify a main function of the elusive enzyme polynucleotide phosphorylase, as producing CDP needed for DNA synthesis, thus coupling translation, transcription and replication together. If we consider genes translated sequentially in operons as physiologically and structurally relevant, we should also analyse mRNAs that are translated parallel to each other. Indeed if there is correlation of function and/or localisation in one dimension, there should also exist a similar constraint in the orthogonal directions. How would this be seen? This is where codon usage comes in again.
Indeed if ribosomes act as attractors of tRNA molecules, this implies a local coupling between these molecules and the codons they can use in the message they read. Obviously, this requires that the same ribosome mostly translates mRNAs having similar codon usage. This has the consequence that as one goes away from a strongly biased ribosome, there is less and less availability of the most biased tRNAs. In turn, there would be selection pressure for a gradient of codon usage bias as one goes away from the most biased messages and ribosomes. Transcripts are nested around central core(s), formed of transcripts for highly biased genes. This fits with what is seen of the general organisation of genes in the chromosome. In particular this agrees with the observation that the distance between E. coli genes oriented in the same direction on the chromosome is positively correlated to the expression level of the downstream gene.

Finally, the chromosomes must separate from each other and migrate into each of the daughter cells. There must exist some kind of repulsive force that pushes DNA strands away from each other. While there are probably gene products involved in this process, ribosome synthesis, in particular from regions near the origin of replication, performs exactly what is needed, by continuously creating new ribosomes. Continuous synthesis of ribosomes in between the replicating forks would also provide a mechanical stress on the bacterial wall in the middle of the cell. Koch has convincingly argued that the bacterial wall is indeed a stress-bearing fabric. If ribosome sources are organisers of the cell, mRNAs for genes highly expressed under exponential growth conditions should be located near the center of these organisers, while other mRNAs should be translated in nested layers, all the way to the ribosomes that are located near the cytoplasmic membrane, and that would be involved in cotranslational membrane protein localisation. Organisation of the genes in the chromosome should therefore show regularities that are linked to this architecture, as we have indeed observed. This gives us strong reasons to propose that genes along the chromosome specify the map of the cell, a kind of celluloculus.

A geneticist's view: master genes and intermediary metabolism

1. Cyclic AMP and adenylate cyclases: the discovery of a fourth cyclase class (M.-P. Coudart-Cavalli, P. Trotot, P. Biville, O. Sismeiro)

Cyclic AMP is a mediator of catabolite repression in bacteria. Curiously, despite the interest in this important process, not much was known about the rather elusive enzymes, adenylate cyclases, which make this molecule from ATP. By 1996, the work in the Unit had already discovered three main classes of these enzymes, which were apparently unrelated phylogenetically. Very remarkably, this work demonstrated that Gram-negative bacteria could differ in the nature of the adenylate cyclase they harboured: enterobacteria had one type, while myxobacteria or rhizobia had another type (a more ancestral form, presumably, since it is phylogenetically similar to the enzymes found in Eukarya). In the course of a screening for adenylate cyclases in bacteria related to enterobacteria, but differing from them, we made the surprising discovery that A. hydrophila harboured a fourth adenylate cyclase type, an enzyme closely related to proteins found in Archaea. This protein was found in all species of A. hydrophila investigated, but not in other Aeromonas sp. The counterpart of the gene was found in the Y. pestis genome, and shown to express adenylate cyclase activity (unpublished). The reason for this extraordinary variety in adenylate cyclases is not known. A fifth class has been discovered by Cotta et al., showing a remarkable case of phylogenetic convergence.

2. Global analysis of the H-NS protein function (P. Bertin, F. Hommais, O. Soutourina, C. Tendeng and several trainees)

To study the global regulation of bacterial metabolism, in particular in pathogenic microorganisms, we used the hns mutation in Escherichia coli as a reference system. Indeed, the H-NS protein is known to be involved in numerous functions in the cell and to affect the expression of genes regulated by environmental factors (temperature, osmolarity, ...). Three main topics were developed between 1996 and 2000.

Motility and/or flagellum biosynthesis have been frequently associated with virulence in various microorganisms. In enterobacteria, this process requires the expression of numerous genes scattered on the chromosome and organised in an ordered cascade. The fliC mRNA coding for flagellin and the FliC protein itself are absent in an hns mutant, which results in a loss of motility. Moreover, using transcriptional fusions, we showed that an hns mutation results in a 3-fold decrease in the expression of flhDC, the master operon which controls all other flagellar genes. This was the first example of positive control by H-NS so far described. Similar observations were made in a crp mutant, providing evidence that, like H-NS, the cAMP/CAP complex acts as an activator of flagellar gene expression. To know whether these regulators could affect flhDC expression by interacting with its promoter, we performed gel shift experiments using purified proteins. The results demonstrated that the flhDC promoter region is preferentially retarded in the presence of H-NS or CAP. Moreover, DNase footprinting experiments allowed us to determine precisely their binding sites on the flhDC regulatory region. In vitro transcription assays were performed in collaboration with S. Rimsky and A. Kolb (Unité de Physico-Chimie des Macromolécules Biologiques). Surprisingly, H-NS seems to repress flhDC transcription while the cAMP/CAP complex activates its expression. Finally, in a crp mutant, motility is restored in the presence of wild-type CAP protein but not in the presence of protein mutated in region I involved in the interaction with RNA polymerase. This suggests that the cAMP/CAP complex positively regulates flagellum synthesis by a direct interaction with the C-terminal part of the RNA polymerase α subunit. In contrast, the binding of H-NS to the same region cannot explain its positive control observed in vivo on flagellum synthesis.
In this respect, the existence of a long non-coding region between the +1 transcriptional start site and the ATG translational codon seems to play a crucial role in the control of the master operon by H-NS. Finally, to know whether a similar mechanism of flhDC regulation could be extrapolated to other organisms, we analysed the promoter region of a homologous operon recently identified in Photorhabdus luminescens, using a method allowing direct determination of the nucleotide sequence from genomic DNA. Our results demonstrated the presence of a cAMP/CAP binding site and of a non-translated region (unpublished observations). This suggests that, in this organism, the mechanism of flhDC regulation could be similar to that in E. coli.

The pleiotropic effect of the hns mutation led us to analyse the role of H-NS in bacterial physiology using large-scale technologies. In collaboration with C. Laurent-Winter (Laboratoire de Physico-Chimie des Macromolécules) and J.P. LeCaer (Laboratoire de Neurobiologie et Diversité Cellulaire, ESPCI, Paris), we demonstrated by two-dimensional gel electrophoresis that the synthesis and/or the accumulation of about 60 proteins was specifically altered in an hns mutant. Many of them were identified by microsequencing or by mass spectrometry. They were found to be involved in the bacterial response to various stresses (pH, osmolarity, ...). Moreover, to study the global effect of H-NS on gene expression in E. coli, we analysed, in collaboration with A. Malpertuy (Unité de Génétique Moléculaire des Levures), the transcriptome of an hns strain using DNA arrays. These experiments showed that the expression level of 200 genes was modified in a mutant strain (unpublished). Again, most of them are known to be involved in stress response. In particular, the high expression level of several genes induced by high osmolarity or low pH resulted in a strongly increased resistance to both stresses in the hns strain. Moreover, many H-NS target genes with unknown function were predicted to encode fimbriae which could play a major role in virulence processes. These observations provide evidence that an hns mutation cannot be simply considered as a loss of function but can provide a selective advantage to the cell with respect to some stressful conditions. Finally, these observations suggest that the main role of hns could be to control the proton availability in the periplasm of many Gram-negative bacteria.

Until recently, H-NS had been only characterised in enterobacteria. In collaboration with S. Goyard (Unité de Biochimie des Régulations Cellulaires), an H-NS-like protein was identified in Bordetella pertussis, the aetiological agent of whooping-cough. Its structural gene was isolated and sequenced. Its product showed a significant similarity with H-NS, in particular in the C-terminal domain. Moreover, the screening of databases allowed us to identify a related protein in Rhodobacter capsulatus. In silico analysis of their amino acid sequence (secondary structure prediction, presence of hydrophobic clusters, ...) in collaboration with R. Brasseur (Centre de Biophysique Moléculaire Numérique, Gembloux, Belgium) suggested that these proteins were structurally related. Moreover, amino acid sequence alignment demonstrated the existence of a consensus in their DNA binding domain. The structural genes of these proteins were cloned after PCR amplification and the proteins were expressed in an hns strain of E. coli. These experiments showed that all proteins are able to complement the phenotypic alterations in such a strain (loss of motility, reduction in growth rate, serine susceptibility, ...). Gel retardation experiments performed with purified proteins revealed a preferential binding to curved DNA similar to that of H-NS. Cross-linking experiments showed that, despite a low amino acid conservation in their N-terminal domain, these proteins are able to dimerise in vitro. These observations are the first demonstration that proteins structurally and functionally related to H-NS are widespread in Gram-negative bacteria. Moreover, by complementation of the serine susceptibility of hns mutants in E. coli, we recently isolated and characterised an hns-like gene in Vibrio cholerae, the agent of cholera. Similarly, in collaboration with P. Glaser (Laboratoire de Génomique des Microorganismes Pathogènes), we identified two H-NS-like proteins in P. luminescens, an entomopathogenic bacterium whose genome was sequenced at the Pasteur Institute and published in 2003. These results further supported the existence of a large family of H-NS-like proteins in microorganisms.

3. Pyrophosphate effects on Escherichia coli: a link with iron metabolism (F. Biville, E. Turlin, M. Perrotte, C.-K. Wun, and several trainees)

In the course of the study of cAMP synthesis in E. coli, the effect of pyrophosphate, a product of the reaction producing cAMP from ATP, was investigated. A first series of experiments demonstrated that, in a phosphate-rich minimal medium, pyrophosphate had a surprising growth-stimulating effect. This effect resulted in a significant modification of the expressed proteome pattern of the cells. This could not be due to phosphate starvation, and the first hypothesis which came to mind was that energy from the energy-rich bond of the molecule was somehow recovered by the cell. However all experiments meant to explore this hypothesis were unsuccessful. In particular the non-hydrolysable analog methylene diphosphate had an effect similar to that of pyrophosphate. Analysis of the metabolic activities which varied upon pyrophosphate addition suggested that the tricarboxylic acid cycle was somehow involved. Further exploration demonstrated that the pyrophosphate effect is mimicked by addition of excess iron to the medium. This demonstrated first that, even in a medium supplemented by 5 mM iron, there is still some iron deficiency in a phosphate-rich minimal medium, and, second, that the pyrophosphate molecule somehow helps the cell to scavenge existing iron in the environment in a way which permits it to thrive on a low iron level (M. Perrotte thesis). Preliminary work demonstrated that a phosphorelay system (two-component regulator) of unknown function is involved in this process. When unraveled this will add interesting information on a set of genes of unknown function in the genome of E. coli and will contribute to improving its annotation. This work was discontinued when F. Biville left the Unit.

4. Functional analysis of the B. subtilis genome: polyamines and sulfur metabolism (JY Coppée, P. Glaser, M.-F. Hullo, I. Martin-Verstraete, E. Presecan, A. Sekowska, C.-K. Wun)

Among the aims of functional genome analysis is the possibility to rapidly reconstruct entire metabolic pathways. This cannot be done using in silico analysis alone, because many proteins have a common descent. As a result, related activities often share similar sequences (e.g. a decarboxylase specific for a given amino acid must be similar to its counterpart specific for another amino acid). We have therefore constructed relatively rapid tests on plates with molecules or ions that could help us to trace as efficiently as possible genes involved in integrated metabolic pathways. Amino acid metabolism is not well described in B. subtilis, and although quite a few gene similarities point to expected enzyme activities, it is necessary to validate the hypotheses derived from these similarities. We used amino acid analogs or certain types of antibiotics as a way to achieve this goal. In addition, we set up several growth condition tests (in particular for swarming or gliding on plates) to test for more subtle phenotypes (A. Sekowska, thesis dissertation).

In the course of this systematic analysis, we remarked the importance of intermediary metabolism activities. In particular, polyamines, although dispensable under routinely used laboratory growth conditions, are extremely important for the cell. They are involved in macromolecular syntheses, and in particular in modulating the accuracy of translation, at steps which may be essential for survival of the cell populations. Their importance is reflected by the fact that their biosynthesis is energy costly. This is especially true for the larger molecules, such as spermidine, spermine and their analogues. In particular, spermidine synthesis requires S-adenosylmethionine (AdoMet) as a precursor. Surprisingly, AdoMet is not used as such in the reaction but is first decarboxylated to 3-aminopropyl-S-adenosine (dAdoMet). The aminopropyl- moiety of the substrate is subsequently transferred onto one of the amino-terminal ends of putrescine, to generate spermidine. A further transfer on spermidine yields spermine in some organisms.

Transamination and decarboxylation are ubiquitous steps in intermediary metabolism. They are generally achieved by enzymes carrying pyridoxal phosphate as a co-enzyme. However, a noteworthy feature of the known AdoMet decarboxylation reaction is that it is achieved by an enzyme carrying not a pyridoxal but a pyruvoyl group as the catalytic residue. Pyruvoyl enzymes perform a limited number of varied decarboxylation reactions, including the decarboxylation of AdoMet in Eukarya and Gram-negative bacteria. Combining gene disruption experiments and biochemical identification of polyamines, we unravelled the main features of polyamine biosynthesis in B. subtilis, showing that the predominant pathway proceeds from arginine via agmatine. We also observed that, in contrast to E. coli, B. subtilis does not maintain a significant intracellular pool of putrescine under conditions where the level of spermidine is similar to that found in E. coli. We further identified the pathway leading to the addition of an N-propylamine group to putrescine, creating spermidine. This reaction yields the sulfur-rich molecule methylthioadenosine (MTA) as a by-product. We identified the nucleosidase encoded by the mtnN (yrrU) gene as the first enzyme implicated in its recycling. By gene disruption, in vitro mutagenesis, cell-free protein synthesis and biochemical analysis of polyamines, we showed that the unknown gene ytcF, renamed speD, codes for the decarboxylase. Analysis of the phylogenetic relationships among bacterial enzymes demonstrated that the B. subtilis enzyme is very similar to several predicted proteins of unknown function from Archaea. The MJ0315 gene, which presumably encodes an AdoMet decarboxylase of Methanococcus jannaschii, was used to complement B. subtilis ytcF and E. coli speD mutants and was expressed in a cell-free system. We could thus identify for the first time the nature of the corresponding speD gene and protein in Archaea.

While the number of genome sequences increases exponentially it remains difficult to identify gene functions explicitly. Automatic annotation procedures rest mostly on sequence comparisons. They are used to build up phylogeny trees, where reference activities are assumed to spread to neighbours by contiguity. The corresponding functions are thus described tentatively as identical to that of the known reference. However, these methods do not address the central question of enzyme recruitment for new activities. Furthermore, genes and proteins are not simply sequences of letters: they are made from chemicals deriving from the cell metabolism, and a single gene alteration may result in a general base or amino acid content bias, changing the "style" of an organism, possibly altering its place in calculated phylogenies, and thus leading to wrong assignments of enzyme activities. Ouzounis and Kyrpides constructed an interesting evolutionary tree of agmatinases, with emphasis on their universal presence. Since this seminal work, many new sequences have been obtained and annotated by their similarity with the known sequences. We undertook a comparative analysis of the corresponding set of sequences. Genes that were deemed important were cloned and attempts were made to identify their functions. We first considered the usual types of phylogeny trees constructed on the variation of the amino acid sequence in these proteins, without taking into account the presence of gaps in the sequences. Several discrepancies with respect to the expected position of some organisms in the trees were found. In a second approach, we reconstructed trees based only on the presence and evolution of gap-containing regions in the sequences, because gaps would be much less sensitive to genetic drift or amino acid metabolism. The crucial enzyme activities that presumably evolved from ancestral ureohydrolases were validated by cloning, expressing and measuring activity of the corresponding enzymes.
The emerging picture is consistent with a bacterial origin of hydrolases (ureohydrolases and related activities), which later evolved to those of the Archaea and the Eukarya. Our experiments therefore validate the use of gap-trees in the prediction of gene function.
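
The principle behind trees built only from gaps can be sketched as follows. This is a hypothetical illustration and not the procedure used in the work described: each aligned sequence is reduced to a presence/absence profile of gaps, and two sequences are compared by counting the alignment columns where only one of them is gapped. The toy alignment, sequence names and function names are all invented.

```python
from itertools import combinations


def gap_profile(aligned: str) -> tuple:
    """1 where the aligned sequence has a gap ('-'), 0 elsewhere."""
    return tuple(1 if ch == "-" else 0 for ch in aligned)


def gap_distance(a: str, b: str) -> int:
    """Number of alignment columns where exactly one of the two sequences is gapped."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(gap_profile(a), gap_profile(b)))


# Invented toy alignment: seq1 and seq2 differ in residues but share their gaps
alignment = {
    "seq1": "MKV--LAG",
    "seq2": "MRV--IAG",
    "seq3": "MKVTTLA-",
}

distances = {
    (p, q): gap_distance(alignment[p], alignment[q])
    for p, q in combinations(sorted(alignment), 2)
}
```

In this toy case `seq1` and `seq2` are at distance 0 although their residues differ, which is the property the text appeals to: amino acid substitutions do not affect a gap-based comparison. A matrix of such distances could then be passed to any standard tree-building method.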

All this work prompted us to analyse the related metabolism of sulfur (A. Sekowska, thesis dissertation and review article), still poorly described in most organisms, and this was to be a central area of the research in functional genomics developed in the next few years both at the HKU-Pasteur Research Centre and at the Genetics of Bacterial Genomes Unit.
