Subtractive and comparative genomics

History: Research Programme of the HKU-Pasteur Research Centre (2000-2003)

Summary

The collaborative project aimed at setting up a laboratory of functional bacterial genomics at Hong Kong University in association with the Institut Pasteur in Paris, combining in vivo and in vitro experiments with in silico analysis of the genomes. The Vice-Chancellor of the University of Hong Kong viewed the Centre as the Microbiology pole of the Genome Centre to be created at the University, for the development of genomics in Hong Kong and the region. In order to demonstrate its feasibility, the project used two model bacteria, Escherichia coli and Bacillus subtilis, together with poorly known counterparts that are important for the control of pathogens and for human health and the environment, Photorhabdus luminescens and Bacillus cereus (anthracis). These latter bacteria were in addition chosen for their appropriate behaviour under laboratory conditions.

Many publications came out of this programme (they can be seen as publications marked with a star at the site of the Genetics of Bacterial Genomes Unit) and several genome programmes were completed during this programme as a collaborative effort: Leptospira interrogans (with the Genome Centre in Shanghai), Staphylococcus epidermidis (a Firmicute, in collaboration with Fudan University in Shanghai) and, naturally, the genome of Photorhabdus luminescens (see also our databases, that are continuously evolving).

Details of the research planned

1. Introduction

Louis Pasteur used to say that "Scientific knowledge is the heritage of mankind". Because the final goal was to develop an infrastructure based on the creation and use of normalised genomic data sets, namely genome sequences and their functional annotation, the involvement of a broad community, in particular at places that had recently been more or less kept out of the mainstream of scientific production, was perceived at the outset of the Centre as a critical issue. Furthermore, it was clear, following the remarks of Louis Pasteur himself, that one should never speak of applied research but, rather, of the application of discoveries made by scientists involved in research: applications first need discoveries. This meant that, to study agents causing diseases, for example, one first needed to study appropriate models that would be amenable to in-depth experimental investigation. Two classes of bacteria were important in this respect, the Firmicutes (the model of which is Bacillus subtilis, and which comprise the deadly agent Bacillus anthracis) and the enterobacteria (the model of which is Escherichia coli, the best known organism, and which comprise, as we shall see later on, Photorhabdus luminescens, an insect pathogen that would be extremely interesting in the fight against mosquitoes, which might soon plague Hong Kong again because of general environmental changes).

Indeed we recognized that the added value of the resulting concepts, experimental results and products would be directly proportional to our capacity to stimulate the production and exchange of very heterogeneous data sets (physiology, functional analysis, structures, …). This required the promotion of a standard at an international scale. China, with its ancient civilisation and its huge population, was certainly a place that first deserved consideration in this respect. More specifically, to allow easy and efficient interaction with laboratories everywhere in the world, it was clear that a place already well connected with the rest of the world (in particular in terms of connection to the Internet) was needed. Hong Kong, with its century-old University, seemed remarkably suited as the starting point of a project in genomics meant to have important developments in the future. It should be emphasised at this point that this project was conceived as an integrative and collaborative project that would be of interest for a variety of different laboratories at HKU and elsewhere in Hong Kong and in China in general.

2. State of the art (as seen at the beginning of year 2000, see our databases)

Genome sequencing, now a readily available technology, provides us with an immense variety of genes and gene products. Unfortunately, knowing only the genome text does not generally bring about the real knowledge that we wish to possess on the function of individual genes, nor on the way they collectively interact. As a matter of fact, it becomes necessary to annotate the genome text, i.e. to correlate the text with the corresponding biological knowledge. Identification of gene functions requires the combination of several domains of experiments. This research activity is usually named functional analysis (or functional genomics). It makes use both of computers (in silico analysis, for short, Danchin et al. 1991), and of molecular genetics and biochemistry (in vivo and in vitro experiments). Genome sequencing projects combined with functional analysis produce a flow of data which puts a particular focus on relationships between the biological sequences. The need for models, methods and tools to manipulate in vivo, in vitro and in silico sequence networks or clusters is becoming more and more compelling. Indeed, the current lack of such infrastructure obviously creates a bottleneck for the exploration and exploitation of the enormous amount of biological information thus made available. The objective of the current research programme is precisely to illustrate, with model cases, how such research could be organised and rapidly rewarding. This should trigger a broad discussion to initiate a further effort to build larger infrastructures that would readily place Hong Kong and China among the leading few in functional genomics.

Until recently, there existed only a few major projects of functional analysis of model genomes, among them one for B. subtilis (Harwood and Wipat 1996) and another for S. cerevisiae (Eurofan). A project for the best known model, E. coli, is being started in Japan, where, in parallel, functional analysis of the model cyanobacterium Synechocystis PCC 6803 is being carried out (at the Kazusa Institute). In these projects, each gene of the organism is disrupted, one at a time, and fused to reporter genes, in the hope of discovering phenotypes that may help to identify the corresponding functions. To our knowledge we do not possess heuristics permitting direct access to unknown functions, and apart from preliminary studies by two of the participating applicants there does not exist similar in silico work in the world. However, an excellent illustration of the concept of neighborhood has been developed with the software Entrez, created by D. Lipman and colleagues at the NCBI. The discovery of a major function of a long known enzyme, polynucleotide phosphorylase, is a direct illustration of the outcome from combining different neighborhoods (Danchin 1997, Nitschké et al. 1998). Significant development has been achieved on the experimental side of technology. However, this has been much slower than initially expected, and protein identification by mass spectrometry, although efficient, is slower than usually assumed. The same is true for global analysis of the transcription products. The results obtained at the end of this work will therefore be, in any case, a useful contribution to our basic knowledge of bacteria.

3. Genomics of Photorhabdus luminescens and Bacillus cereus (anthracis)

Two contrasting images depict genomes: at first sight genes appear to be distributed randomly along the chromosome. In contrast, their organisation into operons or pathogenicity islands suggests that, at least locally, related functions are in physical proximity. To try and understand genome organisation, we proposed to explore in vivo, in vitro and in silico the distribution of genes along the chromosome, generalising the concept of neighborhood to many more types of vicinities than succession in the genomic text. Our first observations suggested that this order is far from random, but is indeed linked to the function of genes in relation with the cell's architecture (Guerdoux-Jamet et al. 1997, Danchin & Hénaut, 1997). These results were fragmentary, but they have now been experimentally validated (see Publications with the Centre). In collaboration with the newly created Unit of Genetics of Bacterial Genomes and the Laboratory of Genomics of Microbial Pathogens at the Institut Pasteur, we combine in silico analysis of the genomes (bioinformatics) of two organisms, Photorhabdus luminescens and Bacillus cereus (or its neighbour B. anthracis), related respectively to two model organisms, Escherichia coli and Bacillus subtilis, with their study in vivo (reverse genetics and physiological biochemistry, in particular using two-dimensional protein electrophoresis coupled to mass spectrometry, in the Laboratory of Structural Chemistry of Macromolecules but also at the Pasteur Institute's Genopole Technological Platforms) and with biochemical and structural analyses using optical and electron microscopy, to further explore this hypothesis.
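The simplest of these neighborhoods, physical proximity on a circular chromosome, can be illustrated with a short sketch. This is not the method used by the Centre; it is only a minimal illustration, assuming that gene positions are given as base-pair coordinates. The function names, the permutation test and all numerical values are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def circular_distance(a: int, b: int, genome_length: int) -> int:
    """Shortest distance (in bp) between two positions on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)


def mean_pairwise_distance(positions, genome_length):
    """Average circular distance over all pairs of positions in a gene group."""
    return mean(
        circular_distance(a, b, genome_length)
        for a, b in combinations(positions, 2)
    )


def clustering_p_value(group, all_positions, genome_length, trials=1000, seed=0):
    """Permutation test (an illustrative choice, not taken from the source):
    how often is a random gene set of the same size at least as tightly
    clustered as the observed group?"""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_pairwise_distance(group, genome_length)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(all_positions, len(group))
        if mean_pairwise_distance(sample, genome_length) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    genome_length = 1_000_000                      # hypothetical genome size
    all_genes = list(range(0, genome_length, 1000))  # hypothetical gene starts
    operon_like = [0, 1000, 2000, 3000]            # four adjacent genes
    print(clustering_p_value(operon_like, all_genes, genome_length))
```

A small p-value for a functional group would suggest that its genes lie closer together than expected by chance. Other kinds of neighborhood mentioned in the text would need a different distance function in place of `circular_distance`.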

Needless to say, even if our working hypothesis is partially falsified, our approach is both precise and original enough to allow us to propose new hypotheses and discover new functions of orphan genes, for which the usual "functional analysis" approaches are ineffective. Indeed we could systematically propose explicit hypotheses about gene functions, and not, as is unfortunately generally the case, give phenomenological descriptions of disrupted mutants… As a specific outcome of this work we constructed a variety of quantitative two-dimensional gel electrophoresis protein maps of B. subtilis, E. coli and P. luminescens in several metabolic and genetic conditions. In particular, we concentrated on a part of metabolism that has been long studied, then completely overlooked, despite its major importance in the control of gene expression (metabolism of polyamines and related molecules, Sekowska et al. 1998). We discovered new protein complexes in the cell that were a priori unexpected, comprising proteins with apparently unrelated activities. As for 2-D gel protein spots, the proteins in the complexes were identified by mass spectrometry, initially through a collaboration with the Ecole Supérieure de Physique et Chimie Industrielles, and at the Technology Platform 3 of the Institut Pasteur. When possible, maps were systematically related to morphological analyses of the cell using GFP fusions and fluorescence microscopy. We chose functions identified as of prime importance for linking the chromosome structure to the cell's architecture among those that might be linked to pathogenicity. Data are being made available via specialised databases on the World-Wide Web.

The sequencing of the first genomes revealed an astonishingly high proportion of genes of unknown function (up to 50 % of the genes). This suggests that the observations made under laboratory conditions cover only a small part of the natural environmental conditions encountered by living organisms. It is fundamental to consider of primary importance the biotope of the organism, and to study this organism under conditions which reflect as much as possible those of the biotope. It is also interesting to choose organisms which play a role in human health and the food industry. Another point of major importance, which will be documented below, is that knowledge increases in an immense way when one becomes able to compare genomes with counterparts that are not too distant, phylogenetically speaking. This is often the only way to identify the elusive signals that control gene expression. Finally, it was useful to compare the data in the process of acquisition with those previously available for the organism.

The particular feature in the study of bacteria is the systematic practice of pure cultures, either in liquid cultures or on Petri dishes. As is well known, model organisms for bacteria are Escherichia coli and, to a lesser extent, Bacillus subtilis (Kunst et al. 1997), and only those. But it is also well known that, in the wild, E. coli does not generally make pure cultures but, rather, interacts with many other organisms. In a way the situation in the laboratory is an experimental artefact… This is less so for B. subtilis, which is often found as a pure culture, in the process of flax retting for instance. It was therefore of the utmost importance to study a relative of E. coli that would, in its normal environment, make pure cultures.

Enterobacteriaceae of the genera Xenorhabdus and Photorhabdus are potent insect pathogens. These are highly pathogenic after inoculation but usually not after ingestion, in contrast with Bacillus thuringiensis (a very close relative of B. cereus). Their natural vectors are nematodes belonging to the families Steinernematidae and Heterorhabditidae. In nature, Photorhabdus bacteria live as symbionts in the guts of nematodes that invade insects. Once inside an insect host, the bacteria are released from the nematode, kill their host, and set up rounds of bacterial and nematode reproduction, digesting the entire insect, which makes food for large numbers of nematodes. Curiously, the insect corpses left behind, which do not rot, glow in the dark as the microbes produce light in addition to potent insecticides.

After numerous tests, carried out world-wide, these nematode-bacteria couples are considered as efficient helpers in the fight against insect pests of crops. They have broad targets, the most sensitive ones being Coleoptera and Lepidoptera, while Diptera are more resistant. It is therefore important to explore their potential as agents against mosquitoes, which spread the deadly Dengue fever in Southeast Asia. Their harmlessness for vertebrates is another advantage of these nematode-bacteria couples, which are already tested in research and development, and commercialised (Biosys, Koppert NL.). Little is known about their genetics, besides their strong kinship to E. coli, the model organism for Enterobacteriaceae.

From our point of view, it is remarkable that these bacteria form pure cultures both in the nematode (where in fact they do not multiply, but where they are found almost to the exclusion of other bacteria) and in the insect they kill. This indicates an original mode of life where many factors have to be accounted for: cooperation between cells (quorum sensing), the synthesis of many antibiotics (antibacterial agents and fungicides, used for the killing of potential competitors), and the synthesis of insecticides. Moreover, these bacteria possess an efficient general protein secretion system. These bacteria are in symbiotic interaction with nematodes that are strongly related to the model nematode Caenorhabditis elegans (the genome of which is now completed). This is a further positive factor for understanding symbiotic interaction at the genetic level. Finally, luminescence in Photorhabdus is a convenient marker for virulence, making it easy to follow certain aspects of this complex process.

After completion of their genome sequence by the Laboratory of Genomics of Microbial Pathogens at the Pasteur Institute (see GOLD for monitoring genome programmes) in Paris from a strain isolated by the Laboratory of Comparative Pathology in Montpellier, these bacteria constitute an efficient tool for the study of unknown functions in Enterobacteriaceae (in particular the process of colony formation), but also the synthesis of toxins, the secretion of proteins, and the synthesis of antibiotics. Finally, these bacteria seem to be an ideal paradigm for the study of nematode-bacteria interactions, allowing one to test new genetic methods, applicable afterwards to more complicated models such as mammals, or in general to pathogenic host-vector interactions.

An attempt to create a group devoted to the study of this organism has therefore been set up at the Centre, initially with the help of a scientist (Christian Tendeng) employed by the French Ministry of Foreign Affairs. Later on a senior scientist was recruited to organize and develop work on this topic.

What is known about Photorhabdus and Xenorhabdus

These bacteria present numerous advantages since they synthesise metabolites and proteins of interest for the agro-food and pharmaceutical industry:

Insecticides

Photorhabdus has at least one insecticidal protein complex. These toxins are active after ingestion and injection. Some of their targets seem to be the epithelium of the insect gut.

Antimicrobial agents

As expected from their life cycle, the synthesis of antimicrobial agents is a characteristic feature of both bacterial genera. Different antibiotic families are represented (including sulfur-containing molecules): xenorhabdins and indole derivatives are found in Xenorhabdus. Photorhabdus produces two other chemical families: hydroxystilbenes and polyketides (the latter correspond to colored pigments, anthraquinones). These are broad-spectrum antibiotics especially active on Gram-positive bacteria but also on Gram-negative bacteria, while others act as fungicides or immunomodulators. Many genes corresponding to these families are visible in the genome sequence (preliminary data).

At least one bactericidal protein (bacteriocin) has been characterised. It corresponds to a phage tail with an antibiotic activity restricted to two entomopathogenic genera and to Proteus sp.

Pathology

The majority of the strains are highly pathogenic and represent, among entomopathogenic bacteria, the group which is the most pathogenic by injection. A single cell is indeed sufficient for the killing of Galleria mellonella (greater wax moth). However, an X. poinarii species exists with no larvicidal activity. In addition, like most enterobacteria, P. luminescens is subject to phase variation. These observations are consistent with the presence of pathogenicity islands in these bacteria. A few studies reported human infection with bacteria from this group.

Symbiosis

At the end of the parasitic cycle, the bacteria cause septicemia, and trigger the development, sexual differentiation and reproduction of the nematodes. However, it is not known which bacterial metabolites or enzymes are involved in this co-metabolism. These bacterial products may constitute one of the key factors for the simplified mass production of nematodes. For this reason it is extremely interesting to study intermediary metabolism of these bacteria, and to compare it to its counterpart in the E. coli model.

Exoenzymes

These bacteria secrete numerous enzymes in large amounts during their stationary phase of growth. Some of these enzymes have been partially characterised at the biochemical level: a Photorhabdus alkaline metalloprotease, a chitinase, a Tween 80-lipase, and a Xenorhabdus lecithinase-like protein. An exonuclease activity has also been observed for many Photorhabdus and Xenorhabdus strains but no further characterisation has been carried out.

Quorum-sensing and bioluminescence (see the bioluminescence Web page)

It has long been known that bacteria form colonies on agar plates. If the medium is appropriate these colonies give rise to bacterial swarming. In the late 1960s it was observed that cultures of Vibrio fischeri (a luminescent Gram negative bacterium that colonizes squid) remained non luminescent during the first hours of growth, during which time the number of cells increased. Luminescence appeared when the population reached a significant density, at a moment when bacteria ran out of nutrients. This collective behavior meant that a bacterial function was expressed at a certain cell density; the organisms in the population were sensing each other. This was termed "quorum sensing".

Bioluminescence is the best studied function of these bacteria, since the corresponding genes are used as reporter genes in many studies. Photorhabdus is the sole bioluminescent bacterium living in soil (while many luminescent bacteria are found in water, in particular sea water).

Bacterial physiology

Certain strains of the genus Photorhabdus are psychrotrophs (adapted to low temperatures). The causes for this adaptation are not known and deserve investigation. Related enterobacteria will be studied to clarify some of the issues raised by adaptation to low temperature. However, other strains can grow at temperatures up to 39 °C, in particular tropical strains.

What is known about Bacillus spp.

The following uses in part an entry written by AD for the Encyclopedia of Genetics (Academic Press).

Bacilli form a homogeneous class of endospore-forming bacteria that are generally positive for the Gram coloration test. These bacteria are chemoorganotrophs, and usually aerobic or facultatively anaerobic. The best known model is Bacillus subtilis, the genome of which has been sequenced (Kunst et al. 1997). The genome of the pathogen B. anthracis is being sequenced at The Institute for Genomic Research (TIGR), and parts of the genome of the entomopathogen B. thuringiensis have already been published. Unlike most other bacterial species, endospore-forming bacteria are highly resistant to the lethal effects of heat, drying, many chemicals and radiation. In fact, one fashionable hypothesis of the origin of life on Earth by panspermia (Svante Arrhenius, and more recently Francis Crick) rests on the notion that bacterial spores such as those of B. subtilis could travel through space and survive for millions of years. Despite its nice appeal to wild imagination, this hypothesis essentially puts the investigation of the origin of life out of our reach…

Bacillus cereus is widely distributed in nature. It is commonly found in soil, herbaceous plants, milk and dried foodstuff. It is occasionally the cause of gastroenteritis. Bacillus thuringiensis is an insect pathogen obtained from diseased larvae. It can be considered as a subtype of B. cereus, from which it differs mainly by the presence of parasporal endocytoplasmic toxin crystals. These bacteria share many common features.

The envelope of the vegetative cell

Gram positive bacteria have complex envelopes comprising one bilayer lipid membrane separating the cytoplasm from the exterior of the cell. The membrane is part of a very complex structure that comprises many layers (up to 40) of murein, or peptidoglycan, a complex of peptides containing D-amino acids (in particular meso-diaminopimelic acid) and aminosugars. The cell envelope also has several layers of teichoic acid.

The possible existence of a periplasm in Bacilli in a distinct cell compartment surrounded by the cytoplasm membrane and the cell wall is a controversial issue. Cytoplasm, membrane, and protoplast supernatant fractions were prepared from protoplasts generated from phosphate-limited cells. The protoplast supernatant fraction was found to include cell wall-bound proteins, exoproteins in transit, and contaminating cytoplasmic proteins arising through leakage from a fraction of protoplasts. By this operational definition, 10% of the proteins can be considered "periplasmic".

Sporulation

Upon starvation, B. subtilis stops growing and initiates sporulation. This developmental process involves the differentiation into two cell types. The process begins with a reorganization of the cell cycle that leads to the production of cells whose size and chromosome content are appropriate for the developmental process. The formation of the two cell types, a forespore and a mother cell twice as large as the forespore, with differing developmental fates, is the first morphological indication of the early stages of sporulation. Endospore formation is a multistep process that is common among Bacilli. This seemingly simple structure is the product of a very complex network of interconnected regulatory pathways that become activated during late growth in response to unbalanced nutritional shifts and cell cycle related signals. Sporulation starts with stage 0 (vegetative growth). Symmetrical cell division, characteristic of vegetative growth, is blocked. Instead, the cell divides asymmetrically to produce a small polar prespore cell and a much larger mother cell. During stage I, asymmetrical preseptation starts. The cellular DNA takes the shape of an axial filament. At stage II septation proceeds and the daughter chromosomes are separated. Spore development follows at stage III (engulfment of the forespore and complete separation of the spore membrane from that of the mother cell). Stage IV involves formation of the spore cortex. In stage V spore coat proteins are synthesized and assembled. At stage VI the spore becomes highly refractile under the microscope, and it acquires heat and stress resistance. Finally, the programmed death of the mother cell occurs, leading to lysis and release of the mature spore (stage VII).

The spore coat is a complex envelope comprising several layers of spore coat proteins that protect the almost entirely desiccated interior of the spore, where DNA is compacted and protected from harmful influence of the environment. Under conditions of appropriate moisture, in media that contain alanine, glucose and minerals, spores can germinate. This process involves swelling and a complex lytic process that opens and sometimes degrades the coating envelope during which time metabolism is initiated. Cells then resume normal vegetative growth.

Quorum-sensing and chemotaxis

A variety of processes are regulated in a cell-density- or growth-phase-dependent manner in Gram-positive bacteria. In the early 1990s quorum sensing was discovered in B. subtilis. It was certainly linked to sporulation (to swarm or to sporulate, that is the question), but the functional reason(s) for the existence of the process are not yet known. Most bacteria that use quorum sensing systems inhabit an animal or a plant. The microorganisms benefit from the process, but the host organism may or may not. Each bacterium produces small diffusible molecules that allow cell to cell communication. As the population of bacteria increases, so does the concentration of the signalling molecules. Sensors recognize these molecules. Once the local concentration in the medium has reached a threshold value, the sensor proteins transmit a signal to a transcriptional regulator. Examples of such quorum-sensing modes are the development of genetic competence in B. subtilis and Streptococcus pneumoniae, the virulence response in Staphylococcus aureus, and the production of antimicrobial peptides by several species of Gram-positive bacteria including lactic acid bacteria. It seems likely that similar processes exist in B. cereus and B. thuringiensis.
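The threshold behaviour described above can be illustrated with a toy simulation. This is a minimal sketch, not a model of any particular organism: it assumes logistic growth, a constant per-cell secretion rate and first-order loss of the signal, and every parameter name and value below is hypothetical.

```python
from typing import Optional


def time_to_quorum(
    initial_cells: float = 1e3,       # cells per mL at t = 0 (hypothetical)
    growth_rate: float = 0.7,         # per hour (hypothetical)
    carrying_capacity: float = 1e9,   # cells per mL (hypothetical)
    secretion_rate: float = 1e-9,     # signal units per cell per hour (hypothetical)
    decay_rate: float = 0.1,          # signal loss per hour (hypothetical)
    threshold: float = 1.0,           # signal level detected by the sensor (hypothetical)
    max_hours: float = 72.0,
    dt: float = 0.01,
) -> Optional[float]:
    """Return the time (hours) at which the signal first reaches the
    threshold, or None if it never does within max_hours."""
    cells = initial_cells
    signal = 0.0
    t = 0.0
    while t < max_hours:
        if signal >= threshold:
            return t
        # Logistic growth of the population (explicit Euler step).
        growth = growth_rate * cells * (1.0 - cells / carrying_capacity)
        # Signal is produced by every cell and lost at a constant rate.
        d_signal = secretion_rate * cells - decay_rate * signal
        cells += growth * dt
        signal += d_signal * dt
        t += dt
    return None


if __name__ == "__main__":
    print("dense culture:", time_to_quorum())
    print("sparse culture:", time_to_quorum(carrying_capacity=1e7))
```

With these assumed values, a culture that can grow dense triggers the response after a delay, whereas one capped at a low density never accumulates enough signal, which is the qualitative behaviour the text describes.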

Bacterial populations coordinate their activities. Cell-density-dependent regulation in these systems appears to follow a common theme. First, the signal molecule (a post-translationally processed peptide-pheromone) is secreted by a dedicated ATP-Binding-Cassette (ABC) exporter. The role of the secreted peptide pheromone is to function as the input signal for a specific sensor component of a two-component signal-transduction system. Co-expression of the elements involved in this process results in self-regulation of peptide-pheromone production. Peptides are secreted and processed, under various conditions that are further recognized by the cell. Next, in response to the pheromone, cells swim in a coordinated fashion, thereby forming a kind of wall surrounding rings of bacteria having the same exploration behavior.

In the same way, B. cereus is a motile bacterium, endowed with a complex flagellar machinery. This permits cells to swim towards nutrients or away from repellents.

Protein secretion

Bacillus subtilis is one of the organisms of choice in the study of protein secretion. Many fundamental aspects of this process are not yet understood. At least two systems enable proteins to be inserted into the membrane and/or to be located outside of the membrane or secreted into the surrounding medium. In B. subtilis, the Sec-dependent pathway (one that recognizes signal peptides) has at least five different signal peptide peptidases. Proteins that are periplasmic in Gram negative bacteria are also found in B. subtilis, presumably as lipoproteins (i.e. possessing a specific signal peptide, cleaved upstream of a cysteine residue that is covalently coupled to the outer lipid layer of the cell membrane upon cleavage). Comparison with the genome of B. cereus will help to understand this important process better.

Metabolism

In addition to the need for compartmentalization, living cells must chemically transform some molecules into others. Metabolism is the hallmark of life. Cells can be in a dormant state — this is the case of spores, for example — but one cannot be sure that they are living organisms unless, at some point, they initiate metabolism. In general, one distinguishes between primary metabolism (the transformation of molecules that support cell growth and energy production), and secondary metabolism (transactions involving molecules that are not necessary for survival and multiplication, but assist in the exploration and occupation of biotopes (e.g. antibiotic synthesis)).

Carbon, oxygen, nitrogen, hydrogen, sulfur and phosphorus are the core atoms of life. Electron transfers and catalytic processes, as well as the generation of electrochemical gradients, require many other atoms, in the form of ions. Metabolic processes allow the cell to concentrate, modify and excrete ions and molecules that are necessary to energy management, growth and cell division.

Nutrients and ions are transported into cells by a number of more or less specific permeases, most of which belong to the ABC permease category. In Gram positive bacteria, these permeases generally comprise a binding lipoprotein responsible for part of the specificity, located at the external surface of the membrane, an integral membrane channel made of proteins of two different types, and a dimeric, membrane bound cytoplasmic complex, that binds and hydrolyzes ATP as the energy source.

For positively charged ions, selectivity is the most important feature of the permease, because the electrochemical gradient is oriented towards the interior of the cell (negative inside). Ions must be concentrated from the environment until they reach the concentration required for proper activity, but must not reach inhibitory levels. Apart from iron (which is scavenged from the environment with highly selective siderophores synthesized in response to iron limitation), manganese is the most important transition metal ion for Bacilli. It is required for many enzyme activities (such as superoxide dismutase, agmatinase, phosphoglycerate mutase, pyrophosphatase, etc.). Copper is important for electron transfer. Cobalt is required by the important recycling protein methionine aminopeptidase. Nickel is needed by urease… Zinc is a cofactor of polymerases and dehydrogenases, and magnesium is involved in catalytic complexes with substrate in about one third of enzyme reactions. Potassium is needed to construct the electrochemical gradient of the cell’s cytoplasm, and is a likely cofactor in many reactions. Calcium is perhaps needed in major reactions during the division cycle, but the importance of this ion still remains a mystery.

Anions are also important. They have to be imported against a strong electrochemical gradient. Phosphate in particular requires a set of highly involved transport systems. Sulfate is the precursor of many important coenzymes in addition to cysteine and methionine, but not much is yet known about its transport and metabolism, except by comparison with the counterparts known to be present in E. coli. The study of sulfur metabolism is therefore a priority.

Carbon and nitrogen metabolism in Bacilli follow the general rules of intermediary metabolism in aerobic bacteria, with a complete glycolytic pathway and a tricarboxylic acid cycle. Electron transfer to oxygen is mediated by a set of cytochromes and cytochrome oxidases, allowing efficient respiration. Bacillus cereus is able to grow anaerobically with appropriate electron acceptors.

As in other living organisms, the ubiquitous polyamines putrescine and spermidine play a fundamental but still enigmatic role. They arise via the decarboxylation of arginine to agmatine, coupled to a manganese-containing agmatinase, and not from decarboxylation of ornithine, as in higher eucaryotes.

Pathogenicity

Bacillus subtilis is generally recognised as safe (GRAS). It is much used at the industrial level, both for enzyme production and for food supply fermentation. Riboflavin is derived from genetically modified B. subtilis using fermentation techniques. Traditional techniques (e.g. random mutagenesis followed by screening; ad hoc optimization of poorly defined culture media) are important and will continue to be utilized in the food industry, but biotechnology must now include genomics to target artificial genes that follow the sequence rules of the genome at precise positions, adapted to the genome structure, as well as to modify intermediary metabolism while complying with the adapted niche of the organism, as revealed by its genome. As a complement to standard genetic engineering and transgenic technology, knowing the genome text has opened a whole new range of possibilities in food product development, in particular allowing "humanization" of the content of food products (adaptation to the human metabolism, and even adaptation to sick or healthy conditions). These techniques provide an attractive method to produce healthier food ingredients and products that are presently not available or are very expensive. B. subtilis will remain a tool of choice in this respect.

Most Bacilli are innocuous, but Bacillus cereus is the etiological agent of several types of infections, including fulminant ophthalmitis. Its relatives B. thuringiensis and B. anthracis are highly pathogenic bacteria (the former for insects, the latter for animals and man). Bacillus anthracis is the agent of the deadly anthrax disease. Louis Pasteur studied the bacterium responsible for anthrax, and this led to an immunization protocol using a weakened strain of the bacterium. All his research led Pasteur to crusade for sterilization and hygiene to prevent the spread of infectious diseases. The etiology of the disease was remarkably analysed by Robert Koch. Several studies suggest that the agent has been used for warfare (see Centre for Nonproliferation Studies), and although it is not clear how much propaganda there is in the general reports and talks about this organism (see Not every truth is good to say), it is clear that it is important to understand its biology.

4. A work plan: the exploration of "neighbourhoods"

Because P. luminescens (resp. B. cereus) is related to E. coli (resp. B. subtilis), we have started its study by annotating genes in comparison with E. coli (resp. B. subtilis). This required us to give priority to genes that are present in both organisms but do not yet possess an assigned function. As an initial step we have set up specialised databases that allow one to manipulate cognate genomes in parallel. This required us to design a new data structure.

In silico exploration of neighbourhoods

To proceed rapidly towards an understanding of gene function, we used the concept of neighbourhood, for which we have developed an efficient tool, available on the World-Wide Web (Nitschké et al. 1998). The first neighbourhood (relationship) between genes highlighted by genome projects is proximity on the chromosome. From the point of view of the gene, this relationship represents its physical neighbourhood. The existence of operons and pathogenicity islands shows, at least for prokaryotes, that genes that are physical neighbours can be functionally related. Formally, this means that, for these genes, a relationship exists between two types of neighbourhood: physical and functional.
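The physical neighbourhood described above can be sketched in a few lines. This is only an illustration, not the tool cited in the text: the `Gene` record, the `physical_neighbours` function, the 100 bp `max_gap` threshold and the gene names are all assumptions made for the example, and the chromosome is treated as linear for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # coordinate of the first base
    end: int     # coordinate of the last base (inclusive)
    strand: str  # "+" or "-"


def physical_neighbours(genes, target, max_gap=100, same_strand=True):
    """Return the genes reachable from `target` by stepping between adjacent
    genes separated by at most `max_gap` bp (hypothetical threshold)."""
    ordered = sorted(genes, key=lambda g: g.start)
    idx = next(i for i, g in enumerate(ordered) if g.name == target)

    def linked(left, right):
        # Two adjacent genes are linked if the intergenic gap is short
        # and, optionally, if they lie on the same strand.
        gap = right.start - left.end - 1
        return gap <= max_gap and (not same_strand or left.strand == right.strand)

    cluster = [ordered[idx]]
    i = idx
    while i > 0 and linked(ordered[i - 1], ordered[i]):      # extend leftwards
        cluster.insert(0, ordered[i - 1])
        i -= 1
    i = idx
    while i < len(ordered) - 1 and linked(ordered[i], ordered[i + 1]):  # rightwards
        cluster.append(ordered[i + 1])
        i += 1
    return [g.name for g in cluster if g.name != target]


# Hypothetical gene map used only to exercise the function.
example_genes = [
    Gene("geneA", 100, 900, "+"),
    Gene("geneB", 950, 1800, "+"),
    Gene("geneC", 1850, 2500, "+"),
    Gene("geneD", 4000, 5000, "+"),
    Gene("geneE", 5050, 6000, "-"),
]
print(physical_neighbours(example_genes, "geneB"))
```

With these made-up coordinates, geneA, geneB and geneC form one cluster, whereas geneD is isolated by a long gap on one side and a strand change on the other. Such clusters are only candidates for operons; the functional link still has to be tested.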

The definition of "neighbourhood" can be generalised to many more types of vicinity. We have just introduced physical and functional neighbourhoods, and another interesting one is certainly the structural similarity between genes or gene products. In fact, one can use most attributes of the sequences to define clusters reflecting a particular type of neighbourhood. For example, a property as simple as the isoelectric point, which often gives a first idea of the compartmentalisation of a gene product, can be used (Moszer et al., 1995). More complex neighbourhoods have proved to produce particularly revealing results: two genes may be considered as neighbours because they use the genetic code the same way.
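The idea that two genes are neighbours when they use the genetic code in the same way can be illustrated as follows. This is a simplified sketch: it compares relative codon frequencies with a plain Euclidean distance, which is a stand-in chosen for brevity and not necessarily the statistical method used in the work described here. The function names and the short sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# The 64 codons, in a fixed order, so that frequency vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(cds):
    """Relative frequency of each codon in a coding sequence read in frame 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(
        cds[i:i + 3] for i in range(0, usable, 3) if cds[i:i + 3] in CODON_SET
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no complete unambiguous codon")
    return [counts[c] / total for c in CODONS]


def codon_distance(cds_a, cds_b):
    """Euclidean distance between two codon frequency vectors."""
    fa, fb = codon_frequencies(cds_a), codon_frequencies(cds_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def codon_neighbours(sequences, target, k=2):
    """Names of the k genes whose codon usage is closest to that of `target`."""
    others = [name for name in sequences if name != target]
    others.sort(key=lambda name: (codon_distance(sequences[target], sequences[name]), name))
    return others[:k]


# Hypothetical toy sequences, far too short for real statistics.
toy = {"g1": "AAAAAAGCT", "g2": "AAAAAAGCA", "g3": "GGGGGGTTT"}
print(codon_neighbours(toy, "g1", k=1))
```

Real genes contain hundreds of codons, and the amino acid composition of the protein should be taken into account before interpreting such distances.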

Remarkably, the results of functional analysis fit perfectly and naturally into the neighbourhood model. This is obvious for metabolic pathway identification (Sekowska et al. 2000), and it is also true for differential expression studies, using 2D electrophoresis of proteins or DNA array technology, which produce networks of co-expressed sequences that can be seen as another type of neighbourhood. In fact, the neighbourhood concept is very well suited to represent biological knowledge from the sequence standpoint. When several genes are identified as being involved in a given biological function, as soon as this result is published, one can find neighbours in the literature because they appear in the same book or article. This means that most of the knowledge concerning functional neighbourhood is almost immediately available in the literature. Systematic extraction of this bibliographic vicinity is an amazingly rich and powerful resource.
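Bibliographic vicinity, as described above, amounts to counting how often two genes are mentioned in the same publication. The sketch below assumes that the gene names found in each article have already been extracted; the article identifiers, gene names and the `literature_neighbours` function are hypothetical.

```python
from collections import Counter


def literature_neighbours(articles, target, min_shared=1):
    """Genes cited together with `target`, with the number of shared articles.

    `articles` maps an article identifier to the set of gene names it mentions.
    """
    shared = Counter()
    for genes in articles.values():
        if target in genes:
            for gene in genes:
                if gene != target:
                    shared[gene] += 1
    hits = [(gene, n) for gene, n in shared.items() if n >= min_shared]
    # Most frequently co-cited genes first, ties broken alphabetically.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical corpus used only for illustration.
corpus = {
    "ref1": {"geneA", "geneB"},
    "ref2": {"geneA", "geneB", "geneC"},
    "ref3": {"geneC", "geneD"},
}
print(literature_neighbours(corpus, "geneA"))
```

In practice the difficult step is the one assumed away here, namely recognising gene names and their synonyms reliably in free text.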

Neighbourhood, as a formal model, offers a unified representation of available data: it handles functional knowledge as well as any structural or physico-chemical properties. Heterogeneous types of information are projected as networks of sequence relationships. Comparing or bringing together data of very different types becomes possible. The most promising consequence is the possibility to imagine, design and develop tools which will support an inductive exploration of biological data, organised in appropriate database structures (Médigue et al. 1993, Moszer et al. 1995, Bocs et al. 2002). Each neighbourhood is meant to shed a specific light on a gene, and bringing together the objects of the neighbourhood is a way to look for its function. This induction, as opposed to the usual deductive reasoning, results in fundamentally new findings.

Identification of relationships between different types of neighbourhood is a result of this exploration. Systematic and statistical analysis of these relationships can subsequently be used to derive predictive methods. This is precisely what is done currently when sequence similarity (structural neighbourhood) is used to transfer functional information between sequences (functional neighbourhood). The great advantage of using other types of neighbourhood is, on the one hand, to get functional hints in the absence of sequence similarity and, on the other hand, to bring complementary evidence when sequence similarity does not provide clear-cut answers.
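One simple way to combine several neighbourhoods into a functional hint is to count how many independent neighbourhood types point to the same function. The sketch below is an assumption made for illustration, not a published scoring scheme: the neighbourhood names, gene names, functional labels and the `suggest_functions` function are all hypothetical.

```python
from collections import defaultdict


def suggest_functions(target, neighbourhoods, known_functions):
    """Rank candidate functions for `target` by the number of neighbourhood
    types in which an annotated neighbour carries that function.

    `neighbourhoods` maps a neighbourhood type to {gene: [neighbouring genes]}.
    `known_functions` maps an annotated gene to its functional label.
    """
    support = defaultdict(set)
    for kind, neighbours_of in neighbourhoods.items():
        for neighbour in neighbours_of.get(target, ()):
            function = known_functions.get(neighbour)
            if function is not None:
                support[function].add(kind)
    ranked = [(function, sorted(kinds)) for function, kinds in support.items()]
    # Functions supported by more neighbourhood types come first.
    return sorted(ranked, key=lambda item: (-len(item[1]), item[0]))


# Hypothetical data used only for illustration.
example_neighbourhoods = {
    "physical": {"geneX": ["geneA", "geneB"]},
    "codon_usage": {"geneX": ["geneB", "geneC"]},
    "literature": {"geneX": ["geneA"]},
}
example_functions = {
    "geneA": "polyamine metabolism",
    "geneB": "polyamine metabolism",
    "geneC": "sulfur metabolism",
}
print(suggest_functions("geneX", example_neighbourhoods, example_functions))
```

The output is a list of hypotheses to be tested experimentally, which is the inductive use of neighbourhoods described in the text, not a proof of function.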

This approach is of great benefit for the growing field of genome comparison: first, because sequences with similar functions can be identified with more accuracy; second, because neighbourhoods will help to address the problem of discriminating between orthologs and paralogs; finally, because the concepts and tools developed to bring together and to compare different neighbourhoods can be used and extended to compare a given neighbourhood in different organisms.
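As an illustration of how a neighbourhood can help to separate orthologs from paralogs, the sketch below prefers, among several homologs found in a second genome, the one whose chromosomal neighbours are themselves homologous to the neighbours of the query gene. This conserved-context criterion is one possible heuristic among others; the gene orders, the homology table and the function names are hypothetical.

```python
def flanking(order, gene, flank):
    """Genes located within `flank` positions of `gene` in a linear gene order."""
    i = order.index(gene)
    return order[max(0, i - flank):i] + order[i + 1:i + 1 + flank]


def context_score(gene_a, gene_b, order_a, order_b, homologs, flank=2):
    """Number of neighbours of gene_a having a homolog among the neighbours of gene_b."""
    near_b = set(flanking(order_b, gene_b, flank))
    return sum(
        1 for g in flanking(order_a, gene_a, flank) if homologs.get(g, set()) & near_b
    )


def likely_ortholog(gene_a, order_a, order_b, homologs, flank=2):
    """Among the homologs of gene_a in genome B, pick the one with the best-conserved context."""
    candidates = sorted(homologs.get(gene_a, set()))
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: context_score(gene_a, c, order_a, order_b, homologs, flank),
    )


# Hypothetical gene orders and homology relationships (genome A -> genome B).
genome_a = ["a1", "a2", "a3", "a4", "a5"]
genome_b = ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"]
homology = {"a2": {"b2"}, "a3": {"b3", "b7"}, "a4": {"b4"}}
print(likely_ortholog("a3", genome_a, genome_b, homology))
```

Here a3 has two homologs in genome B; b3 sits between the homologs of a2 and a4 and is therefore the better candidate ortholog, while b7 is more likely a paralog. Genome rearrangements can defeat this heuristic, so it brings complementary evidence and not a final answer.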

In addition to an important contribution to the understanding of the general gene organisation and function in enterobacteria (including the model organism E. coli), the study of Photorhabdus is expected to lead to a large number of applications, in particular in the following domains:

• characterisation of insecticidal toxins, in particular in the fight against mosquitoes (that cause dangerous diseases such as dengue fever in the region)
• production of antibiotics (including fungicides)
• protein secretion (including potential secretion of products of high pharmaceutical value).

Despite its limited human resources in this programme, the role of the Centre in Hong Kong has been to provide some of the in silico, in vivo, and in vitro annotation required to set the stage for these studies, so that in-depth investigation of genomes can start at appropriate public or private places, preferably in an interactive way.

Polyamines and sulfur metabolism

Among the genes revealed by the in silico analysis described above, some have been inactivated or expressed in a controlled manner by homologous recombination. Our work was restricted to robust fragments of metabolism or control, in which hypotheses can be validated with strong certainty. As a first step, we have been concentrating on the metabolism of polyamines (because it is ancestral and universal, but quite diverse, and linked to nucleic acid structures). As an example, we have discovered arginine decarboxylase, S-adenosyl-methionine decarboxylase, and the putrescine metabolism pathways in B. subtilis. Generally, we studied operons having a fragmented structure in one of the models and being continuous in the other (this is the case of arginine and polyamine metabolism). In the same way, we inactivated control genes having a specific effect on DNA bending and supercoiling. These genes were chosen because they strongly differ in Gram+ and Gram– bacteria. At this step, appropriate cassettes were used to allow visualisation in optical and electron microscopy (GFP and histidine-tail hybrids). Sulfur metabolism is intimately linked to oxygenases (mono- and dioxygenases), and we are placing special emphasis on this class of enzymes, in collaboration with laboratories at HKU and IP Paris.

We are systematically studying heterologous enzyme complexes (focusing on "impurities" during purification). Furthermore, we analyse the oligomer organisation of the proteins, looking for structures that permit the formation of plane layers. We continuously annotate the genomic text as a function of these results (going back to in silico analysis) and study the distribution of genes corresponding to common distinctive features of their products. When needed, we raise antibodies against these proteins, and purify their complexes on immuno-affinity columns, but in general we combine reverse genetics, which permits us to construct GFP fusions, with purification and immunodetection.

Among the physical principles that may organise the cell's architecture we hypothesise that plane layers play a major role. This is being investigated at the ultrastructural level.

Global functional genomics: the transcriptome and the proteome

To benefit from the knowledge of the whole genome sequences we initiated global studies of gene expression. This makes use of techniques permitting one to analyse transcription of the whole set of genes (expression profiling, or transcriptome analysis) as well as of the whole set of gene products (proteome analysis).

The number of genes to be considered is of the order of 4,000-5,000. This number is perfectly compatible with filter (membrane) technology using PCR products (DNA arrays). As a first step, we preferred this technology to the microchip technique, which is more expensive and does not allow the same slide to be reused for several experiments, hence lacking a crucial control. Expression profiling technology is more difficult to develop for procaryotes than for eucaryotes, because mRNAs are not polyA-tailed. In addition, RNA extracts consist mostly of ribosomal RNA, leading to a background level when hybridisation is made directly with the RNA preparation (see discussion). We have therefore created a new process for extracting RNA from bacteria, and preparing the corresponding cDNA. We used primers internal to the genes and reverse transcriptase, yielding one cDNA copy per RNA molecule present in the RNA preparation. This cDNA preparation is hybridised to the filter, and quantitated using a phosphorimager. In parallel, we are collaborating with IP Paris on the development of a database structure to correlate with the genome databases (Colibri and SubtiList, of the "GenoList" class, see below). The corresponding specialized database, SubScript, is built with the BACELL consortium.

The mutants are characterised at the protein level using 2-D gel electrophoresis (with training of scientists and research assistants at the IP Paris). Protein spots are identified by mass spectrometry (MALDI-TOF and Nano-Electrospray: 90% can be identified using proteolytic peptide maps and genome databases), and used to enrich our databases. We are improving our electrophoresis techniques along three main lines: gels for large scale purification of certain protein spots, building up and enrichment of 2-D gel databases coupled to genome databases, and identification of low concentration spots by MS (in collaboration with the Laboratory of Neurobiology and Cell diversity at the ESPCI in Paris).
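The identification of a protein spot from a proteolytic peptide map and a genome database can be sketched as follows. This is a deliberately simplified illustration, under several stated assumptions: digestion is modelled with the usual trypsin rule (cleavage after K or R, except before P) with no missed cleavages, neutral monoisotopic masses are compared directly (a real MALDI-TOF spectrum reports protonated ions), modifications are ignored, and the protein names, sequences and tolerance are hypothetical.

```python
WATER = 18.01056  # monoisotopic mass of H2O, added once per peptide

# Monoisotopic residue masses of the 20 standard amino acids (in Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def tryptic_digest(protein):
    """Cut after K or R unless the next residue is P (no missed cleavages)."""
    peptides, current = [], ""
    for i, aa in enumerate(protein):
        current += aa
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and following != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def rank_proteins(observed_masses, proteome, tolerance=0.2):
    """Rank proteins by the number of observed masses explained by their peptides."""
    scores = {}
    for name, sequence in proteome.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            1 for m in observed_masses
            if any(abs(m - t) <= tolerance for t in theoretical)
        )
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical two-protein "proteome" and a simulated peptide map.
toy_proteome = {"protX": "AKGGR", "protY": "MWK"}
simulated_map = [peptide_mass("AK"), peptide_mass("GGR")]
print(rank_proteins(simulated_map, toy_proteome))
```

In a real experiment the candidate list is the complete set of predicted proteins of the genome, which is why the quality of the genome annotation directly conditions the rate of identification.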

5. Bacterial sequence management and genome annotation

Functional genomics requires, in parallel with in vivo and in vitro experiments, an important effort in bioinformatics, in silico analysis. We are therefore organising data and annotation of the Photorhabdus luminescens and Bacillus cereus genomes in such a way as to help to manage sequence data, and help to predict and discover gene functions, taking into account the knowledge of the genome sequences as they are being collected.

To construct the P. luminescens and B. cereus specialised databases, it is necessary to load the sequence data and their corresponding annotations into the GenoList engine. This input required the construction of a generic procedure (elaborated within the process).

Genome annotation is being developed at the Genetics of Bacterial Genomes Unit through the collaborative effort of

We expect that our study will improve our knowledge of enterobacteria, in particular in relation to their reactions to stressful conditions. This will be extremely useful because these bacteria form a large cohort of human and animal pathogens.

In this respect, knowing more about their sulfur and polyamine metabolism has helped in defining new targets for specifically designed chemicals (synthetic artificial antibiotics). It should indeed be noticed at this point that a lead was followed by several companies in this direction, but that inaccurate experimental results diverted their effort to other leads.

7. Extension to the human genome

The sequencing of the human genome is proceeding at a very fast rate. This means that 3 billion base pairs are evaluated by the community of scientists. However, our experience is that identifying genes and their functions is much more tedious, inventive and time-consuming than determining a genome sequence (as a matter of fact, the study of one single gene may use tens of man-years of work). This can be easily seen even with viruses with short genomes: bacteriophage lambda was sequenced almost twenty years ago, and HIV fifteen years ago, and we are far from understanding everything that should be known about these model viruses. It is therefore important to try to help the functional annotation of the human genome. This is all the more important because, of course, experiments in genetics cannot be performed on man (if only for technical reasons, and of course for ethical reasons). How can bacteriology be helpful? We think that several paths can be followed.

A first one is the study of bacterial genes having counterparts in the human genome. For example, the cyaY gene of E. coli (the counterpart of which may exist in P. luminescens, a much better model than E. coli for the study of complex interactions) is the homolog of the frataxin gene, the gene implicated in Friedreich ataxia, a disease that is not at all understood. We could of course start to analyse the corresponding function in our models (remembering in particular that mitochondria are the progeny of symbiotic bacteria). Other examples exist, in particular in metabolism, that may be worth analysing.

The most promising avenue for collaboration between human and bacterial genomics might stem from the unraveling of pathways in bacteria. Indeed, much remains to be known in these organisms, and we have chosen to investigate polyamine and sulfur metabolism precisely because of this lack of sufficient knowledge. But there is a fundamental reason that links bacteria to higher eucaryotes: mitochondria have bacterial ancestors. And, in fact, because of the deleterious effect of Muller's ratchet (accumulation of deleterious mutations in the absence of sex), most bacterial genes found in the original symbiont had to migrate to the host nucleus.

The GenoStar™ consortium

Geno*™ is a French consortium for the development of an informatics platform for exploratory genomics.