Teach Yourself Genomics

Note that this page is under permanent construction (2000-2003). Genomics is a very fast changing field, and the landscape will be entirely modified in a few years time.

Frequently asked questions about genes and genomes

A Brief History of the Human Genome Initiative

Genome Sequencing

DNA Chips and Transcriptome Analysis

2D Electrophoresis Gels and Proteome Analysis


In silico Genome Analysis (BioInformatics)

Genomics sites

A Brief History of the Human Genome Initiative

The history of the Human Genome Initiative is deeply imbedded in sociology, psychology, economy and politics. One can find several accounts of its course, and it is probably too early to have a clear picture of the situation of the revolution it is bringing about. The various agencies involved in its setting up have their own historical account.

Remarkably (and in a way that is generally not understood), this programme probably started as a collaboration between the USA and Japan, immediately after atomic bombs crushed Hiroshima and Nagasaki, when questions were asked in the early fifties about the future generation of people born from highly irradiated parents. The Atomic Bomb Casualty commission, which was later included in the activities of the Departement of Energy, initiated studies which played an important role at the onset of the Human Genome Program. This question has been recently asked again, and, of course, knowledge of the human genetic polymorphism brought about by the sequencing of Single Nucleotide Polymorphic (SNP) sites will boost again investigating the question. A personal view of this programme, as a technical feat as opposed to a concrete hypothesis-driven scientific programme can be found in The Delphic Boat, what genomes tell us, Harvard University Press, 2003.

The term Genomics appeared in the literature in 1988, but it was probably used well before that time.

Genome Sequencing

A general description exists in French in La Barque de Delphes, some is translated and adapted here. An adaptation of this book is published at Harvard University Press, under the title The Delphic Boat.

For a general strategy for sequencing bacterial genomes see here.

DNA (standing for DeoxyriboNucleic Acid) is made of the chaining together of four similar chemicals named nucleotides (or abbreviated into the common but somewhat misleading expression "bases"). Four types of nucleotides, noted A, T, G and C form the DNA molecule. The DNA molecule is made of two strands twisted around each other as in an helical staircase:

The chain makes a long thread, which can be described as an oriented text, for example: >TAATTGCCGCTTAAAACTTCTTGACGGCAA> etc. which faces the complementary strand

This is because the building blocks (which can be referred to as "letters") are of four, and only four, types, and that A always faces T and G faces C.

Genomic DNA is a long molecule, composed of hundreds of thousands, millions or even billions of bases. Its chemical properties are original (they are those of a long thread-like polymer), and this allows DNA to be fairly easily isolated.

Isolating DNA

Cells are disrupted by physico-chemical means and DNA is easily separated from the bulk because it is insoluble in alcohol or other organic solvents. The difficult steps in isolating DNA is to break up the cells, without altering the integrity of DNA (it is an extremely thin molecule, the double helix cylinder having a diameter of about 2 nanometers, i.e. 2 millionth of a millimeter). In fact, DNA can easily be seen as a thread-like molecule with the naked eye, and any type of strong stirring or frequent pipetting will shear this fragile polymer. Separation from proteins is achieved by adding enzymes (proteases, such as those found in washing machine powders) which dissolve proteins into water-soluble pieces. Further solvent extraction allows one to obtain DNA precipitated in 70% alcohol. It is collected by centrifugation and thoroughly washed with a water-alcohol mixture containing appropriate chemicals meant to inactivate the agents which might degrade DNA. The protocol is delicate, but it can be performed even in an hotel room (this has happened!) with very simple equipment. DNA can be kept as such in an ordinary refrigerator.
With the advent of fast sequencing as well as fast and heavy duty computing facilities, it became possible to just fragment a genome in a random fashion, to submit each fragment of a library to sequencing. This is called "shotgun" sequencing.

The Polymerase Chain Reaction (PCR)

For an interesting history of the PCR, read the excellent book by Paul Rabinow "Making PCR".

The usual physico-chemical systems we are familiar with differ from living organisms, because the latter multiply by reproduction of an original template. In nature, dispersed systems tend to be found at lower and lower concentration from the original dispersion point as time elapses. In contrast, a single living cell, that is allowed to grow in a given environment, can multiply at a considerable level. In particular, as soon as a DNA fragment is cloned into a replicative element, it can easily yield 106, 109 or even 10^12 copies of itself. It is this fundamental property which explains that one successfully determined the sequence of many genomes. This is also the basis of the (not so irrational) fear of the public for dispersion of genetically modified organisms (GMOs). But although the techniques of in vivo molecular cloning are indispensable to research in molecular genetics, they are techniques tedious and difficult to put into practice. They are not always possible to implement, and they are time consuming (one must isolate the fragments to be cloned, incorporate them into replicative units, then place these units into living cells that one must subsequently multiply). Fortunately there exists, in particular for sequencing purposes, an extremely efficient technique, Polymerase Chain Reaction (PCR), that permits amplification in vitro and no longer in vivo of any DNA fragment that can be obtained very rapidly in considerable quantity.
With this technique, the fragment to be amplified is incubated in the presence of a particular DNA polymerase which is not inactivated at high temperature, together with the nucleotides (the basic DNA building blocks) necessary to the polymerization reaction, and with short fragments (oligonucleotides) complementary to each one of the extremities of the region to amplify. The trick of the method is to use a polymerase that is stable at high temperature (isolated from bacteria growing at very high temperature, up to 100°C and more), and to repeatedly alternate cycles of heating and cooling of the mixture. Typically, when the DNA double helix is heated up, its two strands separate. One then cools down the mixture in the presence of a large excess of oligonucleotides (synthesized chemically and present in large excess in the assay mixture) that will each hybridize to the sequence they must recognize at the extremity of the strand to amplify. In the presence of the polymerase and of nucleotides, these oligonucleotides behave as primers for polymerization (at the chosen temperature sufficiently low to allow for hybridization). One subsequently heats up the mixture to separate the newly polymerized chains, then cools it down. The newly synthesized strands can therefore become new templates, and the cycle repeats itself.
One thus understands that, in principle, one doubles at each cycle the quantity of fragment obtained. At this rate, ten cycles would amplify the fragment one thousand fold (210 times). Forty cycles or so give therefore a very significant quantity of the fragment (there is not an exact doubling at each cycle, of course, and the value obtained, although very high and visible on a molecular size separation gel, will be somewhat lower than that predicted by a doubling at each cycle, 240: one thousand billion times). This method is extremely powerful since it can amplify very low quantities of DNA (theoretically only one starting molecule is enough). Since its invention in 1983, until the beginning of the nineties, PCR could only yield short fragments (a few hundred base pairs), and it displayed a high error rate (of the order of one wrong base per one hundred), but the use of new DNA polymerases, together with better tuned experimental conditions have considerably improved its performance. Mid-nineties it was thus possible to obtain with polymerization chain reaction, amplification of up to 50 kb-long segments, with a very low error rate (one in ten thousand base pairs). It remained however to be carefully verified that these fragments did not contain short deletetions, following an accidental "jump" of the polymerase on its template.
Thus, PCR is a rapid and cheap technique very easy to put into practice, that permits replication of a DNA fragment in a test tube, one million times or more. Because of its simplicity PCR is important, for in particular it can be performed by practically anybody, and amplify without previous purification a DNA target present in very low quantity in almost any medium, a billion times in a few hours. This technique had therefore a huge impact in domains as varied as identification of pathogens, study of fossil species disappeared for millions of years, genetic diagnosis or crime detection (forensic investigations). For our problem, the sequencing of genomes, PCR permits amplification of genome sequences in regions that are impossible to amplify by cloning. In contrast to DNA obtained by PCR, independent of life, DNA clones that multiply in cell lineages may, as we observed in authentic situations, result in toxic properties. For example the DNA of the bacteria that are used to make natto (fermented soy beans) in Japan, Bacillus subtilis, is very toxic when it is cloned in a vector replicating in the universal host for sequencing programs, Escherichia coli. The underlying biological reason is simple, as it has been discovered during the sequencing of the B. subtilis genome: the signals that indicate that gene can be expressed into proteins (the transcription and translation starts in B. subtilis) are recognized in such a strong and efficient way in E. coli, that they override the essential syntheses of their host in profit to those of the segments cloned in the vector, leading the host to its premature death.


Sequencing DNA supposes amplification of the fragments which will be replicated by DNA polymerase in the presence of appropriate nucleotide markers (radioactive or fluorescent). Amplification can be performed in vitro, using PCR, but the most common way to amplify a DNA fragment is to place it into an autonomous replicating unit, which, placed in appropriate cells, is amplified as cell multiplies while it is carried over and copied along the generations, and found in the progeny as a colony, a clone (hence the word "cloning" to illustrate the process).

Making libraries


It is the possibility to perform in vitro what is usually going one in vivo that gave to Frederick Sanger, from the Medical Research Council in Cambridge in England, the idea that one could alter the process in such a way as to determine the succession of the bases of a template DNA. Sanger's simple, and for this reason bright, idea is that it would be enough to start the reaction always at the same place, and then to stop it precisely but at random times after each base type, in a known and controlled fashion, to obtain in a test tube a mixture comprising a large quantity of sequence fragments, each starting at the same position in the newly polymerized strand, but ending randomly after each of the four base types. For example, let us consider the following sequence:
If one knew how to stop the reaction specifically and randomly, after each A one would get in the test tube the following fragments:
In the same way, after each C:
and so on. And if one knew how to screen these fragments simply using their mass (their length), then to place them in a row according to their increasing or decreasing length, then one could construct four different scales corresponding to each type of base at which each step would stop specifically a after a given base type (A, T, C, or G). Placed side by side (if the screen of each base type specific reaction was performed in parallel experiments) they would thus provide directly the studied sequence (telling that the fourth base is an A, the third a C, etc), and this without asking the screening process to know the nature of the considered base but only the length of the fragments that contain it.
Frederick Sanger is an excellent chemist, and he thought of a simple idea to construct in practice this mixture of fragments, interrupted precisely after each of the base types. As a start point, he synthesized chemically a short strand, complementary to the beginning of the strand that he would like to sequence. This is the primer, which will be in common with all newly synthesized strands, making them all start at exactly the same position. Then, to stop the polymerization reaction after a given base type, he used chemical analogs of each nucleotide, in such a way as they are recognized by DNA polymerase in a normal and specific way (they base pair with a base in the complementary strand with the normal pairing rule). These analogs are chosen so that they are incorporated in the strand in the course of normal DNA synthesis, but their structure is such that they prevent polymerization immediately after they have been incorporated into a nascent polymer. Finally, to obtain a statistical (random) mixture of fragments synthesized in this way (one must obtain all possible fragments interrupted after each base type), he mixed up the normal nucleotide (which, of course, does not stop synthesis after it has been incorporated in the growing polymer) with the chain terminator nucleotide, in proportions carefully chosen to let the reaction proceed for some time before being stopped, and give a mixture of all types of fragments, containing long fragments as well as short fragments. This method is named after the chemical nature of the analogs ("dideoxy nucleotides"): Sanger method, chain termination method or DNA polymerase dideoxy chain terminator method.
Thus, by using an analog of A (here noted A*), one obtains, for example, a mixture of the fragments shown above, with sequence GGCATT as a primer, and wishing to copy the following strand:
(which should be written
when read from right to left, to make clear the complementarity rule between the bases of each strand):

And the experiment will consist in performing in a concomitant way the same reaction with the other analogs, G*, C* and T*.
The subsequent step consists in separating the different fragments in the mixture, and to identify them. Separation is easily performed using electrophoresis, by making the mixture migrate in a strong electric field, through the mesh of a gel (this is the fragment separation gel, in short the "sequencing gel") having gaps of a known average size. Migration occurs because of the strong electric charge of the fragments (DNA is a highly charged molecule, with a charge directly proportional to the number of bases it comprises). Practically, one loads a small volume of the mixture at the top of a gel on which is applied an electric field. The different fragments will then move snake-like by reptation inside the gel. And one easily understands that the shorter fragments having less difficulty to go through the gel pores will move faster than the longer ones. One will therefore obtain after some time a ladder separating fragments according to their exact length. The subsequent step will be to identify the place where they are located, to place them with respect to each other.
In the first experiments one used a process — which is still in use today — to label either the primer used for polymerization, or the nucleotides as they are incorporated into the newly synthesized fragment, with a radioactive isotope. As a consequence, the fragments in the gel were radioactive. It was therefore easy to reveal their presence by placing the gel on a photograph plate and detecting the black silver bands in the resulting photograph. To allow detection of all four nucleotides this experiment asked for the running of four separate mixtures coming from four different assays in different test tubes in different lanes in a gel. One thus obtained four mixtures where the fragments were ending at each of the nucleotides A, T, G and C. The mixture then ran side by side. After photograph development, this gives rise to four sequences of black bands where one can directly read the sequence. It is this image of a ladder made of parallel black bands that was interpreted by many journalists as the (rather misleading and difficult to understand) metaphor of a "bar code" as representing fragments of chromosomes. In a standard experiments, for gels 50 centimeters long, one reads 500 bases, with an error rate of about 1%. This has as a consequence that it is usually difficult, and also long and costly, to determine the sequence of a long fragment of DNA with an error rate lower than one in ten thousand.

Of course labeling can be performed using fluorescent chemicals, and detection of the bands uses lasers and light detectors.


The "shotgun" sequencing step produces thousands of sequences which, once correctly assembled, make up the whole original sequence. Suitable algorithms compare the sequences in pairs, looking for the best possible alignment, then identifying overlapping regions and matching them to reconstruct the original DNA. Early programs, developed at a time when computers were not powerful enough to envisage assembling large numbers of sequences, used algorithms which compared sequences in pairs. However, random sequencing of large genomes involves assembling not just thousands of sequences but tens of thousands, even millions, and with the need to handle these increasing numbers of sequences, it became clear that the methods available were not specialized enough. A point was reached where the computational side was holding sequencing projects back, and the biologist had to make up for the inadequacy of the processing by spending time “helping” the assembly programs. In certain cases reconstituting the whole text from the text of fragments can be not merely difficult but even impossible, particularly when the text includes long repeated regions, a common occurrence with long segments or whole genomes.
All this led several laboratories to develop new algorithms, bringing together the biologist’s knowledge, information about the nature of each of the fragments to be sequenced, (such as the sequencing strategy used for a given segment of DNA, the source, and the approximate length), and pairwise or multiple alignment algorithms. As a final stage, so that users can verify the result of the assembly procedure and spot possible errors or inconsistencies, graphic interfaces are used to view and correct the assembly if necessary. This is done either as a whole, by comparing the different segments, or locally, by correcting the raw results of sequencing each fragment. However the aim of using computational methods is to reduce the user’s intervention to a minimum, producing a correct final result directly. An enormous amount of work has been put into this since the middle of the 1990s, and it was crowned with success when the first draft sequence of the human genome was completed in the year 2000.

DNA Chips and Transcriptome Analysis

The interest for DNA chips is that they provide large scale information on thousands of genes at the same time. This very fact implies some sort of statistical reasoning when studying the intensities of the spots analyzed.

1. What kind of intensity distribution should be expected? A general theorem of statistics states that the outcome of a series of multiple additive causes should result in a Laplace-Gauss distribution of intensities. Clearly, transcription is not additive, but, rather, multiplicative (each mRNA level comes from the multiplication of many causes, from the metabolism of individual nucleotides to the compartmentalization in the cell). One expects therefore that the logarithm of the intensities should display a Laplace-Gauss distribution ("log-normal" distribution). This is indeed what is observed. A first criterion of validity of an experiment with DNA arrays is therefore to check that the distribution is log-normal. In fact the situation is more complicated (and more interesting) and deviations from the Laplace-Gauss log-normal situation can be exploited to uncover unexpected links between genes that are expressed together. This is the case, in particular, when sets of genes behave independently from each other.

2. Many causes can affect transcription efficiency. This has the consequence that hybridization on DNA chips is highly sensitive to experimental errors. It is therefore necessary to have a significant number of control experiments, which may identify some of the most frequent experimental errors. It should be noted that these errors are biologically significant and tell something on the organism of interest. It is therefore of the utmost importance to try to differentiate, in a given experiment, what pertains to this kind of systematic error, and what pertains to the phenomenon under investigation. An example of such study is presented in an article from the HKU-Pasteur Research Centre.

A general library of references on the use of DNA chips is provided here.

Work on the ubiquitous regulator H-NS illustrates how transcriptome analysis may be used to better understand the role of a regulator located high in the control hierarchy of gene expression in bacteria.

2D Electrophoresis Gels and Proteome Analysis

pI = 4 Separation by electric charge pI = 7

The cell is broken and its protein extracted in a soluble form.

In a first step they are separated according to their electric charge (isoelectric point, pI). Then the first dimension gel is deposited horizontally on top of a mesh that will separate the proteins according to their molecular mass

Finally the proteins are stained using silver nitrate. They appear as black spots, more or less intense, according to the total amount of the protein present in the gel.

Proteins are identified by various methods, mass spectrometry in particular.














The total set of the proteins in a cell is named its "proteome" (a word build up in the same way as "genome"). 2D gel electrophoresis is a quantitative way to display the part of the proteome of a cell that is expressed under particular conditions. Usually, most of the proteins displayed are in a narrow range (slightly acidic pI = 4 to pI = 7) and for molecular masses higher than 10 kilodaltons (i.e. more than about 100 amino acids). Small proteins, and basic or very acidic proteins are difficult to visualize with this technique. Moreover, for the time being, it is still difficult to extract and separate in a reproducible way the proteins that are imbedded in biological membranes. Nevertheless proteome analysis by 2D gel electrophoresis is still extremely important. Not only does this technique give the final state of gene expression in a cell, but it also displays proteins as they are found, i.e. after they have been modified post-translationally.

Work on the ubiquitous regulator H-NS illustrates how proteome analysis may be used to better understand the role of a regulator located high in the control hierarchy of gene expression in bacteria.


Living organisms continuously transform matter chemically. As a matter of fact, this is a way to ensure that a system is alive. A spore, or a seed, will only be considered as alive when it germinates, i.e. at a moment when it will transform some molecules into other molecules. This is the basis of metabolism. There is an intermediate stage, between life and death, named "dormancy", but this is only understood to be a true state after the fact, that is, after the moment when some metabolic property has been witnessed.

In silico Genome Analysis (BioInformatics)

Some Major Genomics Sites and Supporting Agencies

The US Department of Energy

The US National Institutes of Health

The European Union Directorate General XII

Japan Science and Technology Corporation ( JST)

The Sanger Centre

The Kasuza DNA Research Institute

Washington University

Oklahoma University

Baylor College

The Whitehead Institute

The Institute for Genome Research (TIGR)

Le Centre National de Séquençage (Genoscope)

The Beijing Human Genome Centre

The Shanghai Genome Centre

The International Nucleotide Sequences Database Collaboration

The Reference Protein Data Bank at Expasy (SwissProt) (mirror in Beijing)


Our Databases