Genomes and information theories

(A more complete account has been proposed in "On Genomes and Cosmologies", © MIT Press, 1996; see also "Mapping the bacterial cell architecture into the chromosome", where a first identification of the geometrical programme superimposed on the genetic programme is described.)

Genomes are analyzed as linear sequences of symbols written with a limited alphabet. Information, a generally poorly defined concept (see The Delphic Boat), is central to the alphabetic metaphor on which genomics rests. This concept goes back to the Aristotelian discussion of Being as a relationship between substance, matter and form. Substance is the being-in-itself, in which one can distinguish a determinable, matter, and a determining, form. The differentiation between individual entities was usually ascribed to specific features of form, and the study of causes called for a special process 'informing' matter. Thus, information is related not only to the specificity given by form (and DNA sequences relate to a very abstract conception of form) but also, implicitly, to causality. This may explain many of the wrong paths which have been followed when using the corresponding concept of information. In ordinary language, something quite different is meant when the word information is used: information means some kind of usefulness brought about by a given piece of knowledge about a system. Information is therefore also implicitly associated with the idea of value. We have to bear this observation in mind when considering the various theories of information which can be used to represent living organisms, and in particular to represent DNA sequences.

The invention of techniques permitting DNA sequencing has brought about a revolution in the way we consider genomes as a whole. Initially, most studies of DNA sequences dealt with a local analysis of the DNA information content. They took into account, however, the availability of homologous structures or functions from different organisms, and thus provided some general view of the global properties of genomes. In a more scattered fashion, and above all more recently, it has been proposed that genomes should be considered as global entities, and we follow this global view here. We wish to outline what could be done in the future, emphasizing the fact that computers can be used as experimental tools, opening a new avenue for the investigation of living organisms: their study in silico (complementing studies in vivo and in vitro).


Concepts

Sequence acquisition has been growing exponentially for twenty-five years. In parallel, the power of computing has also increased in an exponential fashion, and it has become possible to foresee treatments of biological data that would have been impossible just a few years ago. Because DNA and protein sequences can be seen as linear texts of an alphabetical writing, they pertain to the theories of information, which take precisely such texts as their privileged objects of study. In what follows we develop ideas showing that the alphabetic metaphor and the corresponding treatment of information is likely to have a very deep meaning in terms of genomes. This will require a progression in the conceptualisation of information, and, because biologists are not always familiar with the underlying mathematics, we give rough outlines of the principles so that they can follow the reasons for progressively refining our concepts of information.

When considering living organisms, one often casually refers to the "information" they carry, namely in their genomes' DNA sequence. Following John Myhill's terms, information is a "prospective character", which refers to the identification or localisation of control processes involving small amounts of energy that result in large changes in processes involving large quantities of matter or energy. A very elementary, but useful and often used, concept of information was derived by Shannon. This author defined the information of a collection of digit sequences transmitted through an electromagnetic wave channel subjected to some noise, by comparing the sequences before and after transmission. The information concept invented by Shannon is a very crude view of a message, explicitly considering not its meaning but the local accuracy of its transmission. Genomes can be analyzed using this view, especially if one considers only the information transmitted during the process of replication, leaving aside the information involved in gene expression (i.e. replication does not see the meaning of the message). This concept of information can bring interesting properties into the limelight, but one should always bear in mind the fact that the actual information content of a sequence must be much richer (it has a meaning), unless one only considers the information seen by the replication machinery. We shall see later other ways of considering the information content of sequences, where the global signification is better taken into account.

Shannon's theory of information derives from a subset of the general theory of measurable spaces. Given a measurable space Ω (the space of events) and a, a part of Ω, if ω is an event chosen at random, one would like to say that knowing that ω belongs to a gives us some information on ω. Thus, the information brought about by asserting ω ∈ a is noted H(a). H (Shannon's information) is defined in probabilistic terms as follows: H(a) = ƒ(pa), where pa is the probability that the assertion on a is true (the smaller a, the easier the location of ω); ƒ is supposed:

(1) to be a positive, continuous and decreasing function of the probability, with ƒ(1) = 0 (a certain event brings no information).

In addition:

(2) to be additive for independent events: ƒ(p × q) = ƒ(p) + ƒ(q).

Finally:

(3) the information H of a partition of Ω into parts a is defined as the expectation of the information of its parts: H = ∑ {pa ƒ(pa)}.

A consequence of this definition of information is that if Ω is split into n equivalent separate subsets Ωn, one has, after property (3) of H:

H(Ωn) = n × (1/n) × ƒ(1/n) = ƒ(1/n)

This can be used as a starting point to show that the function ƒ is:

∀ p ∈ (0, 1] ƒ(p) = - log2(p)

More generally, let i ∈ I be a random variable with probability law pI (where pI = {pi | i ∈ I}); Shannon's information then becomes:

H(pI) = - ∑ {pi log2 pi | i ∈ I}

It can easily be seen that such a definition of information is highly schematic and far from being able to describe the properties of what common sense would place under "information". Indeed, a more elaborate view would indicate that considering an event in isolation is meaningless in most actual cases: one needs, for instance, to possess a means to recover the information considered. Also, when one considers life, the present approach to information does not take semantics into account, as would be necessary if one considers gene expression.

Remarkably, another way to consider the information carried in a message, when it corresponds to an instance of a family of messages considered as similar, leads to the same equation. If pr is the probability of a message when received, and ps the probability of this message when it is sent, then the information corresponding to the quality of the message is H = log2 (pr/ps). If one does not take into account the loss of information during transmission (message transmitted without error), then pr is equal to unity. And if there is a large number of similar messages, the corresponding information is H = - ∑ ps log2 ps. One notes that the equation obtained still has the same form as in the preceding description.

In the case of DNA, what is taken into account is the probability of a message, which, in the usual description of genomes, reduces to the actualization of a succession of letters (chosen among A, T, C or G) at a given position of the genome text. The specific information which is analyzed corresponds to the difference between an actual collection of DNA sequences and a random sequence having the same base composition. An evident improvement would be to consider not the one-letter successions of the programme, but the relevant words (sequences of letters of definite length). The different approaches involving Markov process analysis of the genome text are a step in this direction. However, this assumes that one knows which words are relevant! It seems therefore that analysis of the genomic text using classification methods is a very important prerequisite for an in-depth analysis of its information content.
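
To make this first, purely statistical level of analysis concrete, here is a minimal sketch in Python (the example string is invented, not taken from any genome) that computes the Shannon information of one- and two-letter words and measures how far the sequence departs from independence of successive letters:

```python
from collections import Counter
from math import log2

def shannon_entropy(counts):
    """Shannon information H = -sum p log2 p of a frequency table."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

def word_counts(seq, k):
    """Counts of overlapping k-letter words in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "ATGCGCGATATATCGCGCTAGCATGCATGCGCGATCGATCGTAGCTAGCTAGGCGCGAT"

h1 = shannon_entropy(word_counts(seq, 1))   # information per letter
h2 = shannon_entropy(word_counts(seq, 2))   # information per 2-letter word
# Under independence of successive letters, H(2-words) would be 2 * H(1-words);
# the deficit measures the information carried by local correlations.
print(f"H(1-words) = {h1:.3f} bits, H(2-words) = {h2:.3f} bits")
print(f"deficit relative to independence: {2 * h1 - h2:.3f} bits")
```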

We have developed the mathematical aspect of Shannon's information at some length because we wanted to display the logarithmic law it leads to. Indeed, the form of this law is similar to that of another law, discovered in the domain of thermophysics. And this coincidence has had unfortunate conceptual consequences when information was related to life, because the ideas of thermophysics contaminated those of information theory, as we shall now see (see also Yockey, 1992).

It is quite usual to read that there is identity between a form of thermodynamic entropy and the information carried by a given process (but see the analysis by Zurek, 1989). As a matter of fact, Shannon's information is often named Shannon's entropy, but this was not the choice made by Claude Shannon himself.

This naming is a consequence of the remark that there is identity between the form of Shannon's law H and a form of entropy described in terms of statistical mechanics, in particular when describing the specific case of the perfect gas. However, identity in the form of the mathematical description of a process is by no means a sufficient reason to identify the underlying principles, or the processes themselves, with each other. Moreover, several words such as information, entropy or order have been used in many different contexts, meaning in each context something very precise, but usually very different from what they mean in another context (see The Delphic Boat for more). Because the ideas of information and order are often used in biology and may influence the exercise of the corresponding knowledge, it seems useful to explore some of the underlying reasons that have permitted authors, in several instances, to identify information with the negative of entropy.

Indeed, the reason underlying the frequent identification between information and entropy is ideological in nature, and corresponds to a mistaken identification of disorder with entropy, coupled with an unstated view of the world as spontaneously evolving towards a "bad" ordering. The underlying assumption is that the world is constantly going towards an unavoidable disorder, a fact against which we should fight (see Schrödinger's What is life? and the foreword to its French translation). The starting point of such pessimistic views of the world is the mechano-statistical model of thermodynamics, founded upon the very specific and restricted case of the perfect gas. In this model, gas atoms are hard, impenetrable spherical points which interact by purely elastic collisions. The popular description is then that of a container made of two independent chambers, containing different gases and separated by a rigid wall. At the start of the process a small hole is drilled through the wall. It is subsequently easy to observe that the container evolves in such a way that after some time there is an equivalent mixture of both gases in both chambers, each gas having a tendency to occupy the volume from which it was absent, until a final equilibrium where the partial pressure of each gas is equal in both chambers, an equal number of gas atoms going from one chamber to the other and vice versa. If one wanted to prevent homogenisation, it would be necessary to place a little demon at the hole, in control of a door, permitting him to close the door whenever the wrong gas is about to enter the chamber where it does not belong. One then asserts that the action of the demon corresponds to the knowledge of information about the nature of the gases. This is thought to be enough to justify the identification between information and entropy. Many technical objections have been raised against such a primitive description, but one is compelling: in this way of accounting for a system, its entropy depends on its level of description. However, entropy can also be defined as a macroscopic physical entity linked to the total energy of a system and its temperature. It is therefore independent of the model. Thus, the identification between entropy and disorder is disputable (consider a two-dimensional representation: a blue gas in a square chamber is separated from a yellow one; drilling a hole in the separating line will yield a green rectangle through mixture of the blue and yellow gases. Is the green rectangle more disordered than the juxtaposition of a blue square and a yellow one?). In addition, the identification between entropy and information is also misleading, because one can immediately notice here that information is not only a qualification of the state of a given system, but requires knowledge both of the system and of the observer. Like order, information is a relative concept and must take several systems into account (see The Delphic Boat, Chapter 4).

In fact, Shannon's measure of entropy only reflects a mathematical property of statistical analysis, as discussed below. For the moment, however, limiting Shannon's information to what its mathematical form assumes it is, it can be used for a first description of the information content of DNA sequences. This first level of local information can be used for the identification of coding regions in DNA. A more elaborate use of Shannon's information is also present in the analysis of gene sequences using periodic Markov chain analysis, as demonstrated by Borodovsky and co-workers. A consequence of this type of work is the obtaining of large collections of gene sequences from a given organism. This brings us to the use of informatics in the analysis of whole genomes.
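
As an illustration of the idea behind periodic Markov analysis (a deliberately simplified, zeroth-order 3-periodic model, not Borodovsky's actual method; the training fragments and the test window are invented placeholders), one can score a window by how much better it fits frame-dependent base statistics than a homogeneous background:

```python
from collections import Counter
from math import log2

def periodic_frequencies(coding_seqs):
    """Base frequencies at each of the three codon positions (0-order, 3-periodic model)."""
    counts = [Counter(), Counter(), Counter()]
    for s in coding_seqs:
        for i, base in enumerate(s):
            counts[i % 3][base] += 1
    return [{b: c[b] / sum(c.values()) for b in "ACGT"} for c in counts]

def coding_score(window, periodic, background):
    """Log-likelihood ratio of a window under the 3-periodic vs the homogeneous model."""
    score = 0.0
    for i, base in enumerate(window):
        p = max(periodic[i % 3].get(base, 0.0), 1e-6)
        q = max(background.get(base, 0.0), 1e-6)
        score += log2(p / q)
    return score

# hypothetical CDS fragments used as a training set (placeholders, not real genes)
training = ["ATGGCTAAAGAAATTCTGGAA", "ATGAAAGCAGGTCTGACCGAA"]
periodic = periodic_frequencies(training)
bg = Counter("".join(training))
background = {b: bg[b] / sum(bg.values()) for b in "ACGT"}

print(coding_score("ATGGAAGCTCTG", periodic, background))   # higher = more coding-like
```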

A genome is the result of a historical process which has driven a variety of genes to function collectively. Many biologists using computers have experienced the following: when one scans through a sequence library for similarity with a query sequence, one often ends up with a list of sequences that have in common only the organism to which the initial sequence belongs. It is as if there existed a style specific to each organism, independent of function (an image proposed by P. Slonimski is that of columns in a temple, which can be of Corinthian, Doric or Ionic style, while always having the same function). Is this impression true? If so, it indicates that there must exist some sort of mutual interaction permitting the cell to discriminate between self and non-self. It seems therefore interesting to investigate the meaning, at the lowest possible level, of what could be mutual information.

In Shannon's terms, it is natural to define mutual information between pairs of events. This is a function that is always less than or equal to the information that one possesses on the individual corresponding events:

H(pIJ ; pI . pJ) = H(pI) + H(pJ) - H(pIJ)
= ∑ {pij log2 (pij / pi . pj) | i ∈ I, j ∈ J}
= ∑ {pi . pj (pij / pi . pj) log2 (pij / pi . pj) | i ∈ I, j ∈ J}
= ∑ {pi . pj ƒ(pij / pi . pj) | i ∈ I, j ∈ J}

(with, here, ƒ(x) = x log2 x).

Now, in usual statistics, there exists another means to evaluate the correlation between events, using a specific metric known as the χ² test. This metric is meant to evaluate the mutual distance (which can be interpreted as some mutual information) between two laws. It is designed in such a way that the distance vanishes when the joint law pIJ is equal to the product pI . pJ, that is, when all the knowledge about the association of the two laws results from the knowledge of each one separately:

||pIJ - pI . pJ||² = ∑ {(pij - pi . pj)² / pi . pj | i ∈ I, j ∈ J}
= ∑ {pi . pj [(pij / pi . pj)² - 1] | i ∈ I, j ∈ J}
= ∑ {pi . pj ƒ(pij / pi . pj) | i ∈ I, j ∈ J}

where, here, ƒ(x) = x² - 1, knowing, by the definition (normalisation) of the laws, that the sum of the products pi . pj, as well as that of the joint law pij, is equal to 1.

One can then remark that the final expression has a general form similar to that of Shannon's mutual information law: one simply has to exchange the distance function ƒ involved in the χ² paradigm, where it is x² - 1, for the one in Shannon's law, where it is x log2 x. Both functions have a very similar behaviour around x0 = 1: both vanish there and, up to a constant factor, have the same first and second derivatives at that point. In fact, if one proposes to measure mutual information by various plausible laws (mutual "distances"), one is led to the conclusion that the envelope of all such laws is Shannon's information function. This has the consequence that the function x log2 x is the natural function to be used in most cases. But this conclusion can also be extended, implying that it is natural to find this same function in many otherwise unrelated situations: when finding that the representation of a phenomenon yields a function of the form x log2 x, one should not infer that the analogy extends to the nature of the phenomena described by this same function!
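
The parallel drawn above can be checked numerically; the following sketch (with an invented example string, nothing more) computes both the mutual information and the χ² distance between the joint law of adjacent letters and the product of its margins:

```python
from collections import Counter
from math import log2

seq = "ATGCGCGATATATCGCGCTAGCATGCATGCGCGATCGATCGTAGCTAGCTAGGCGCGAT"
pairs = Counter(zip(seq, seq[1:]))
n = sum(pairs.values())
pij = {k: v / n for k, v in pairs.items()}          # joint law of adjacent letters
pi, pj = Counter(), Counter()
for (a, b), p in pij.items():
    pi[a] += p                                      # marginal law of the first letter
    pj[b] += p                                      # marginal law of the second letter

letters = sorted(set(seq))
mutual_info = sum(p * log2(p / (pi[a] * pj[b])) for (a, b), p in pij.items())
chi2_dist = sum((pij.get((a, b), 0.0) - pi[a] * pj[b]) ** 2 / (pi[a] * pj[b])
                for a in letters for b in letters)
print(f"mutual information = {mutual_info:.4f} bits, chi-squared distance = {chi2_dist:.4f}")
```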

Methods using χ² are numerous (they can even be used in learning techniques for the partition of training samples). Because it is very powerful but rarely used in biology, we shall refer to only one such method, Correspondence Analysis (CA), which makes use of χ² distances for the classification of objects without a priori knowledge of the classes. The underlying idea is that one can construct pertinent classes in which objects do not necessarily share simultaneously all of the characters defining the objects of a given class, and can nevertheless be said to belong to the same class because they share most of the characters defining it. Such an analysis permits one not only to build up classes, but also to build up trees describing the relationships between classes. This permits one, under some conditions, to build up hierarchical clusters reminiscent of phylogenies. CA has been used to study codon usage in individual genomes. In the absence of any other knowledge, coding DNA sequences (CDSs) can be described by the usage of the codons specifying each amino acid residue of the polypeptide they encode. Accordingly, each CDS is represented as a point in a 61-dimensional space, each dimension corresponding to the relative frequency of one of the 61 codons. The set of CDSs is displayed as a cloud of points in the space of normalized codon frequencies. Using the χ² distance between each pair of CDSs, CA allows calculation of the 2-D projection of the cloud of points yielding maximum scattering. On such a projection, genes that have a similar codon usage will appear as neighbours (but the converse is not necessarily true). In order to analyze this graphical representation in terms of individual classes, it is necessary to use a second method that automatically clusters the objects (here, the CDSs) which are close to one another. For example, in a first step, one splits the collection of objects into k groups by a dynamic clustering method; then, in a second step, objects which are always clustered together in the course of the different partition processes are selected. Analysis of E. coli CDSs revealed that the hypothesis requiring the minimum of assumptions is a clustering into three well-separated classes. The codon bias is very strong in class II, intermediate in class I and weak in class III. The proof that this partition is significant comes from the observation that these three classes of E. coli genes can also be clearly distinguished by their biological properties. Class I contains those genes that maintain a low or intermediate level of expression, but can potentially be expressed at a very high level (e.g. the lactose operon). In contrast, class II contains genes that are constitutively expressed at a high level during exponential growth. Most of these genes are involved in the translation and transcription machinery, as well as in the folding of proteins. In class III, the corresponding genes have been shown to be involved in functions required for the horizontal transfer of genes.
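
The first step of such an analysis is easy to sketch; the fragment below (toy sequences, not actual E. coli CDSs) builds the 61-dimensional codon-usage profiles and computes χ²-type distances between them, the raw material on which a full correspondence analysis would then perform its projection:

```python
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)
          if "".join(c) not in ("TAA", "TAG", "TGA")]          # the 61 sense codons

def codon_usage(cds):
    """Relative frequency of each of the 61 sense codons in a CDS."""
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in CODONS) or 1
    return [counts[c] / total for c in CODONS]

def chi2_distance(u, v, column_mass):
    """Chi-squared-type distance between two codon-usage profiles."""
    return sum((a - b) ** 2 / m for a, b, m in zip(u, v, column_mass) if m > 0)

toy_cds = {                                    # invented toy sequences, not real genes
    "geneA": "ATGAAACGCATTAGCACCACCATTACCACC",
    "geneB": "ATGAAGCGTATTTCTACGACGATAACGACG",
    "geneC": "ATGAAACGCATTAGCACCACGATTACCACG",
}
profiles = {name: codon_usage(s) for name, s in toy_cds.items()}
# column masses = average frequency of each codon over the whole collection
masses = [sum(p[i] for p in profiles.values()) / len(profiles) for i in range(len(CODONS))]

for a in toy_cds:
    for b in toy_cds:
        if a < b:
            print(a, b, round(chi2_distance(profiles[a], profiles[b], masses), 3))
```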

With these examples, we observe that investigation of the information content of a genome at a first level is rewarding. In particular, it tells us that the actual selection pressure on the fine evolution of sequences inside genes becomes visible when one possesses a large collection of genes for analysis. Because this corresponds to a fine variation, such as a general bias in nucleotide distribution in genes, and involves many genes at the same time, it becomes difficult to propose experiments to test whether the interpretation of the observation adequately reflects reality. Indeed, it must be difficult to submit a modified organism to a "normal" life-cycle (i.e. a life-cycle that differs strongly from laboratory conditions, including chemostats) and to test whether its DNA has evolved according to the predicted consequences of a model. In some cases, however, experiments have been performed which can answer some of the predictions: Yanofsky and co-workers, for example, have submitted a culture of E. coli lacking a functional mutT gene (the product of which prevents AT to CG transversions) to evolution during a large number of generations, and they observed that, after some time, the overall content of the DNA was significantly more GC-rich than that of the parental strain.

In the examples just given, the information content of a sequence was considered mostly as a function of its local value. A more appropriate way to consider the same information is to evaluate it as a function of its context, as we shall now see.

A genome sequence is certainly far from random. But this is not only because it actually contains functional information (i.e. information used for the definition of molecules that operate in metabolism and cell construction), but also because the starting DNA molecules were certainly not random. For this reason, given a sequence, it is very difficult to estimate its actual probability: one has to compare it to an unknown baseline. When studying the frequency of words (such as TA or CCTTGG, for instance), one has to estimate the probability of occurrence they would have if the sequence were random. This does not mean that the sequence can be compared to a purely stochastic sequence of letters with the same overall probabilities of occurrence: a way to overcome this difficulty is to choose regions of the genome, such as coding sequences, where it is known that the major selection pressure has operated only indirectly on DNA, and to use them to calculate a baseline. For example, one can study the frequency of words overlapping codons (four or more letters) by comparing the observed number of occurrences of a given word with the number expected if all the codons used in every CDS were taken into account, but their actual succession were random. In this case most of the bias comes from the style of the genome (which can, in part, be reflected by a non-random frequency of dipeptides in proteins, resulting in some contribution to the bias in the usage of words). As an illustration, it can be observed that among 4-letter palindromic sequences, CTAG is counterselected in many organisms (except those where C is modified, such as bacteriophage T4, where C is replaced by hydroxymethylcytosine). This seems to be due to a special conformation of the double-stranded helix when it carries such motifs, resulting in some weakness (perhaps inducing spontaneous breaks). In all other cases the word frequencies reflect the specific style of each genome.
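
The word-counting procedure just described can be sketched as follows (the coding sequence is invented; a real analysis would concatenate all the CDSs of a genome): the observed number of occurrences of CTAG is compared with its expectation when the same codons are kept but their order is shuffled:

```python
import random

def count_word(seq, word):
    """Number of (possibly overlapping) occurrences of a word in a sequence."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq[i:i + len(word)] == word)

def expected_by_codon_shuffling(cds, word, trials=1000, seed=0):
    """Mean count of the word when the codons are kept but their order is randomized."""
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    total = 0
    for _ in range(trials):
        shuffled = codons[:]
        rng.shuffle(shuffled)
        total += count_word("".join(shuffled), word)
    return total / trials

cds = "ATGGCTAGCAAACTAGGTCGTCTGGCTAGCGAAACCCTAGAAGGT"   # invented toy sequence, not a real gene
print("observed CTAG :", count_word(cds, "CTAG"))
print("expected CTAG :", round(expected_by_codon_shuffling(cds, "CTAG"), 2))
```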

Is it possible to go farther and extract significant global properties of genomes? Relationships between objects, such as those which are of fundamental importance in biology, should be taken into consideration. In particular, the very nature of the information carried, from generation to generation, in the DNA molecules that constitute genes is not at all present in the analogy between entropy and Shannon's information. For example, "context", "meaning" or "semantics" are, by construction, absent from this extremely primitive model. Biology provides us with a metaphor that displaces the idea of information towards a new field, that of programming and informatics. Is there more insight in these new fields than in the "natural" way of considering information? At least since 1965, Kolmogorov and Chaitin, following Solomonoff in the USA and the Russian school of electronics engineers, have formulated the problem in detail in the following way (for a general description see Yockey, 1992). Let us consider the simple case of a chain of characters such as those found in computer science. What can be said about its information content? One way to consider the chain is to try to reduce its length so that it can be accommodated in a memory using the minimum space, without altering its performance when used, for example, in a program (at least in terms of accuracy, if not in terms of time), in short, without losing its information content. This is called compressing the data. This problem is of very broad and general interest. Because a chain of characters can be identified with an integer, Turing's universal computation rules apply. It is in particular possible, given a chain, to define the shortest formal program (in terms of Turing's universal computation algorithms, i.e. algorithms that can be implemented on any machine operating on integers) that can compress the original chain, or restore it from its compressed state. The information value of a chain S is therefore defined in this model as the minimal length of the universal program that can represent S in a compressed form.

With this definition it appears that a completely random chain S cannot be compressed, implying that the minimal program required to represent S is identical to S itself. It should be noticed here that an apparently random chain may nevertheless be compressible: this is the case, for instance, of the chain formed by the digits of the decimal writing of the number π, which can be generated by a short program. In contrast, a chain made of repetitive elements can be summarised as a very short program: the sequence of the element, and the instruction "repeat". In this context, the information of a sequence is defined as the measure of its compressibility (one often uses the word "complexity" rather than information in this context). It is however apparent that this is not all that can be said, should we wish to define information in a finer sense. Indeed, the chain of digits of π is highly compressible (it reduces to a short program), as is the chain 01010101…, but we would like to say that the information imbedded in π is richer than that imbedded in 01010101…
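
A crude but instructive proxy, under the assumption that a general-purpose compressor approximates compressibility in the above sense, is to compare the compressed sizes of a repetitive chain, a pseudo-random chain and a mixed, genome-like chain of the same length:

```python
import random
import zlib

random.seed(0)
n = 6000
repetitive = "ATGC" * (n // 4)
random_like = "".join(random.choice("ATGC") for _ in range(n))

blocks = []                       # a "genome-like" mixture of repeats and random stretches
for _ in range(n // 20):
    if random.random() < 0.5:
        blocks.append("ATGC" * 5)
    else:
        blocks.append("".join(random.choice("ATGC") for _ in range(20)))
mixed = "".join(blocks)

for name, s in [("repetitive", repetitive), ("random", random_like), ("mixed", mixed)]:
    ratio = len(zlib.compress(s.encode(), 9)) / len(s)
    print(f"{name:10s} compressed/original = {ratio:.2f}")
```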

Represented as sequences of letters, genes and chromosomes have an intermediate information (complexity) content: one finds local repetitions or, in contrast, sequences which are impossible to predict locally. Their complexity is intermediate between randomness and repetition. The apparent complexity of sequences originating from higher eucaryotes or from procaryotes is very different, and this links genomic sequences to both of the seemingly uninteresting extremes of information (because it corresponds, on the eucaryotic side, to repetition and, on the procaryotic side, to randomness). The complexity of the former is more repetitive, and usually looks much lower than that of the latter, which looks more random. This is quite understandable if one remembers that bacterial or viral genomes are subjected to stringent economy constraints implying that they must remain very compact. In contrast, the genomes of higher eucaryotes are very loosely constrained, and they contain, for instance, many repetitive sequences.

However, would it not be more natural to say that eucaryotes are more complex than procaryotes? We propose here that this difference in the form of algorithmic complexity is perhaps the actual cause of the main morphological differences between procaryotes and differentiated organisms. The former, following their strategy for occupying biotopes thanks to a rapid and versatile adaptive power, are forced to keep to a small genome. This implies that several levels of meaning must often be superimposed in a given sequence (for example promoters overlapping CDSs). And the apparent result of this constraint is that their actual algorithmic complexity looks more like that of a random sequence. But this constraint has an important physical consequence (because DNA is not only an abstract sequence of symbols, but also a molecule that must occupy space): superposing signals on the same DNA stretch generally precludes combining synchronous recognition processes by specific proteins, for it is not possible to place two different objects at the same time at the same location. But to exclude such combinations (which alone can fulfil the logical principle of the excluded middle) prevents investigating the refined principles of a delicate modulation of gene expression. The exactly opposite situation is found in differentiated organisms such as animals (and perhaps in the regions of procaryotic genomes displaying a low complexity). The lack of limitation on the length of the DNA text permits the juxtaposition of recognition signals for regulatory proteins, and allows exploration of the properties of their association according to the rules of combinatorial logic. This is likely to be what permits the differentiated expression observed in the various cell types. Indeed, this is already well understood for the regions controlling transcription (promoters, enhancers and the like), but it is probably also the reason for the multiplication of introns in higher eucaryotes, as they behave as inserts without apparent signification within the very coding regions of genes. Thus, analysis of information in terms of algorithmic complexity permits us to propose explanations accounting for the differences between the major types of organisms, whereas Shannon's information only gives a meaning to the conservation of a DNA sequence, with a low level of errors, from generation to generation, through the replication machinery.

Many analyses of genomes indirectly pertain to the analysis of their algorithmic complexity, and this provides an interesting view of the general organisation of base sequences in chromosomes. But this takes genomes as fixed, synchronous entities. In general terms, higher eucaryotes seem to be much more complex than bacteria, if only because they are able to exhibit time-dependent differentiation patterns. It appears therefore that something more should be said about their information content: information cannot be identified simply with algorithmic complexity, so that Chaitin and Kolmogorov's information only represents part of what we are interested in. In the examples just provided, the concept of information did not imply the notion of value (which depends both on the object considered and on the nature of the problem in which the object is involved). In fact, if one considers a piece of information, its actual availability must have some value. Information which is within the reach of everyone must differ from information which is difficult or slow to obtain. In the case of living organisms, knowledge of the paths which have been followed through evolution, leading to the organisms as we know them today, is a very valuable piece of information, difficult to obtain. Indeed, having access to this type of information may require the introduction of time, as we shall now consider.

We have seen that we would like to have an idea not only of the complexity of a sequence (as evaluated using the above approaches), but also of the availability of the information contained in the sequence. This is particularly relevant in the case of DNA sequences, because all that can be inferred from the knowledge of a sequence derives from the accessibility of its actual information content. To see more repetitions in eucaryotes and more apparent randomness in procaryotes does not account for the paths which have led to such a large overall difference between genomes. We wish to know the usefulness of the information in the sequence. A very short recursive program can be written to produce the digits of π, and it is possible to write a program of the same length generating a repetitive sequence. But the availability of the information content of the outcome of the corresponding programs is very different. Although it is easy to predict, after a very short computation, the value of the nth digit of the repetitive sequence, the same is not (yet) true for π. In the latter case, if n is chosen large enough, a long computation time is required before the digit is generated by the program permitting calculation of π.

This observation is especially important when one considers sequences generated by recursive programs. In many cases, the program can be written in a short form because the procedure that is reiterated is nested inside the program. The information content of the sequence, in terms of algorithmic complexity, is therefore small, because it can be summarized as a short program. But the information on a given digit could be extremely tedious to obtain, available only after a very long computation time. In this respect, the knowledge of the precise value of this digit, the "information" on it, would seem to be very high, unless we had another, quick means to obtain it. Bennett has formalised this aspect of sequences in terms of "logical depth". The logical depth of a sequence S is the minimum time required for generating S using a universal Turing machine. Obviously, knowing a sequence, it is not possible to know its algorithmic complexity. It is even less obvious to have an idea of its logical depth.
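
The distinction can be illustrated, very crudely, by two equally short generators: one whose nth symbol is available immediately, and one (an iterated hash, used here merely as a stand-in for π, since no shortcut to its nth value is known) that has to be replayed step by step:

```python
import hashlib
import time

def nth_symbol_repetitive(n, motif="01"):
    """n-th symbol of 010101...: available in constant time."""
    return motif[n % len(motif)]

def nth_symbol_iterated_hash(n, seed=b"genome"):
    """First hex digit of the n-th iterate of SHA-256: requires n sequential steps."""
    h = seed
    for _ in range(n):
        h = hashlib.sha256(h).digest()
    return h.hex()[0]

for n in (10, 100_000):
    t0 = time.perf_counter(); a = nth_symbol_repetitive(n); t1 = time.perf_counter()
    b = nth_symbol_iterated_hash(n); t2 = time.perf_counter()
    print(f"n = {n}: repetitive '{a}' in {t1 - t0:.1e} s, iterated hash '{b}' in {t2 - t1:.1e} s")
```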

For this reason, the conceptual analysis made above might appear to be only philosophical (and philosophical is a bad word in English, though not in the Latin or Greek civilisations!). Note, however, that deep thinkers such as the late Rolf Landauer, who was at the root of many important concepts in information theory and demonstrated both experimentally and theoretically that widespread ideas about entropy and order, as well as about the creation of information, are plainly wrong, established that creating information does not require consumption of energy.

Here we can propose efficient means to analyze both the algorithmic complexity and the logical depth, by proposing algorithms meant to generate the sequence under analysis. For this we must propose educated guesses, and our contention is that, starting from an analysis of the underlying biological processes, we can indeed propose constructive hypotheses. For example, trying to represent how RNA polymerase recognizes its promoter will permit us to generate an algorithm that identifies promoters in sequences. Cases in point are the algorithms proposed by d'Aubenton et al. for the recognition of transcription terminators, or by Michel and co-workers for the recognition of group I introns. The same is true in the case of logical depth: if this aspect of the information carried by DNA is relevant, then the algorithms needed for analysis should require significant computing time to give interesting outputs. This is particularly true in the case of phylogenetic analyses, and indicates that algorithms that perform too fast are necessarily flawed with respect to the biological relevance of their output. A positive side of this rather negative observation is that time adds a new dimension to sequence analyses: it could be interesting to add, to the features permitting comparisons between sequences, the time required by the computation algorithms performed on them, because similar sequences ought to require similar computation times.

Logical depth relates to the Aristotelian distinction between potentiality and reality. It rightly suggests that one should not be allowed to speak in terms of potentiality as if speaking in terms of reality. There are cases where speaking of potentiality is nonsense, because the actualisation of the potential, in true reality, is impossible within a reasonable period of time. Only those facts that are mechanistically generated could permit such a gross identification between potentiality and future reality. It must be stressed here that reflection on this aspect of information does not completely exhaust the non-clarified (natural) concept of information. In particular, it rests on the existence of an abstract, but perfect, Universal Turing Machine. In the case of living organisms (provided that one can identify their behaviour with that of Turing machines), the building up of the machine is the result of the actualisation of a finite program present in the machine. But here one should take time into account once again, and consider only those algorithms whose length is smaller than the lifetime of the machine as a given, unaltered structure. Indeed, a time limit, constraining the algorithms that permit the construction of the machine, must also be taken into account: it defines a critical depth, likely to be of major importance in the case of life. We shall limit ourselves to these descriptions, however, and now consider the specific case of living organisms.

Since the time when protein and nucleotide sequences became available (and this goes back to the early fifties for protein sequences), scientists have attempted to compare sequences with each other, and to align those sequences that were supposed to match each other after having diverged from a common ancestor. A paragon of such comparative studies was the Atlas of Protein Sequence and Structure developed by Margaret Dayhoff and her co-workers. After thorough analyses of the phylogenetic divergence in families of globular proteins such as cytochromes or globins, this work permitted the construction of a correspondence matrix between amino acids that reflected some of the evolutionary fate of each residue. Hundreds of articles have since been written proposing general and less general approaches for creating alignments of sequences and generating the corresponding consensus.

While it was relatively easy to align similar sequences, it was often impossible to align very divergent sequences unless some other related sequences were also known and compared in parallel (multiple rather than pairwise alignment). This resulted in the hunt for intelligent algorithms permitting alignments and multi-alignments, and providing the corresponding consensus of multiple sequences. However, it is clear that the production of alignments requires not only a matrix of correspondence between nucleotides or similar amino acid residues, but also a choice of penalties assigned to the gaps which are present in the sequences when they are aligned. The use of dynamic programming can provide a first set of aligned pairs. Many programs, derived from or created independently of the Needleman and Wunsch algorithm, including implementations on dedicated hardware, have been proposed. Unfortunately, these programs cannot be used for large sets because of combinatorial explosion. In contrast, looking for common words in a composite sequence made of the concatenation of all the sequences to align, followed by a manual fine alignment in which the unformalized knowledge of the biologist is used, can yield interesting results. In this process, matrices meant to represent similarities between amino acids can be used for protein alignment. It is interesting to use matrices which take into account the constraints on amino acid residues measured from comparisons of known 3D structures, and multiple alignments can also incorporate knowledge from such structures. But it must be stressed that a single matrix is used throughout the sequences, meaning that one deals with a composite view of the equivalence rules between amino acids at each position in the sequence. This cannot represent what happens in reality, because at some positions, for example, it is the aromatic character which is important (meaning that F, Y and W are equivalent), whereas at other positions it could be the fact that the residue is big (meaning that F, Y, R, H and W are equivalent), so that one should use an equivalence matrix for each position in the sequence. This indicates that new research programmes must still be developed in this field, despite the large number of papers already published. As a consequence, the consensus approach is bound to fail as soon as sequences diverge significantly.
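
For reference, here is a minimal sketch of global alignment by dynamic programming, the idea behind the Needleman and Wunsch algorithm mentioned above (flat match/mismatch scores and a linear gap penalty; real protein alignments would use a substitution matrix and, as argued here, ideally a position-specific one):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences by dynamic programming."""
    n, m = len(a), len(b)
    # score[i][j] = best score of aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # traceback to recover one optimal alignment
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

print(*needleman_wunsch("GATTACA", "GCATGCA"), sep="\n")
```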

Thus, when significant numbers of sequences are compared with each other, one may wish to extract a significant pattern not as a consensus sequence but, rather, as a matrix of probabilities reflecting the frequency of a given motif at a given place. This is reminiscent of pattern recognition problems which involve, for example, learning techniques or techniques used in pattern recognition by vision. It is not unexpected, therefore, that methods involving neural networks, or the precursor of this field of informatics, the Perceptron, have been used for generating pertinent patterns. Stormo and co-workers pioneered the field by using a Perceptron approach for creating matrices permitting identification of the ribosome binding sites in E. coli messenger RNAs, and they have been followed by many other authors, both for the study of DNA and of protein sequences. In addition to resting on the hypothesis that one must find consensus matrices, a main drawback of the Perceptron approaches, unfortunately also present in many neural network learning techniques, is that they involve formal "synapses" whose transmitting strength evolves quantitatively as a function of their actual use. This means that, except in the cases where there is obvious and strong similarity between the objects in the training set, the synapse strength fluctuates, going up and down. As a consequence, when the set size increases, the strength of each synapse generally goes through an optimum value, and then slowly goes back towards a more or less average value as more exceptions invade the training set, thus losing discrimination power. Accordingly, as the training set of examples increases in size, the consensus matrix created by the learning procedure loses accuracy. Ancestors of neural networks which evolve in such a way that the effective transmitting capacity of a synapse goes irreversibly to zero when its value falls below a certain threshold, freezing the learning state at an optimal value, have been proposed (Danchin, 1979). This feature has recently been implemented in neural networks (Horton and Kanehisa, 1992). It would therefore be important to compare and develop new approaches involving neural networks, with emphasis on the stability of their learning capacity as a function of the training sets.
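
A toy version of this family of approaches (not Stormo's original Perceptron; the training examples are invented, loosely inspired by Shine-Dalgarno-like motifs) shows both the learning rule and the irreversible freezing of weak synapses evoked above:

```python
import random

BASES = "ACGT"

def one_hot(window):
    """Encode a DNA window as a flat 0/1 vector (4 positions per letter)."""
    return [1.0 if b == base else 0.0 for b in window for base in BASES]

def train_perceptron(examples, epochs=20, lr=0.1, freeze_below=None, seed=0):
    """Classical perceptron rule, with optional irreversible freezing of weak weights."""
    rng = random.Random(seed)
    dim = len(one_hot(examples[0][0]))
    w, bias = [0.0] * dim, 0.0
    frozen = [False] * dim
    for _ in range(epochs):
        rng.shuffle(examples)
        for window, label in examples:              # label: +1 site, -1 non-site
            x = one_hot(window)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else -1
            if pred != label:
                for k in range(dim):
                    if not frozen[k]:
                        w[k] += lr * label * x[k]
                bias += lr * label
        if freeze_below is not None:                # silence weak "synapses" for good
            for k in range(dim):
                if not frozen[k] and abs(w[k]) < freeze_below:
                    w[k], frozen[k] = 0.0, True
    return w, bias

examples = [("AGGAGG", +1), ("AGGAGA", +1), ("TCCTCC", -1), ("GACGTC", -1)]
w, b = train_perceptron(list(examples), freeze_below=0.05)
print("learned weights for a 6-letter window:", [round(x, 2) for x in w])
```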

Other learning techniques rest on the explicit construction of grammars, i.e. sets of structures and rules permitting the complete description of a sequence or of a network of genes. They can be divided roughly into two classes: learning by assimilation (i.e. by measuring conformity to features of a training set), and learning by discrimination (i.e. by constructing descriptors of the differences between sets of examples supposed to differ from each other). An example of the first type is the Calm software, used for identifying the calcium-binding sites of calmodulin, for instance, whereas an example of the second type is the Plage software, used in the identification of the relevant features of signal peptides in E. coli. Both types depend on a grammar which is used to generate descriptors that are applied to the training sets. In general, grammars are essential to describe biological phenomena, and their efficiency rests heavily on the actual biological knowledge of their conceptors, providing therefore an interesting way to incorporate biological knowledge into automated procedures. For this reason one should probably always consider as the most pertinent the grammars which are devoted to a specific question. A case in point is the work of Collado-Vides, who constructed a grammar for the identification of regulated promoters in Escherichia coli.
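
A schematic sketch of the discriminative, descriptor-based strategy (not the Calm or Plage software themselves; the sequences and the three descriptors are invented for illustration) could look as follows:

```python
HYDROPHOBIC = set("AILMFVWC")
POSITIVE = set("KR")

def max_run(s, alphabet):
    """Length of the longest run of residues drawn from the given set."""
    best = run = 0
    for aa in s:
        run = run + 1 if aa in alphabet else 0
        best = max(best, run)
    return best

# a tiny "grammar" of predicates evaluated on the training sets
descriptors = {
    "positive_start": lambda s: sum(aa in POSITIVE for aa in s[:5]) >= 2,
    "hydrophobic_core": lambda s: max_run(s, HYDROPHOBIC) >= 6,
    "contains_proline": lambda s: "P" in s,
}

signal_like = ["MKKRLLAIALLVFAAASA", "MKRVLVALLALLASASHA"]   # invented positive examples
others = ["MSTNPKPQRKTKRNTNRR", "MEEPQSDPSVEPPLSQET"]        # invented negative examples

for name, test in descriptors.items():
    pos = sum(test(s) for s in signal_like)
    neg = sum(test(s) for s in others)
    print(f"{name:17s} true on {pos}/{len(signal_like)} signal-like and {neg}/{len(others)} other sequences")
```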

The argument just developed is meant to be a substratum for reflection on the metaphor of the program (algorithm) which is currently used in molecular genetics. It is known that a giant molecule, DNA, made of the sequential arrangement of four types of related molecular species, entirely specifies the structure and dynamics of any living organism (adding its own constraints to a world where physics and chemistry operate, permitting in particular the generation of compartmentalisation and metabolism). We shall not consider here the problem of the origins of such organisms, but only try to investigate the nature of the relationship between the DNA sequence and the building up of an organism. As in the case of Turing's reflection in Number Theory, one can surmise that there must exist a machine to interpret the content of a DNA sequence. This machine is a pre-existing living cell, made of a large, but finite, number of individual components organized in a highly ordered way. Given such a machine, a fragment of DNA, provided it contains a well-formed sequence, is necessary and sufficient to produce a specific behaviour representing part or all of the structure and dynamics of an organism. The appropriateness of the alphabetic metaphor used to describe the DNA sequence is further demonstrated by the fact that it is possible to manipulate the four symbols A, T, G, C on paper, and to organise them in such a way that, when interpreted into chemical terms (i.e. linked in the proper sequence arrangement into an artificial DNA molecule), they produce a behaviour having at least some of the properties of real natural biological entities. This is the basis of the survival and multiplication of viruses, as well as of a success story in biotechnology: it is possible to make bacteria synthesise a human growth hormone that has been demonstrated, on children, to be as active as the natural product.

That the machine is, in a way, independent of DNA has also been indicated by the fact that it is possible to clone organisms by replacing the nucleus of a cell with a new nucleus, resulting in the construction of a new organism (i.e. both a new DNA molecule and a new machine). However, DNA alone is absolutely unable to develop into a living organism: this is why the question of origin must be considered if one wishes to resolve this chicken-and-egg paradox, through the coevolution of the machine and of the program. It must be stressed here that, to apply the consequences of Turing's arguments, it is in principle necessary that program and data be well identified and separate entities. As stated above, biotechnology uses this fact to permit the synthesis of human proteins in bacteria. But this corresponds to small genome fragments. Viruses which infect cells do the same in order to reproduce. Finally, at least as a first approximation, bacteria behave as if they separated data (the cell machinery, excluding DNA) and program (the chromosome): a fifth of the Escherichia coli chromosome is made of DNA on which selection operates in such a way that it favours and sustains horizontal exchanges, not only between bacteria of the same species, but also between distantly related organisms (class III genes).

Now, if one considers the role of a DNA molecule in a cell, it is possible to describe the various functions derived from its presence as calculable in an algorithmic way. Many routines in the process are recursive (i.e. they call themselves, but must terminate). It is a generative process which permits the generation both of a DNA molecule and of the machine and data necessary to interpret it. Although this has often been called self-reproduction, it is the production of an organism that is similar, if not identical, to the starting organism. One should speak therefore of self-reference. A living organism is, in this context, the algorithmic actualization of a program, the initiation of which has been obtained through a selective process keeping only those actualizations which are stable in time. And it should be noticed here that this means that selection operates neither on the phenotype (data) nor on the genotype (program) but on the unfolding of the algorithm, i.e. on the organism itself, thus placing in a new light the usual paradox facing all selective theories, and standard Darwinism in particular.

In this respect it is possible to discuss the information content of a DNA sequence, or its complexity, using the various definitions given above. At first sight, most if not all of the sequence must be specified to generate the daughter sequence. Things would be simple, and the corresponding information (that which is required for reproduction) would even be rather poor, if there were absolute fidelity in replicating the original. However, the rule of replication (A -> T, T -> A, G -> C, C -> G) is not absolutely faithful, even if it can be considered to result from an algorithmic process. Variations are imbedded in the replication process, as a function of the general environment of the organism as well as of the local sequence of the DNA segment being replicated. This corresponds to information that should be taken into account, and that is very important indeed, as we shall see later. But it must be emphasized at this point that the existence of error management results in specific features of the DNA sequence. As already stressed, CTAG sequences are generally rare, and in E. coli this is compensated by an excess of CTGG words. Spontaneous deamination of cytosines is likely to be compensated by a bias in the fidelity of replication, tending to create more GC pairs. In turn, this mechanism can result in an overshoot, as seen in E. coli strains lacking the mutT gene. Many other processes certainly model the overall GC content of genomes, resulting in a genome style (in this case, G+C content) which has been taken into account by taxonomists for the preliminary classification of organisms. But such evolution can be much subtler. Chi sites are oriented along the direction of the replication fork in E. coli, and it is likely that, because replication is semi-conservative, there exists some bias in the regions where Okazaki fragments are initiated (this should result in biases having a periodicity of the order of a few kilobases). Another bias should be introduced by the restriction-modification systems present in bacteria, as well as by the existence of methylated sites (as indeed observed in a few cases). Finally, there should exist biases in the DNA composition along genes corresponding to the fine tuning of gene expression as a function of environmental conditions and/or of the location of the transcribed or translated regions in the genes. Hénaut and co-workers have indeed observed such biases, such as the enrichment in C+G of the third codon letter as a function of the position of the codon in the gene sequence. Despite their evidence, these observations are difficult to submit to experimental falsification, because this would require, for example, constructing artificial genes and studying their evolution for a very long period of time in conditions that may be very different from natural conditions (chemostats). It therefore becomes important to make predictive tests using computers. But to what extent can we make predictions?
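
One standard way of looking for strand-related compositional biases of the kind evoked here, offered as an illustration rather than as any of the specific analyses cited, is to compute the GC skew, (G - C)/(G + C), in sliding windows along the chromosome (the input below is a random placeholder, not a real genome):

```python
import random

def gc_skew(seq, window=1000, step=500):
    """GC skew per window; a systematic sign switch suggests a strand-related bias."""
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        g, c = w.count("G"), w.count("C")
        skews.append((start, (g - c) / (g + c) if g + c else 0.0))
    return skews

random.seed(1)
toy_chromosome = "".join(random.choice("ACGT") for _ in range(10_000))   # placeholder, not a genome
for start, skew in gc_skew(toy_chromosome)[:5]:
    print(f"window at {start:5d}: skew = {skew:+.3f}")
```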

Gödel demonstrated that recursive processes have the very interesting property of being open, in the sense that there exist propositions in Natural Number Theory that can be generated from a corpus of axioms and definitions, yet are not decidable within the original corpus. Gödel's demonstration is very ingenious, and it uses a coding process that is certainly reminiscent of the coding process known as the genetic code. However, this demonstration has a specific feature: the aim of the demonstration, permitting Gödel to generate his famous "sentences", is known beforehand. The situation faced by living organisms is different: they cannot predict the future; they can have no aim except (re)producing themselves. Our hypothesis here is that the variation provided by the replication and recombination machinery, permitting the generation of new DNA sequences, is the very process that permits living organisms to generate the equivalent of Gödel's sentences in the form of new, deterministic but utterly unpredictable organisms. The process involves both a procedure for generating variants, and a procedure permitting the accumulation of "adapted" variants. The latter is so-called natural selection, which must be understood as a process screening for stability (and not fitness!) of the surviving organisms. According to this view, the Spencerian "survival of the fittest" is therefore but a surviving variant of "instructive" Lamarckian thought.

If this is accepted, one is faced with the puzzling observation that, to evaluate the actual information content of a DNA molecule, one has to trace back its history, which calls for an extremely slow procedure: the information content of a DNA sequence is so deep (in Bennett's terms) that it would require the age of life on Earth to evaluate its true complexity. Bearing this in mind, we should be very careful and modest when trying to predict the future of living organisms. In addition, one observes that the process of natural selection associated with the generation of mutations provides a "blind" procedure for generating the equivalent of Gödel's sentences (in fact, Gödel was somewhat cheating, because he knew what he wanted to obtain, which living organisms cannot do). This process also generates many more sentences of types whose existence we are unable to predict (but which we can conjecture to exist, simply because at least one type, Gödel's, does indeed exist). This corresponds to the truly creative phenomena that derive from living organisms.

Genome analysis will yield major insights into the chemical definition of the nucleic acids and proteins involved in the construction of a living organism. Further insight comes from the chemical definition of the small molecules that are the building blocks of organisms, through the generation of intermediary metabolism. And, because life also requires in its definition the processes of metabolism and compartmentalisation, it is important to relate intermediary metabolism to genome structure, function and evolution. This requires the elaboration of systems for reconstructing actual metabolic pathways and, when possible, for the dynamic modelling of metabolism, and the correlation of pathways with genes and gene expression. In fact, most of the corresponding work cannot be of much use because the data on which it rests have been collected from extremely heterogeneous sources, and most often were obtained from in vitro studies. The initial need is to collect, organize and update the existing data. In order to be effective and lasting, collecting the data should proceed through the creation of specialized databases. To manage the flood of data issuing from the programmes aiming at sequencing whole genomes, specialized databases have been developed. They make it possible not only to bypass meticulous and time-consuming literature searches, but also to organize data into self-consistent patterns through the use of appropriate procedures aiming at the illustration of collective properties of genes or sequences. In addition to sequence databases, it has thus become important to create databases where the knowledge progressively acquired on intermediary metabolism can be symbolised, organized and made available for interrogation according to multiple criteria.

Using organized data, it will become possible to make in silico assays of plausible pathways or regulatory schemes before being in a position to make the actual test in vivo. Such well-constructed databases will also permit investigation of properties of life which go back to its origin, placing biology in a situation not unlike that of cosmology.

Fortunately, in silico analysis permits one to organise knowledge. To generate new knowledge, why not explore the neighborhoods of biological objects, taking genes as starting points and stressing that each object exists in relation to other objects? Inductive and abductive exploration will consist in finding all the neighbors of each given gene. "Neighbor" has here the largest possible meaning; it is not simply a geometrical or structural notion. Each neighborhood sheds a specific light on a gene, its function being sought by bringing together the objects of the neighborhood. A natural neighborhood is proximity on the chromosome. Another interesting neighborhood is similarity between genes or gene products. Genes can have a similar codon usage bias: this is another neighborhood, as is similarity in molecular mass or isoelectric point. Also, a gene may have been studied by scientists in laboratories all over the world, and it can display features that refer to other genes: its neighbors will then be the genes found together with it in the literature (genomics "in libro", which will require the construction of new software able to extract information from articles automatically). We do not possess heuristics permitting direct access to unknown functions, and this should make clear to us that in silico analysis will never replace validation in vivo and in vitro: let us hope that the propagation of erroneous assignments of function by automatic interpretation of genome texts will not hinder discoveries. Knowing genome sequences is a marvelous feat, but it is the starting point, not the end.
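
A minimal sketch of this notion of multiple neighborhoods (all gene names and relations below are invented placeholders) is simply a set of labelled edge lists that can be queried gene by gene:

```python
from collections import defaultdict

# each relation is a set of undirected edges between gene names
neighborhoods = {
    "chromosomal": {("geneA", "geneB"), ("geneB", "geneC")},
    "similarity": {("geneA", "geneD")},
    "codon_usage": {("geneB", "geneD"), ("geneA", "geneC")},
    "literature": {("geneC", "geneD")},
}

def neighbors_of(gene):
    """All genes related to `gene`, grouped by the kind of neighborhood."""
    result = defaultdict(set)
    for kind, edges in neighborhoods.items():
        for a, b in edges:
            if gene == a:
                result[kind].add(b)
            elif gene == b:
                result[kind].add(a)
    return dict(result)

print(neighbors_of("geneA"))
```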

This global view of genomes means that it becomes impossible to study genes in isolation. Once genomes are known, it is necessary to understand the collective behavior of gene products. This means expression-profiling experiments both at the level of transcripts and at the level of protein products. In parallel, the system dual to the gene system is the metabolite system: one must monitor the fate of all metabolites in the cell as a function of the genetic and physico-chemical environment of the cell. This corresponds to the development of "functional genomics" techniques: all require highly evolved mathematical and statistical approaches, and they are often grouped under the fairly general (and vague) name of Systems Biology.


