A
first account has been proposed in "On Genomes and
Cosmologies"
© MIT Press, 1996 ,
see also "Mapping the bacterial cell architecture into
the chromosome "
where a first identification of the geometrical programme superimposed
on the genetic programme is described
While the word « information »
is currently used in Biology, its meaning is not accurately defined.
This widespread use nevertheless emphasises the need to add a
fifth entity to the four entities considered in classical physics
to account for Reality, matter, energy, space and time, which
are associated together in the remarkably concise equation proposed
by Albert Einstein, E=mc2. While not compatible
with classical physics, Werner Heisenberg's indeterminacy principle, Δx Δp
≥ h/4π, introduces information via
« lack of information ». In a nutshell, I contend that
we are at the dawn of a new era in Natural Sciences, where information will
play an ever increasing role as we will better understand and
model the concept. The core of our future exploration will be
to try and understand how information is articulated with matter,
energy, space and time. This view implies a considerable change
in the placing of Biology in the Auguste Comte's hierarchy of
sciences, according to increase in information, and progressively
less influence of matter, energy, space and time:
Genomes are analyzed
as linear sequences of symbols written with a limited alphabet.
Information, a generally poorly defined concept (see The
Delphic Boat), is central to the alphabetic metaphor
on which rests genomics. This concept goes back to the Aristotelian
discussion of the Being as a relationship between substance,
matter and form. Substance is the being-in-itself, in which
one can distinguish a determinable, matter, and a determining,
form. The differentiation between individual entities was usually
ascribed to specific features of form, and the study of causes
asked for a special process 'informing' matter. Thus, information
is not only related to the specificity given by form and
DNA sequences are related to a very abstract conception of
form, but also, implicitly, to causality. This may explain
many of the wrong paths which have been followed when using
the corresponding concept of information. In ordinary language,
there is something quite different when the word information
is used: information means some kind of usefulness brought
about by a given piece of knowledge on a system. Information
is therefore also implicitly associated with the idea of value.
We have to bear this observation in mind when considering the
various theories of information which can be used to represent
living organisms, and in particular, to represent DNA sequences.
Invention of techniques permitting DNA sequencing
has brought about a revolution in our way to consider genomes
as a whole. Initially, most studies of DNA sequences dealt
with a local analysis of the DNA information content. They
took into account, however, the availability of homologous
structures or functions from different organisms, and thus
provided some general view of global properties of genomes.
In a more scattered fashion, and, above all, more recently,
it has been thought that genomes should be considered as global
entities, and we follow this global view. We wish to outline
what could be done in the future, emphasizing the fact that
computers can be used as experimental tools, generating a new
source of investigation of living organisms, their study in
silico (in complement to in vivo or in vitro).
Go to Models
and Techniques (skip
mathematics...)
Sequence acquisition has been exponentially growing
for twenty five years. In parallel, the power of computing
has also increased in an exponential fashion, and it has been
possible to foresee treatment of biological data that would
have been impossible just a few years ago. Because DNA and
protein sequences can be seen as linear texts of an alphabetical
writing, they pertain to the theories of information, which
have precisely the study of such texts as privileged objects.
In what follows we develop ideas showing that the alphabetic
metaphor and the corresponding treatment of information is
likely to have a very deep meaning in terms of genomes. This
will require a progression in the conceptualisation of information,
and, because biologists are not always familiar with the underlying
mathematics, we give rough principles so that they can follow
the reasons for progressing to improve our concepts of information.
When considering living organisms, one often
casually refers to the "information" they carry,
namely in their genomes' DNA sequence. Following John Myhill's
terms, information is a "prospective character",
which refers to the identification or localisation of control
processes involving small amounts of energy, that result in
large changes in processes involving large quantities of matter
or energy. A very elementary, but useful and often used concept
of information has been derived by Shannon. This author defined
the information of a collection of digit sequences transmitted
through an electromagnetic wave channel submitted to some noise,
by comparing the sequences before and after transmission. The
information concept invented by Shannon is a very crude view
of a message, explicitly considering not its meaning but
the local accuracy of its transmission. Genomes can be
analyzed using this view, especially if one considers only
the information carried out during the process of replication,
leaving aside the information involved in gene expression (i.e. replication
does not see the meaning of the message). This concept
of information can bring in the limelight interesting properties,
but one should always bear in mind the fact that the actual
information content of a sequence must be much richer (it has
a meaning), unless one only considers the information seen
by the replication machinery. We shall see later other ways
of considering the information content of sequences, where
the global signification is better taken into account.
Shannon's theory of information derives from
a subset of the general theory of measurable spaces. Given
a measurable space Ω (space of events) and a a
part of Ω , if ω is an event, chosen
at random, one would like to say that if we know that ω belongs
to a, then this means that we have some information
on ω. Thus, information brought about by asserting ω ∈ a,
is noted H(a). H (Shannon's information) is defined
in probabilistic terms as follows : H(a) = (pa),
where pa is the probability that the information
on a is true (the smaller a the easier the location
of ω) ; is supposed:
(1) to be positive and decreasing, and such
that a random assertion having equal probabilities to be
true or false gives a unit of information : (1/2)
= 1 (this is a natural condition for normalisation).
In addition:
(2) is such that if A and B are
two partitions of Ω and if the assertions ω ∈ A and ω
∈ B are independent, then H(A ⊗ B) =
H(A) + H(B).
Finally:
(3) one adds the natural condition that (1)
= 0, meaning that if an event is known to be true for
certain (ω belongs to Ω is trivial),
this is equivalent to give no information on it.
A consequence of this definition of information
is that if Ω is split into a number of n equivalent
separate subsets Ωn one has,
after property (3) of H:
H(Ωn) = n ×
(1/n) ×
(1/n) = (1/n)
This can be used as a starting point to show
that function is:
∀ p
∈
(0, 1) (p) = log2(p)
More generally, let i ∈
I be a random variable of probability law pI (where pI =
{pi | i ∈
I}), then Shannon's information becomes:
H(pi) = ∑
{pi log2 pi | i ∈
I}
It can be seen easily that such a definition
of information is highly schematic and far from being able
to describe the properties of what the common sense would place
into "information". Indeed, a more elaborate view
would indicate that to consider an event in isolation is meaningless
in most actual cases: one needs for instance to possess a means
to recover the considered information. Also, when one considers
life, the present approach of information does not take semantics
into account, as would be necessary if one considers gene expression.
Remarkably, another way to consider the information
carried in a message, when it corresponds to an instance of
a family of messages considered as similar, leads to the
same equation. If pr is the probability
of a message when received, and ps the probability
of this message when it is sent, then the information corresponding
to the quality of the message is: H = log2 (pr/ps).
If one does not take into account the loss of information during
transmission (message transmitted without error) then pr is
equal to unity. And if there is a large number of similar messages
the corresponding information is H = - ∑
ps log2 ps. One notes
here that the equation obtained has still the same form as
in the preceding description.
In the case of DNA, what is taken into account
is the probability of a message, that, in the usual description
of genomes, reduces to the actualization of a succession of
letters (chosen among A, T, C or G) at a given position of
the genome text. The specific information which is analyzed
corresponds to the difference between an actual collection
of DNA sequences, as compared to a random sequence having the
same base composition. An evident improvement would be to consider
not the one letter successions of the programme, but the relevant
words (sequences of letters of definite length). The different
approaches involving Markov process analysis of the genome
text are a step in this direction. However, this assumes that
one knows the ones that are relevant! It seems therefore that
analysis of the genomic text using classification methods is
a very important prerequisite, for an in depth analysis of
its information content.
We have somewhat developed the mathematical aspect
of Shannon's information, because we wanted to display the
logarithmic law it leads to. Indeed, the form of this law is
similar to that of another law, that has been discovered in
the domain of thermophysics. And this coincidence has had unfortunate
conceptual consequences, when information was related to life,
because the ideas of the latter scientific domain contaminated
the ideas of the former one, as we shall now see (see also Yockey,
1992).
It is quite usual to read that there is identity
between a form of thermodynamic entropy and the information
carried by a given process (but see the analysis by Zurek, 1989).
As a matter of fact Shannon's information is often named Shannon's
entropy, but this was not the choice made by Claude Shannon!
He wrote for example:
" Information theory has, in the last
few years, become something of a scientific bandwagon.
Starting as a technical tool for the communication engineer,
it has received an extraordinary amount of publicity in
the popular as well as the scientific press. In part. this
has been due to connections with such fashionable fields
as computing machines, cybernetics, and automation; and
in part, to the novelty of its subject matter. As a consequence,
it has perhaps been ballooned to all importance beyond
its actual accomplishments. Our fellow scientists in many
different fields, attracted by the fanfare and by the new
avenues opened to scientific analysis, are using these
ideas in their own problems. Applications are being made
to biology, psychology, linguistics, fundamental physics,
economics, the theory of organization, and many others.
In short, information theory is currently partaking a somewhat
heady draught of general popularity. And although this
wave of popularity is certainly pleasant and exciting for
those of us working in the field, it carries at the same
time all element of danger. While we feel that information
theory is indeed a valuable tool in providing fundamental
insights into the nature of communication problems and
will continue to grow in importance, it is certainly no
panacea for the communication engineer or, a fortiori,
for anyone else. Seldom do more than a few of nature's
secrets give way at one time. It will be all too easy for
our somewhat artificial prosperity to collapse overnight
when it is realized that the use of a few exciting words
like information, entropy, redundancy do not solve all
of our problems ".
This is a consequence of the remark that there
is identity between the form of Shannon's law H, and
a form of entropy described in terms of statistical mechanics,
in particular when describing the specific case of the perfect
gas. However, identity in the form of a mathematical description
of a process is by no means a reason sufficient to identify
the underlying principles, or the very processes with each
other. Moreover, several words such as information, entropy,
or order, have been used in many different contexts, meaning
in each context something very precise, but usually very different
from what it means in another context (see The
Delphic Boat, for more). Because the ideas of information
and order are often used in biology and may influence the exercise
of the corresponding knowledge, it seems useful to explore
some of the underlying reasons that have permitted authors,
in several instances, to identify the negative value of information
and entropy.
Indeed, the reason underlying the frequent identification
between information and entropy is ideological in nature, and
corresponds to a mistaken identification of disorder and entropy,
coupled to an unsaid view of the world as spontaneously evolving
towards a "bad" ordering. The underlying assumption
is that the world is constantly going towards an unavoidable
disorder, a fact against which we should fight (see Schrödinger's What
is life? and the foreword to its French translation). The
starting point of such pessimistic views of the world comes
from the mechano-statistical model of thermodynamics, founded
upon the very specific and restricted case of the perfect gas.
In this model, gas atoms are hard, impenetrable spherical points
which interact by pure elastic collisions. The popular description
is then that of a container made of two independent chambers,
containing different gases and separated by a rigid wall. At
the start of the process a small hole is drilled through the
wall. It is subsequently easy to observe that the container
evolves in such a way that after some time there is an equivalent
mixture of both gases in both chambers, each gas having a tendency
to occupy the volume from which it was absent, till a final
equilibrium where the partial pressure of the gas is equal
in both chambers, an equal number of gas atoms coming from
one chamber to the other one and vice versa. If one wanted
to prevent homogenisation it would be necessary to place a
little demon at the hole, with the control of a door permitting
him to close the door when the wrong gas is going to the chamber
it does not belong to. One then asserts that the action of
the demon corresponds to the knowledge of an information on
the gases nature
This is thought to be enough to justify
the identification between information and entropy. Many technical
objections have been given to such a primitive description,
but one is compelling: in this way of accounting for a system,
its entropy depends on its level of description. However entropy
can also be defined as a macroscopic physical entity linked
to the total energy of a system and its temperature. It is
therefore independent of the model. Thus, the identification
between entropy and disorder is disputable (let us consider
a two-dimensional representation: a blue gas in a square chamber
is separated from a yellow one. Drilling a hole in the separation
line will yield a green rectangle through mixture of the blue
and yellow gases. Is the green rectangle more disordered than
the juxtaposition of a blue square to a yellow one?). In addition,
identification between entropy and information is also misleading,
because one can immediately notice here that information is
not only a qualification of a state of a given system, but
that it requires both some knowledge of the system and of the
observer. Like order, information is a concept that is relative
and must take into account several systems (see The
Delphic Boat, Chapter 4).
In fact Shannon's measure of entropy does only
reflect a mathematical property of statistical analysis, as
discussed below. For the moment however, limiting Shannon's
information to what its mathematical form assumes it is, it
can be used for a first description of DNA sequences information
content. This first level of local information can be used
for the identification of coding regions in DNA. A more evolved
use of Shannon's information is also present in the analysis
of gene sequences using periodical Markov chain analysis, as
demonstrated by Borodovsky and co-workers. A consequence of
this type of work is the obtaining of large collection of gene
sequences from a given organism. This brings us to the use
of informatics in the analysis of whole genomes.
A genome is the result of a historical process
which has driven a variety of genes to function collectively.
Many biologists using computers have experienced the following.
When one scans through a sequence library for similarity with
a query sequence, one often ends up with a list of sequences
that have only in common the organism to which the initial
sequence belongs. This is as if there existed a style specific
of each organism, independent of the function (an image proposed
by P. Slonimski is that of columns in a temple, which can be
of Corinthian, Dorian or Ionian style, having however always
the same function). Is this impression true? This indicates
that there must exist some sort of mutual interaction permitting
the cell to discriminate between self and non-self. It seems
therefore interesting to investigate the meaning, at the lowest
possible level, of what could be mutual information.
In Shannon's terms, it is natural to define mutual
information between pairs of events. This is a function that
is always less or equal to the information that one possesses
on individual corresponding events.
H(pIJ ; pI . pJ)
= H(pI) + H(pJ) H(pIJ)
= ∑
{pij log2 (pij /pi . pj)
| i ∈ I, j ∈ J}
= ∑
{pi . pj (pij /pi . pj)
log2 (pij /pi . pj) | i ∈
I, j ∈
J}
= ∑
{pi . pj (pij /pi . pj)
| i ∈ I, j ∈ J}
Now, in usual statistics, there exists another
means to evaluate correlation between events, using a specific
metrics known as the χ2 test.
This metric is meant to evaluate the mutual distance (this
can be interpreted as some mutual information) between two
laws. It is designed in such a way that the mutual distance pIJ is
equal to pI . pJ when all the
knowledge about the association of the two laws is the result
of the knowledge about each one:
||pIJ pI . pJ ||2 = ∑ {(pij
pi . pj)2/pi . pj |
i∈
I, j ∈ J}
= ∑ {(pi . pj [(pij /pi .
pj)2 1] | i ∈ I, j ∈ J}
= ∑
{pi . pj (pij /pi .
pj) | i ∈ I, j ∈ J}
knowing by definition of the laws pI (normalisation)
that the sum of the product pi . pj as
well as that of the mutual law pij is equal
to 1.
One can then remark that the final expression
has a general form similar to the form of the Shannon's mutual
information law: one has simply to exchange the form of the
distance function involved in the χ2 paradigm,
where it is x2-1, for the form in Shannon's
law where it is xlog2x. Both functions have
a very similar behaviour around x0 = 1: their
value, their first derivative as well as their second derivative
are equal. In fact, if one proposes to measure mutual information
by various plausible laws (mutual "distances"), one
is led to the conclusion that the envelope of all such laws
is Shannon's information function. This has the consequence
that the function xlog2x is the natural function
that should be used in most cases. But this conclusion can
also be extended, implying that it is natural to find this
same function in many situations otherwise unrelated: when
finding that the representation of a phenomenon yields a function
of the form xlog2x one should not infer that
the analogy extends to the nature of the phenomena described
by this same function!
Methods using χ2 are
numerous (they can even be used in learning techniques for
the partition of training samples). Because it is very powerful
but rarely used in biology we shall only refer to one such
method, Correspondence
Analysis (CA), which makes use of χ2 distances
for classification of objects without a priori knowledge of
the classes. The underlying idea is that one can construct
pertinent classes where objects do not obligatorily share simultaneously
all of the characters defining the objects in a given class,
and nevertheless can be said to belong to the same class because
they share most of the characters defining a class. Such an
analysis permits one not only to build up classes, but also
to build up trees describing the relationships between classes.
This permits one under some conditions to build up hierarchical
clusters, reminiscent of phylogenies. CA has been used to study
codon usage in individual genomes. In the absence of any other
knowledge, coding DNA sequences (CDS)
can be described by the usage of codons specifying each amino
acid residue of the polypeptide they encode. Accordingly, each
CDS is represented as a point in a 61-dimensional space, each
dimension corresponding to the relative frequency of each of
the 61 codons. The set of CDSs is displayed as a cloud of points
in the space of normalized codon frequencies. Using the χ2 distance
between each CDSs pair, CA allows calculation of the 2-D projection
of the cloud of points yielding maximum scattering. On such
projection, genes that have a similar codon usage will appear
as neighbour (but the converse is not necessarily true). In
order to analyze this graphical representation in terms of
individual classes, it is necessary to use a second method
that automatically clusters the objects (here, the CDSs) which
are close to one another. For example, in a first step, one
splits the collection of objects into k groups by a dynamic
clustering method; then, in a second step, objects which are
always clustered together in the course of the different partition
processes, are selected. Analysis of E. coli CDSs revealed
that the hypothesis requiring the minimum assumptions is a
clustering into three well separated classes. The codon bias
is very strong in class II, intermediate in class I and weak
in class III. The proof that this partition is significant
comes from the observation that these three classes of E.
coli genes can also be clearly distinguished by their biological
properties. Class I contains those genes that maintain a low
or intermediary level of expression, but can be potentially
expressed at a very high level (e.g. the lactose operon).
In contrast, class II contains genes that are constitutively
expressed at a high level during exponential growth. Most of
these genes are involved in the translation and transcription
machinery, as well as in the folding of proteins. In class
III, the corresponding genes have been shown to be involved
in functions required for horizontal transfer of genes.
With these examples, we observe that investigation
at a first level of the information content of a genome is
rewarding. In particular, it tells us that the actual selection
pressure on the fine evolution of sequences inside genes is
visible when one possesses a large collection of genes for
analysis. Because this corresponds to a fine variation, such
as a general bias in nucleotide distribution in genes, and
comprise many genes at the same time, it becomes difficult
to be able to propose experiments to test whether the interpretation
of the observation adequately reflects reality. Indeed, it
must be difficult to submit a modified organism to a "normal" life-cycle
(i.e. a life-cycle that differs strongly from laboratory
conditions, including chemostats) and to test whether its DNA
has evolved according to the predicted consequences of a model.
In some cases however experiments have been performed which
can in answer some of the predictions: Yanofsky and co-workers,
for example, have submitted a culture of E. coli lacking
a functional mutT gene (the product of which prevents
AT to GC transitions) for evolution during a large number of
generations, and they have observed that, after some time,
the overall content of the DNA was significantly more GC-rich
than the parental strain.
In the examples just given the information content
of a sequence was considered mostly as a function of its local
value. A more appropriate way to consider the same information
would be to evaluate it as a function of its context, as we
shall see now.
A genome sequence is certainly far from being
random. But this is not only due to the fact that it actually
contains functional information (i.e. information used
for the definition of molecules that operate in metabolism
and cell construction), but also because the starting DNA molecules
were certainly not random. For this reason, given a sequence,
it is very difficult to estimate its actual probability: one
has to compare it to an unknown baseline. When studying the
frequency of words (such as TA, CCTTGG, for instance), one
has to estimate the probability of occurrence they would have
if the sequence was random. This does not mean that it can
compare to a purely stochastic sequence of letters of same
overall probability of occurrence: a way to overcome this difficulty
is to choose regions of the genome, such as coding sequences,
where it is known that the major selection pressure has been
operating only indirectly on DNA, using this as a means to
calculate a baseline. For example, one can study the frequency
of words overlapping codons (four or more letters) by comparing
the level of occurrence of a given word with its level of occurrence
if all the codons used in every CDS were taken into account,
but where their actual succession would be random. In this
case most of the bias comes from the style of the genome (which
can, in part, be reflected by a non-random frequency of dipeptides
in proteins, resulting in some contribution to the bias in
the usage of words). As an illustration; it can be observed
that among 4-letter palindromic sequences, CTAG is counterselected
in many organisms (except those where C is modified, such as
bacteriophage T4 where C is replaced by hydroxymethyl cytosine).
This seems to be due to a special conformation of the double
stranded helix when it carries such sequences of motifs, resulting
in some weakness (perhaps inducing spontaneous breaks). In
all other cases the words frequency reflects the specific style of
each genome.
Is it possible to go farther, and extract significant
global properties of genomes? Relationships between objects,
such as those which are of fundamental importance in biology,
should be taken into consideration. In particular, the very
nature of the information carried, from generation to generation,
in the DNA molecule that constitute genes is not at all present
in the analogy between entropy and Shannon's information. For
example "context",
"meaning" or "semantics" are, by construction,
absent from this extremely primitive model. Biology provides
us with a metaphor that displaces the idea of information towards
a new field, that of programming and informatics. Is there more
insight in these new fields than in the "natural" way
of considering information? At least since 1965, Kolmogorov and
Chaitin, following Solomonoff in the USA and the Russian school
of electronics engineers, have formulated the problem in details
in the following way (for a general description see Yockey,
1992). Let us consider the simple case of a chain of characters
such as those found in computer sciences. What can be said about
its information content? A way to consider the chain is to try
to reduce its length so that it can be accommodated in a memory,
for example using the minimum space, without altering its performance
when used in a program for example (at least in terms of accuracy,
if not in terms of time), in short, without losing its information
content. This is called compressing the data. This problem is
of a very broad and general interest. Because a chain of characters
can be identified with an integer, the universal Turing's computation
rules apply. It is in particular possible, given a chain, to
define the shortest formal program (in terms of Turing's
universal computation algorithms i.e. algorithms
that can be implemented on any machine operating on integers)
that can compress an original chain, or restore it given its
compression state. The information value of a chain S is therefore
defined in this model as the minimal length of the universal
program that can represent S in a compressed form.
With this definition it appears that a completely
random chain S cannot be compressed, implying that the minimal
program required to compress S is identical to S. It should
be noticed here that an apparently random chain may be compressible:
this is the case for instance of the chain formed by the digits
of the decimal writing of the number π, which can
be generated by a short program. In contrast, a chain made
of repetitive elements can be summarised as a very short program:
the sequence of the element, and the instruction "repeat".
In this context, the information of a sequence is defined as
the measure of its compressibility (one often uses the word "complexity" rather
than information, in this context). It is however apparent
here that this is not all of what can be said, should we wish
to define information in a finer sense. Indeed, the program
defining π is highly compressible, as is the program
generating the sequence 01010101
, but we would like to
say that the information imbedded in π is richer
than the one imbedded in 01010101
Represented as sequences of letters, genes and
chromosomes have an intermediate information (complexity) content:
one finds local repetitions or in contrast, sequences which
are impossible to predict locally. Their complexity is intermediary
between randomness and repetition. The apparent complexity
of sequences originating from higher eucaryotes or procaryotes
is very different, and this links genomic sequences to both
sides of the seemingly uninteresting fraction of information
(because it corresponds on the eucaryotic side to repetition
and on the procaryotic side to randomness). The complexity
of the former is more repetitive, and looks usually much lower
than that of the latter, which looks more random. This is quite
understandable if one remembers that bacterial or viral genomes
are submitted to stringent economy constraints implying that
they must remain very compact. In contrast genomes of higher
eucaryotes are extremely loosely constrained, and they contain,
for instance, many repetitive sequences.
However, would it not be more natural to say
that eucaryotes are more complex than procaryotes? We can propose
here that this difference in the form of algorithmic complexity
is perhaps the actual cause of the main morphological differences
between procaryotes and differentiated organisms. The former,
following their strategy for occupying biotopes thanks to a
rapid and versatile adaptive power, are forced to keep up with
a small genome. This implies that several levels of meaning
must often be superimposed in a given sequence (for example
promoters overlapping CDSs). And the apparent result of this
constraint is that their actual algorithmic complexity looks
more like that of a random sequence. But this constraint has
an important physical consequence (because DNA is not only
an abstract sequence of symbols, but also a molecule that must
occupy space): superposing signals on the same DNA stretch
generally precludes to combine synchronous recognition processes
by specific proteins, for it is not possible to place two different
objects at the same time at the same location. But, to exclude
such combinations (which alone can fulfil the logical principle
of third party exclusion) prevents investigating the refined
principles of a delicate modulation of gene expression The
exactly opposite situation is found in differentiated organisms
such as animals (and perhaps in the regions of procaryotic
genomes displaying a low complexity). The lack of limitation
in the length of the DNA text permits juxtaposition of recognition
signals by regulatory proteins, and allows exploration of the
properties of their association according to the rules of combinatorial
logic. This is likely to be what permits differentiated expression
as observed in the various cell types. Indeed, this is already
well understood for the regions controlling transcription (promoters,
enhancers and the like), but this is probably also the reason
for the multiplication of introns in higher eucaryotes, as
they behave as insets without apparent signification, in the
very coding regions of genes
Thus, analysis of information
in terms of algorithmic complexity permits us to propose explanations
accounting for the difference between major organisms types,
while Shannons information does only give a meaning to
conservation of a DNA sequence, with a low level of errors,
from generation to generation, through the replication machinery.
Many analyses of genomes indirectly pertain to
analysis of their algorithmic complexity, and this provides
an interesting view of the general organisation of base sequences
in chromosomes. But this takes genomes as fixed, synchronous,
entities. In general terms, higher eucaryotes seem to be much
more complex than bacteria, be it only because they are able
to exhibit time-dependent differentiation patterns. It appears
therefore that something more should be said about their information
content: information cannot be identified simply with algorithmic
complexity, so that Chaitin and Kolmogorov's information
only represents part of what we are interested in. In the examples
just provided the concept of information did not imply the
notion of value (depending both on the object considered
and on the nature of the problem that has to involve the object).
In fact, if one considers an information its actual availability
must have some value. An information which is in the reach
of everyone must differ from one which is difficult or long
to obtain. In the case of living organisms, knowledge of the
paths which have been followed through evolution leading to
the organisms as we know them today, are a very valuable piece
of information, difficult to obtain. Indeed, having access
to this type of information may require the introduction of
time, as we shall now consider.
We have seen that not only would we like to have
an idea of the complexity of a sequence (as evaluated using
the above approaches), but also of the availability of the
information contained in the sequence. This is particularly
relevant in the case of DNA sequences, because all that can
be inferred from the knowledge of a sequence derives from the
accessibility of its actual information content. To see more
repetitions in eucaryotes and more apparent randomness in procaryotes
does not account for the paths which have led to such a large
overall difference in genomes. We wish to know the usefulness
of the information in the sequence. A very short recursive
program can be written to write the digits of π and
it is possible to write a program having the same length, generating
a repetitive sequence. But the availability of the information
content of the outcome of the corresponding programs is very
different. Although it is easy to predict, after a very short
computation, the value of the nth digit of
the repetitive sequence, the same is not (yet) true for π.
In the latter case, if n is chosen large enough, a long
computation time is required until the digit is generated,
using the program permitting calculation of π.
This observation is essentially important when
one considers sequences generated by recursive programs.
In many cases, the writing of the program can be short because
the procedure that is reiterated is nested inside the program.
Therefore the information content of the sequence, in terms
of algorithmic complexity, is small, because it can be summarized
as a short program. But the information on a given digit could
be extremely tedious to obtain, and could be obtained only
after a very long computation time. In this respect the knowledge
of the precise value of this digit, the "information"
on it would seem to be very high, had we another quick means
to obtain it. Bennett has formalised this aspect of sequences,
in terms of "logical
depth". The logical depth of a sequence S is the minimum
time required for generating S using an universal Turing machine.
Obviously, knowing a sequence, it is not possible to know its
algorithmic complexity. It is even less obvious to have an idea
of its logical depth.
For this reason, the conceptual analysis made
above might appear to be only philosophical (and philosophical
is a bad word in English
but not in Latin or Greek civilisations!). Note that deep thinkers
such as the late Rolf Landauer (who was at the root of many important
concepts in information theories, demonstrated both experimentally
and theoretically that widespread
ideas about entropy and order, as well as creation of information,
are plainly wrong) is at the origin of the discovery that
creating information does not require consumption of energy.
Here, we can propose efficient means to analyze
both the algorithmic complexity and the logical depth, by proposing
algorithms meant to generate the sequence which is under analysis.
For this we must propose educated guesses, and our contention
is that, starting from an analysis of the underlying biological
processes, we can indeed propose constructive hypotheses. For
example, trying to represent how RNA polymerase recognizes
its promoter will permit us to generate an algorithm permitting
to identify promoters from sequences. Cases in point are the
algorithms proposed by d'Aubenton et al. for the recognition
of transcription terminators or by Michel and co-workers for
the recognition of group I introns. The same is true in the
case of logical depth: if this aspect of information carried
by DNA is relevant, then the algorithms needed for analysis
should require significant computing time for giving interesting
outputs. This is particularly true in the case of phylogenetic
analyses, and indicates that algorithms that perform too fast
are necessarily flawed with respect to the biological relevance
of their output. A positive side of this very negative observation
is that time adds a new dimension to sequence analyses: it
could be interesting to add to the features permitting comparisons
between sequences, the time required for the computation algorithms
performed on them, because similar sequences ought to perform
with similar computation times.
Logical depth relates to the Aristotelian distinction
between potentiality and reality. It rightly suggests that
one should not be allowed to speak in terms of potentiality
as if in terms of reality. There are cases when to speak of
potentiality is nonsense, because the actualisation of the
potential, in true reality, is impossible in a reasonable period
of time. Only those facts that are mechanistically generated
could permit such a gross identification between potentiality
and future reality. It must be stressed here that reflection
on this aspect of information does not completely exhaust the
non-clarified (natural) concept of information. In particular,
it rests on the existence of an abstract, but perfect, Universal
Turing Machine. In the case of living organisms
provided that one can identify their behaviour as that
of Turing machines
the building up of the machine is the result of the actualisation
of a finite program, present in the machine. But here one should
take time into account once again, and consider only those algorithms
the length of which is smaller than the lifetime of the machine
as a given, unaltered, structure. Indeed, a time limit, defining
the algorithms permitting to construct the machine, must also
be taken into account: it defines a critical depth, likely to
be of major importance in the case of life. We shall limit ourselves
to these description however to consider now the specific case
of living organisms.
Since the time when protein and nucleotide sequences
have been available (and this goes back to the early fifties
for protein sequences), scientists have attempted to compare
sequences to each other, and to align those sequences that
were supposed to match each other after having diverged from
a common ancestor. A paragon study of such comparison was the Atlas
of Protein Sequences and Structures developed by Margaret
Dayhoff and her co-workers. After thorough analyses of the
phylogenetic divergence in families of globular proteins such
as cytochromes or globins, this work permitted construction
of a correspondence matrix between amino-acids that reflected
some of the evolutive fate of each residue. Hundreds of articles
have been written proposing general and less general approaches
for creating alignment of sequences and generating the corresponding
consensus.
While it was relatively easy to align similar
sequences, it was often impossible to align very divergent
sequences unless some other related sequences were also known
and compared in parallel (multiple rather than pairwise alignment).
This resulted in the hunt for intelligent algorithms permitting
alignments and multi-alignments, and providing the corresponding
consensus of multiple sequences. However, it is clear that
the production of alignments requires not only production of
a matrix for correspondence between nucleotides or similar
amino-acid residues, but also a choice for the penalties assigned
to the gaps which are present in the sequences when they are
aligned. The use of dynamic programming can provide a first
set of aligned pairs. Many programs, derived from or created
independently from the Needleman and Wunsch algorithm, including
use of dedicated hardware, have been proposed. Unfortunately,
these programs cannot be used for large sets because of combinatorial
explosion. In contrast, looking for common words in a composite
sequence made of the concatenation of all sequences to align,
followed by a manual fine alignment where the unformalized
knowledge of the biologist is used can yield interesting results.
In this process matrices meant to represent similarities between
amino-acids can be used for protein alignment. It is interesting
to use matrices which take into account the constraints on
amino-acid residues which have been measured from comparison
of known 3D structures, and multiple alignments can also implement
knowledge from such structures. But it must be stressed that
a single matrix is used throughout the sequences, meaning that
one deals with a composite view of the equivalence rules between
amino-acids at each position in the sequence. This cannot represent
what happens in reality because at some positions for example,
it is the aromatic feature which is important (meaning that
F, Y and W are equivalent) whereas at other positions it could
be the fact that the residue is big (meaning that F, Y, R,
H and W are equivalent) so that one should use an equivalence
matrix for each position in the sequence. This indicates that
new research programmes must still be developed in this field,
despite the large number of papers already published. As
a consequence the consensus approach is bound to fail as soon
as sequence are significantlly diverging.
Thus, when significant numbers of sequences are
compared with each other one may wish to extract a significant
pattern not as a consensus sequence, but, rather, as a matrix
of probability reflecting the frequency of a given motif at
a given place. This is reminiscent of pattern recognition problems
which involve, for example, learning techniques or techniques
used in pattern recognition by vision. It is not unexpected
therefore that methods involving neural networks, or the precursor
of this new field of informatics, the Perceptron, have been
used for generating pertinent patterns. Stormo and co-workers
have pioneered the field by using a Perceptron approach for
creating matrices permitting identification of the ribosome
binding sites in E. coli messenger RNAs, and they have
been followed by many other authors both for the study of DNA
and protein sequences. In addition to the fact that it rests
on the hypothesis that one must find consensus matrices, a
main drawback of the Perceptron approaches, unfortunately also
present in many neural network learning techniques, is that
they involve formal "synapses" which have a transmitting
strength evolving in a quantitative fashion as a function of
their actual use. This means that, except in the cases where
there is obvious and strong similarity between objects in the
training set, the synapse strength fluctuates, going up and
down. As a consequence, when the set size increases the strength
of each synapse generally goes through an optimum value, and
then slowly goes back towards a more or less average value
as more exceptions invade the training set, thus loosing discrimination
power. Accordingly, as the training set of examples increases
in size, the actual efficiency of the consensus matrix created
by the learning procedure looses accuracy. Ancestors of neural
networks which evolve in such a way that the effective transmitting
capacity of a synapse goes irreversibly to zero when its value
falls below a certain threshold value, freezing the learning
state at an optimal value, have been proposed (Danchin,
1979). This feature has been recently implemented in neural
networks by (Horton
and Kanehisa, 1992). It would therefore be important to
compare and develop new approaches involving neural networks
with emphasis on the stability of their learning capacity as
a function of the training sets.
Other learning techniques rest on the explicit
construction of grammars, i.e. sets of structures and
rules permitting the complete description of a sequence or
a network of genes. They can be divided roughly into two classes,
learning by assimilation (i.e. by measuring conformity
to features of a training set), and learning by discrimination
(i.e. by constructing descriptors of the differences
between the sets of examples supposed to differ from each other).
An example of the first type is the Calm software, used for
identifying the calcium binding sites in calmodulin, for instance,
whereas an example of the second type is the Plage software
used in the identification of the relevant features of signal
peptides in E. coli. Both types are dependent on a grammar
which is used to generate descriptors that are used on the
training sets. In general, grammars are essential to describe
biological phenomena, and their efficiency rests heavily on
the actual biological knowledge of their conceptors, providing
therefore an interesting way to incorporate biological knowledge
into automated procedures. For this reason one should probably
always consider as the most pertinent ones, the grammars which
are devoted to a specific question. A case in point is the
work of Collado-Vides, who constructed a grammar for the identification
of regulated promoters in Escherichia coli.
The argumentation that has been just developed
is meant to be a substratum for the reflection on the metaphor
of program (algorithm) which is currently used in molecular
genetics. It is known that a giant molecule, DNA, made of the
sequential arrangement of four types of related molecular species
specifies entirely the structure and dynamics of any living
organism (adding its own constraints to a world where physics
and chemistry operate, permitting in particular to generate
compartmentalisation and metabolism). We shall not consider
here the problem of origins of such organisms, but only try
to investigate the nature of the relationship between the DNA
sequence and the building up of an organism. As in the case
of the reflection of Turing in Number Theory, one can surmise
that there must exist a machine to interpret the content of
a DNA sequence. This machine is a pre-existing living cell,
made of a large, but finite, number of individual components
organized in a highly ordered way. Given such a machine, a
fragment of DNA, provided it contains a well-formed sequence,
is necessary and sufficient to produce a specific behaviour
representing part or all of the structure and dynamics of an
organism. The appropriateness of the alphabetic metaphor used
to describe the DNA sequence is further demonstrated by the
fact that it is possible to manipulate on paper four symbols
A, T, G, C, and to organise them in such a way that, when interpreted
into chemical terms (i.e. linked into the proper sequence
arrangement into an artificial DNA molecule) they produce a
behaviour having at least some of the properties of real natural
biological entities. This is the basis of survival and multiplication
of viruses as well as of a success story in biotechnology:
it is possible to make bacteria synthesise a human growth hormone
that has been demonstrated on children to be as active as the
natural product.
That the machine is, in a way, independent of
DNA has also been further indicated by the fact that it is
possible to clone organisms by replacing the nucleus of a cell
by a new nucleus, resulting in the construction of a new organism
(i.e. both a new DNA molecule, and a new machine). However,
DNA alone is absolutely unable to develop into a living organism:
this is why the question of origin must be considered if one
wishes to resolve this chicken and egg paradox, through the
coevolution of the machine and of the program. It must be stressed
here that to apply the consequences of Turing's arguments,
it is in principle necessary that program and data be well
identified and separate entities. As stated above, biotechnology
uses this fact to permit synthesis of human proteins in bacteria.
But this corresponds to small genome fragments. Virus which
infect cells do the same to reproduce. Finally, at least as
a first approximation, bacteria behave as if separating data
(the cell machinery, excluding DNA) and program (the chromosome):
a fifth of the Escherichia coli chromosome is made of
DNA on which selection operates in such a way that it favours
and sustains horizontal exchanges not only between bacteria
of the same species, but also between distantly related organisms
(class III genes).
Now, if one considers the role of a DNA molecule
in a cell it is possible to describe the various functions
derived from its presence as calculable in an algorithmic way.
Many routines in the process are recursive (i.e. call
themselves, but must terminate). It is a generative process
which permits generation of both a DNA molecule and of the
machine and data necessary to interpret it. Although this has
often been called self-reproduction, it is production of an
organism that is similar if not identical to the starting organism.
One should speak therefore of self-reference. A living organism
is in this context the algorithmic actualization of a program,
the initiation of which has been obtained through a selective
process, keeping only actualizations which are stable in time.
And it should be noticed here that this means that selection
does not operate on the phenotype (data) nor on the genotype
(program) but on the unfolding of the algorithm, i.e. on
the organism itself, thus placing in a new light the usual
paradox facing all selective theories, and standard Darwinism
in particular.
In this respect it is possible to discuss the
information content of a DNA sequence or its complexity, using
the various definitions given above. At first sight, most if
not all of the sequence must be specified to generate the daughter
sequence. Things would be simple, and the corresponding information that
which is required for reproduction would even be rather
poor, if there were an absolute fidelity in replicating the
original. However, the rule of replication (A -> T, T -> A,
G -> C, C ->G) is not absolutely faithful, even if it
can be considered to result from an algorithmic process. Variations
are imbedded in the replication process, as a function of the
general environment of the organism as well as a function of
the local sequence of the DNA segment that is replicated. This
corresponds to an information that should be taken into account,
and that is very important indeed, as we shall seen later.
But it must be emphasized at this point that the existence
of error management results in specific features of the DNA
sequence. As already stressed CTAG sequences are generally
rare, and in E. coli this is compensated by an excess
in CTGG words. Spontaneous deamination of cytosines is likely
to be compensated by a bias in the fidelity of replication,
tending to create more GC pairs. In turn, this mechanism can
result in an overshoot, as seen in E. coli strains lacking
the mutT gene. Many other processes are certainly modelling
the overall GC content of genomes, resulting in a genome style
(in this case G+C content) which has been taken into account
by taxonomists for preliminary classification of organisms.
But such evolution can be much subtler. Chi sites are oriented
along the replication fork direction in E. coli, and
it is likely that, because replication is semi-conservative,
there exists some bias in regions where Okasaki fragments are
initiated (this should result in biases having a periodicity
of the order of a few kilobases). Another bias should be introduced
by the restriction modification systems present in bacteria,
as well as by the existence of methylated sites (as indeed
observed in a few cases). Finally, there should exist biases
in the DNA composition along genes corresponding to the fine
tuning of gene expression as a function of environmental conditions
and/or of location of the transcribed or translated regions
in the genes. Hénaut and co-workers have indeed observed
such biases as the enrichment in C+G of the third codon letter
as a function of the position of the codon in the genes sequence.
Despite their evidence these observations are difficult to
submit to experimental falsification, because this would require,
for example, to construct artificial genes and study their
evolution for a very long period of time in conditions that
may be very different from natural conditions (chemostats).
It becomes therefore important to make predictive tests using
computers. But to what extent can we make predictions?
Gödel has demonstrated that recursive processes
have the very interesting property to be open, in such a way
that there exists propositions in Natural Number Theory that
can be generated from a corpus of axioms and definition, and
that are not decidable in the original corpus. Gödel's
demonstration is very ingenious, and it uses a coding process
that is certainly reminiscent of the coding process known as
the genetic code. However this demonstration has a specific
feature: the aim of the demonstration, permitting Gödel
to generate his famous "sentences", is known beforehand.
The situation faced by living organisms is different: they
cannot predict the future, they can have no aim except for
(re) producing themselves. Our hypothesis will be here that
the variation provided in the replication and recombination
machinery, permitting the generation of new DNA sequences is
the very process that permits living organisms to generate
the equivalent of Gödel's sentences in terms of new, deterministic
but utterly unpredictable organisms. The process involves both
a procedure for generating variants, and a procedure permitting
accumulation of "adapted" variants. The latter is
the so-called natural selection, which must be understood as
a process screening stability (and not fitness!) of the surviving
organisms. According to this view the spencerian "survival
of the fittest" is therefore but a surviving variant of
the "instructive"
lamarckian thought.
If this is accepted, one is faced with the puzzling
observation that to evaluate the actual information content
of a DNA molecule, one has to trace back its history, thereby
asking for an extremely slow procedure: the information content
of a DNA sequence is so deep (in Bennett's terms) that it would
require the age of life on Earth to evaluate its true complexity.
Bearing this in mind we should be very careful and modest when
trying to predict the future of living organisms. In addition,
one observes that the process of natural selection associated
with generation of mutations provides a "blind"
procedure for generating the equivalent of Gödel's sentences
(in fact, Gödel was somewhat cheating, because he knew what
he wanted to obtain, which living organisms cannot do). This
process also generates many more sentences of a type the existence
of which we are unable to predict (but the existence of which
we can conjecture, simply because at least one type Gödel's
type does indeed exist). This corresponds to the truly
creative phenomena that are derived from living organisms.
Genome analysis will yield major insights into
the chemical definition of the nucleic acids and proteins involved
in the construction of a living organism. Further insight comes
from the chemical definition of the small molecules that are
the building blocks of organisms, through the generation of
intermediary metabolism. And, because life requires also in
its definition the processes of metabolism and compartmentalisation,
it is important to relate intermediary metabolism to genome
structure, function and evolution. This requires elaboration
of systems for constructing actual metabolic pathways and when
possible, dynamic modelling of metabolism, and to correlate
pathways to genes and gene expression. In fact, most of the
corresponding work cannot be of much use because the data on
which it rests have been collected from extremely heterogeneous
sources, and most often are obtained by in vitro studies. The
initial need is to collect, organize and actualise the existing
data. In order to be effective and lasting, collecting the
data should proceed through the creation of specialized
databases. To manage the flood of data issued from the
programs aiming at sequencing whole genomes specialized databases
have been developed. They make it possible not only to bypass
a meticulous and time-consuming literature searches, but also
to organize data into self-consistent patterns through the
use of appropriate procedures which are aiming at the illustration
of collective properties of genes or sequences. In addition
to sequence databases, it has then become important to create
databases where the knowledge progressively acquired on intermediary
metabolism could be symbolised, organized and made available
for interrogation according to multiple criteria.
Using organized data, it will become possible
to make in silico assays of plausible pathways or regulation
before being in a position to make the actual test in vivo.
Such well-constructed databases will also permit investigation
of properties of life which go back to the origin of life,
placing biology in a situation which is not unlike that of
cosmology.
Fortunately, in silico analysis permits
one to organise knowledge. To generate new knowledge, why not
explore neighborhoods of biological objects, considering genes
as starting points, stressing that each object exists in relation
with other objects? Inductive and abductive exploration will
consist in finding all neighbors of each given gene. "Neighbor" has
here the largest possible meaning. This is not simply a geometrical
or structural notion. Each neighborhood sheds specific light
on a gene, looking for its function as bringing together the
objects of the neighborhood. A natural neighborhood is proximity
on the chromosome. Another interesting neighborhood is similarity
between genes or gene products. Genes can have similar codon
usage bias: this is another neighborhood, as is the similarity
in molecular mass or isoelectric point. Also, a gene may have
been studied by scientists in laboratories all over the world
and it can display features that refer to other genes: its
neighbors will be the genes found together with it in the litterature
(genomics "in libro", which will require construction
of new software apt to extract information from articles automatically).
We do not possess heuristics permitting direct access to unknown
functions, and this should make clear to us that in silico analysis
will never replace validation in vivo and in vitro:
let us hope that propagation of erroneous assignments of functions
by automatic interpretation of the genome texts will not hinder
discoveries. Knowing genome sequences is a marvelous feat,
but it is the starting point, not the end.
This global view of genomes means that it becomes
impossible to study genes in isolation. Once genomes are known
it is necessary to understand the collective behavior of gene
products. This means expression-profiling experiments both
at the level of transcripts and at the level of protein products.
In parallel, the dual gene system is the metabolite system:
one must monitor the fate of all metabolites in the cell, as
a function of the genetic and physico-chemical environment
of the cell. This corresponds to the development of "functional
genomics" techniques: all require highly-evolved mathematical
and statistical approaches and are often grouped under the
fairly general (and vague) name Systems Biology.
Systems Biology rests essentially on concepts that were created
during the XIXth century, based on the ideas of homeostasis,
feedback, feedforward and a combination of those. This results
in analysis of the behaviour of cells and organisms as networks
regulated in a variety of ways, displaying several kinds of stable
or unstable behaviours, with emphasis on multistable (in fact
often bistable) equilibria.
The analogy developed into the idea that rather than analyze
existing systems it might be interesting to reconstruct them,
taking into account the principles of electronics. This became
a new avatar of the Biologie Synthétique, dreamed by Stéphane
Leduc, a Synthetic Biology, based on the electronic metaphor
of gene expression. We prefer to think of a new way of considering
biology, where cells and organisms are seen as computers making
computers, in a novel Symplectic
Biology.
A journal devoted
to this topic has been created, with challenges
defying scholars, to see how fast we will be able to answer questions
that must be at the root of any effort endeavouring to reconstruct
life, and to construct different forms of life (orthogonal life).
|