Ani Papyrus. Chapter LXXXVII (tr. Budge)
For a Symplectic Biology
Caveat: the content of the French and English versions may differ, as the presentation of concepts depends heavily on the structure of the language in which they are expressed.
Since the time of Claude Bernard, who advocated placing biology in the domain of physics and chemistry, physics has undergone a considerable revolution, marking the end of Classical Physics. The creation of Quantum Physics revealed a profound inadequacy of the close links between the four basic entities (traditionally called « categories », following Aristotle) that make up reality: Mass, Energy, Space and Time, summarised in the remarkably concise equation written by Henri Poincaré and later Albert Einstein, E = mc². Indeed, the quantum revolution initiated in 1905 introduced a new description of reality based on a mathematical formalism that differed completely from the one that founded Classical Physics. The separation was so profound that Albert Einstein himself argued that resolving the contradictions between these two ways of representing Nature implied the existence of « hidden variables » that would allow us to maintain the previous basic paradigm. We now know that all experiments designed to discover these hidden variables have failed. Thus, Werner Heisenberg's inequality Δx Δp ≥ h/4π makes sense. It indirectly introduces, in addition to the four categories making up reality (Mass, Energy, Space and Time), a fifth one, Information: it is indeed a lack of information that explains the indeterminacy of the position and momentum of individual quantum entities, which are otherwise characterised by their properties within the paradigm of the four standard categories of Classical Physics.
We propose here to consider information as physical and central to understanding life. In addition, we now contextualise the relationships within the living organism between the four classical categories (which form the core of the cell's architecture and dynamics) and the site where information is deployed. The latter forms the genetic program with its carrier, the DNA molecule, constituting the organism's genome. To help readers better understand this point of view, I have chosen to recall an old paradox, posed to the Athenians by the ageing ship of Theseus. They asked the question: « Theseus' ship is made of wooden planks, which are replaced as they decay; after a while, all the planks have been changed. Is this still the same ship? » Yes, indeed; and what has not changed is the organisation of the boat, the relationships between the planks, the shape of the boat... And this shape cannot be the consequence of any property of Nature within the domain of the four categories of Classical Physics.
We see here that, in the category of Reality that shapes the boat, information is a central part of the object of interest, in addition to matter. The matter of the boat can even be changed: pine can be replaced by oak to make it age more slowly, and, provided the planks are well adjusted (and not too heavy), the boat will keep its function of floating and transporting passengers and goods, demonstrating that the properties of the matter used here are not specific to the nature of the boat (it could even be made of metal or cement). This is typically one of the endeavours of those who try to change some of the basic building blocks of life (such as amino acids or nucleotides) in some of the efforts of Synthetic Biology.
We all use the word « information » over and over again. This is true in particular when we speak about the genetic program and the processes of information transfer. But do we know exactly what this word conveys? Information is one of those « prospective » concepts that change while they are discussed (Myhill 1952). The first quantitative mathematical description of the concept came from the work of Claude Shannon, when he analysed the limits of communication of strings of symbols. This theory of communication was only meant to measure whether the integrity of a message was preserved during transmission, without taking its meaning into account. In the biological context, this is exactly what replication does, when a DNA molecule is duplicated into an (almost) identical molecule.
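Shannon's measure can be sketched in a few lines of Python. The point of the sketch is precisely the limitation noted above: entropy quantifies the statistics of the symbols, not what the message means; the sequences below are arbitrary illustrative strings, not real genomic data.

```python
from collections import Counter
from math import log2

def shannon_entropy(sequence: str) -> float:
    """Shannon entropy, in bits per symbol, of a string of symbols."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Replication transmits a DNA message faithfully whether or not it
# "means" anything: entropy only reflects symbol frequencies.
uniform = "ACGT" * 25          # all four nucleotides equally frequent
biased = "AAAAAAAAAC" * 10     # strongly biased composition
print(shannon_entropy(uniform))  # 2.0 bits per symbol, the maximum for 4 symbols
print(shannon_entropy(biased))   # well below 2 bits per symbol
```

A random text and a meaningful text of the same composition give the same value, which is exactly why a richer notion of information is needed.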
Information is obviously a much richer concept. In 1953, the physician Henry Quastler realised the importance of information theory and coding in molecular biology. His interest was, however, essentially driven by the question of the nature of the brain and of learning, memory and consciousness. Together with the physicist Hubert Yockey, he organised a Symposium on Information Theory in Biology, where topics in molecular biology were discussed. In a short work published posthumously, Quastler further developed a theory of biological organisation, starting with the enigma of the origin of life. The central point of his essay The Emergence of Biological Organization is the emphasis placed on the problem of the creation of information in simple cells. This will be the central question we tackle here. In 2018, the Fondation Fourmentin-Guilbert attempted to revisit this pioneering conference, exploring whether cells could be considered as computers making computers.
Most investigators at the time (and still many today) thought intuitively that creation of information required energy (for a historical discussion, see The Delphic Boat), in such large amounts that what we observe in life appeared quite mysterious, as one typically expected a cost of kT ln 2 of energy per bit of information. Contrary to intuition, this is not so: as demonstrated by Rolf Landauer, computation, and hence creation of information, can be performed reversibly, so that the process does not in principle require energy; it is the erasure of information that carries an unavoidable energy cost. This remarkable property of the physical world has the consequence that accumulation of information by living organisms is not a paradox and can be associated with explicit molecular processes. I try here to show how we can progress in this direction.
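To give a sense of scale, Landauer's bound can be computed directly. This is a minimal numerical sketch, using only the exact SI value of the Boltzmann constant; the choice of 310 K as a physiological temperature is an illustrative assumption.

```python
from math import log

# Landauer's bound: erasing one bit dissipates at least kT ln 2 of energy.
# Reversible operations (copying, computing without erasure) have no such floor.
k_B = 1.380649e-23  # Boltzmann constant in J/K (exact value, SI 2019)

def landauer_bound(temperature_kelvin: float) -> float:
    """Minimum energy in joules dissipated when one bit is erased at temperature T."""
    return k_B * temperature_kelvin * log(2)

# At a physiological temperature of about 310 K the bound is tiny:
print(landauer_bound(310.0))  # about 3e-21 J per bit erased
```

Even this minute cost applies only to erasure, which is why the accumulation of information by cells need not be paid for bit by bit in energy.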
The parallel works of Kolmogorov in the Soviet Union, and of Solomonoff and Chaitin in the USA, aimed at proposing definitions of information that would capture some of its content, not in terms of communication but in terms of signification. This required telling a random sequence from a meaningful one. The concept of algorithmic complexity defines a sequence by the shortest algorithm needed to generate it: with this definition of sequence compression, a random sequence is said to have maximum algorithmic complexity (it cannot be compressed to a length shorter than itself), while a repeated sequence is of low complexity. A further exploration shows that much more can be said about the very nature of what information is. In 1988, the physicist Charles Bennett created the concept of logical depth, based on the remark that two sequences with the same algorithmic complexity may differ widely in the way they carry information. In a repeated sequence (a fairly uninteresting case), for example, the information of the nth symbol, even with n large, is obtained in a straightforward way. By contrast, in sequences that are the result of a recursive algorithm, such as the sequence of the digits of π, one often cannot infer the nature of the nth symbol, when n is large, without running the algorithm, and this can take a very long time. Bennett proposed that the time required to gain access to the corresponding information measures the « logical depth » of the sequence. Examples of the non-trivial features of algorithmically simple but logically deep sequences are the outcomes of algorithms generating fractal figures such as the Koch snowflake or the Mandelbrot set. Both these remarkable figures are generated by fairly short algorithms, but it is not possible to predict easily the colour of a pixel until the algorithm has been run. In the case of genomic sequences, the very fact that DNA comes from DNA, comes from DNA...
through many generations suggests that any nucleotide in a sequence is logically deep. This indicates that there is no such thing as « junk » DNA. It is a further indication that information cannot be derived from the four categories of Nature (matter, energy, space and time), but is a category in itself. This also shows that we need to develop deeper formulations of what information is (see an intuitive attempt to define critical depth).
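The distinction between a repeated and a random sequence can be made tangible with a standard compressor, whose output size is a computable upper-bound proxy for algorithmic complexity. This is only a sketch: the sequences are arbitrary, and note that compression says nothing about logical depth, since both sequences below decompress almost instantly.

```python
import os
import zlib

# Compressed size as a rough, computable proxy for algorithmic complexity:
# a repeated sequence has a short generating rule and compresses well,
# whereas a (pseudo)random sequence is essentially incompressible.
def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, level=9))

repeated = b"ACGT" * 2500        # 10 000 bytes, low algorithmic complexity
random_seq = os.urandom(10000)   # 10 000 bytes, maximal algorithmic complexity

print(compressed_size(repeated))    # a few dozen bytes
print(compressed_size(random_seq))  # close to 10 000 bytes
```

A deep sequence, by contrast, would be one whose compressed form is short but whose decompression (in the general sense of running the generating algorithm) takes a long time, which no off-the-shelf compressor measures.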
Information is at the root of much of biology (and in silico biology demonstrates its importance), and it may even be a category directly manipulated by living organisms: analysis of the exploration of space by male insects looking for a female involves a process named infotaxis by Massimo Vergassola, which computes ways to reach the source of a specific pheromone in a highly turbulent environment, using information as the driving element permitting identification of the target.
Ubiquitous functions for life
Life can be defined as combining two entities that rest on completely different physico-chemical properties and on a particular way of handling information. The cell, first, is a « machine » that combines elements quite similar (although in a fairly fuzzy way) to those involved in a man-made factory. The machine combines two processes. First, it requires explicit compartmentalisation, including scaffolding structures similar to the chassis of engineered machines. In addition, cells clearly define an inside, the cytoplasm, and an outside. The cell envelope is more or less complicated in bacteria, and much more complicated in organisms made of cells with a nucleus (eukaryotes). Second, the machine also requires dynamic chemical processes, collectively named metabolism, that can be split into intermediary metabolism, managing chemical transformations, transport of small building blocks and management of energy (often with a rotating nanomachine, ATP synthase), and the macromolecule synthesis, salvage and turnover machinery, which uses a variety of nanomachines, the ribosome being the most prominent one. The second entity which needs to be associated with life is the genetic program, in the form of the genome, composed of one or several chromosomes made of DNA. This is the entity which associates most clearly with information.
The association between a machine and a program represented as a linear sequence of symbols is highly suggestive of the construction of a Turing Machine, the abstract representation of the human artefacts we have constructed to carry out all the operations of computing and logic, the ubiquitous computers. Many features of cells suggest that this analogy is much deeper than a simple metaphor, and that the cell is a real implementation of a Turing Machine, with the remarkable feature that these particular instances of Turing Machines use their computing power to make Turing Machines (cells make cells). Indeed, an experiment such as that of genome transplantation, with generation of Mycoplasma mycoides cells driven by the chromosome of a species (M. mycoides) differing from that of the initial host cell (M. capricolum), is an overwhelming argument in favour of this model.
The common rebuttal, raised against the idea of the cell as a Turing Machine by those who are reluctant to leave the firm ground of traditional biology, is that the cell's information content is much higher than that present in its chromosome. This argument does not hold, however, as we meet exactly the same situation with real computers, which nobody would challenge as material implementations of Turing Machines: engineering the explicit machine that reads programs requires far more information than is contained in the program loaded with the Operating System (OS) that makes the computer work. Some have further argued against the « computer » model by stating that, in a biological machine, it is not possible to completely separate the hardware from the software, and this is correct. However, exactly as in the case of the objection raised by the existence of information in the machine itself, in addition to that in the OS, the same holds true for the way the OS is carried into the machine. As with the DNA that carries the genetic program, while an OS is an abstract entity, to be usable it must be carried by concrete objects, such as flash memories, Compact Disks (CDs) or magnetic tapes. Let us imagine a computer driven by a program stored on a CD. If the CD has been left standing for some time on the rear shelf of a car in the sun, it will be deformed, and despite the fact that the program it carries is unaltered, it will no longer be readable by the computer's laser beam. This does not alter the very existence of the abstract laws establishing what a computer is (a Turing Machine), but it tells us that in any real implementation of a Turing Machine, one cannot completely separate the hardware from the software.
This model reminds us that there is a deep interaction between the level of information and that of matter, energy, space and time. It implies that all processes involving the abstract entity, information, need to be concretely implemented. We therefore need to look for concrete objects that implement biological functions. The concept of function is subject to a large variety of interpretations, owing to the teleological arguments frequently used to account for the existence of a given function. I simply assume here that we use « function » with its common-sense, intuitive (inaccurate!) meaning, as engineers do when they construct a machine.
Many ubiquitous functions are required to make a cell. They are needed to implement the essential macromolecular biosynthesis processes: transcription, translation and replication. They are also essential to manage energy and exchanges between the inside and the outside of the cell. They need to control physical constraints, such as osmotic pressure or temperature. And, of course, there is a need to synthesise the building blocks that cannot be obtained outside the cell. A standard way to try and gain access to these ubiquitous functions is to think that they will be associated with ubiquitous structures. This (inappropriate) inference drove the quest for a « minimal genome » that would lie at the intersection of the genomes of all autonomous cells of a given clade, because they share a common history. Unfortunately this approach presupposes the existence of one common ancestor, and it also implies that there is generally a one-to-one correspondence between structure and function, while, in fact, many a structure can fulfil a given function. Such structures may be recruited by horizontal gene transfer, after acquisitive evolution, or even created de novo (we do not have clear ideas about that particular process, the path to gene creation, except for some hypotheses, such as the « gluon » hypothesis we proposed a few years ago). This approach generally led to the identification of the so-called « housekeeping » genes, which are assumed to make up the minimal genome. And, as a matter of fact, these genes are identical to the approximately 250 genes first identified in silico, and then found to be essential for growth under laboratory conditions. Naturally, as with authentic computers, the final set-up requires « kludges » that allow the diverse parts to fit together smoothly. These kludges will be clade-specific and certainly not universal, although they will be necessary for each implementation of a small « minimal » genome.
Despite their amazing variety, all existing living organisms share a common underlying framework. This suggests that they may be the site of ubiquitous functions (not ubiquitous structures!). It is therefore a natural endeavour to try and identify those ubiquitous functions. A logical way would be to compare many genomes and, assuming that genes underlie functions, to compare their gene content. The simplest way would be to devise means to identify those genes which code for the same functions in different organisms (orthologs) and to intersect a large number of genome sequences in order to identify the orthologs they share. Any attempt in this direction is, however, doomed to fail: because living organisms are information traps, there is no reason why a given function should always be performed by the same structural entity. Hence there is no reason that ubiquitous genes would correspond to ubiquitous functions.
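The intersection procedure, and its weakness, can be sketched with toy data. The gene labels below are purely illustrative (they are not real ortholog tables); the point is that each added genome can only shrink the shared core, and a genome that performs a function with a non-orthologous replacement removes that function's gene from the « minimal » set even though the function itself is conserved.

```python
from functools import reduce

# Hypothetical ortholog sets for four genomes (toy labels, not real data).
# "genome_D" performs the ftsZ-like function with a non-orthologous
# replacement, here labelled "ftsZ2" for illustration.
genomes = {
    "genome_A": {"rpoB", "gyrA", "ftsZ", "recA", "dnaK", "trpE"},
    "genome_B": {"rpoB", "gyrA", "ftsZ", "recA", "dnaK"},
    "genome_C": {"rpoB", "gyrA", "ftsZ", "dnaK", "metK"},
    "genome_D": {"rpoB", "gyrA", "dnaK", "ftsZ2"},
}

# Intersecting more and more genomes monotonically shrinks the shared core:
core = reduce(set.intersection, genomes.values())
print(sorted(core))  # ['dnaK', 'gyrA', 'rpoB']
```

The ftsZ-like function is ubiquitous in this toy example, yet no ubiquitous gene survives the intersection, which is exactly why ubiquitous functions need not be revealed by ubiquitous structures.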
The genes in the paleome code for functions allowing construction of the cell, functions allowing replication of the program, and other functions of two types: on the one hand, functions that manage metabolic conflicts in the cell (metabolic frustration); on the other hand, energy-dependent degradative complexes.
Our central conjecture is that this use of energy is not meant for degradation itself, but for preventing the degradation of functional entities. In this way, these functions are the physical implementation of Maxwell's demons.
Note that this definition of complexity goes against a considerable number of usages, which would certainly not identify complexity with increased randomness. This is why I consider that the word should be avoided at all costs, and that we should instead use the Greek equivalent, symplectic, when we wish to describe highly organised systems (de Lorenzo, V. and Danchin, A. 2008. Synthetic Biology and the discovery of new worlds and new words. The new and not so new aspects of this emerging research field. EMBO Reports 9: 822-827).