Brownian genomes and co

The gene's eye-view of quantitative genetics

Explainer for biologists (19-02-2025)

I posted my first preprint on arXiv yesterday (18-02-2025)! You can find it here. My goal here is to summarize it for readers with a biology background, even though the preprint is very mathematics-oriented (I wrote a different summary for people with a more mathematical background). I am currently working on applying this project in a much more biologist-friendly setting.

The discrete model

We consider a large haploid panmictic population of organisms with non-overlapping generations evolving under the forces of natural selection, genetic drift, mutation and recombination. Below is a representation of this population.

Figure 1.

Each line represents a haploid organism. Each organism has the same number of genes (five, located on five loci), and each gene can be either a \(+1\) gene or a \(-1\) gene (the loci are biallelic). For instance, we can imagine that having a \(+1\) gene makes you a bit taller, while having a \(-1\) gene makes you a bit shorter.

The population can therefore be summed up as a matrix of shape \(N,L\) in which coordinate number \((i,j)\) indicates whether organism number \(i\) has a \(+1\) or a \(-1\) gene at locus \(j\).

Each new generation is produced as follows

  1. Every new organism chooses two parents \(g_1,g_2\) independently, in such a way that the probability of choosing a parent with genome \(g\) is proportional to \(e^{W(g)}\), where \(W(g)\) is the logfitness of \(g\).
  2. With probability \(1-\frac{\rho}{N}\), there is no recombination: the offspring inherits all the genes of one of the two parents. Otherwise, we sample a random subset \(A \subseteq [L]\) (recall \([L]\) is the set \(\{1,\dots,L\}\)). Then the offspring is the genome \(g\) such that if \(\ell\in A\), then \(g^\ell = g_1^\ell\), otherwise \(g^\ell=g_2^\ell\) .
  3. The offspring mutates: any \(-1\) (resp. \(+1\)) allele mutates into a \(+1\) (resp. \(-1\)) allele with probability \(\frac{\theta_1}{N}\) (resp. \(\frac{\theta_2}{N}\)).

Concerning the logfitness function \(W\), we consider that selection acts on a quantitative, fully additive trait. In practice, this means \(W(g)\) is a function of \[Z(g) := \frac{1}{L}\sum_{\ell\in[L]} g^\ell \tag{1}\] For instance, we consider a logfitness of the form \[W(g) = -L\omega\; Z(g)^2 \tag{2}\] When \(\omega>0\), this is stabilizing selection. When \(\omega\lt 0\), this is disruptive selection.
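To make the three steps concrete, here is a minimal simulation sketch of one generation, written with NumPy. It is only an illustration, not the code behind the preprint. The function names (`log_fitness`, `next_generation`) are mine, and I assume that the random subset \(A\) is uniform, meaning each locus belongs to \(A\) independently with probability \(1/2\). When there is no recombination, the sketch always copies the first parent, which is equivalent in distribution since the two parents are drawn in the same way.

```python
import numpy as np

def log_fitness(pop, omega):
    """W(g) = -L * omega * Z(g)^2, where Z(g) is the mean allelic value (equations 1 and 2)."""
    L = pop.shape[1]
    z = pop.mean(axis=1)
    return -L * omega * z**2

def next_generation(pop, omega, rho, theta1, theta2, rng):
    """Apply steps 1 to 3 once to an (N, L) array of +1/-1 alleles."""
    N, L = pop.shape
    # Step 1: each offspring picks two parents with probability proportional to exp(W).
    w = log_fitness(pop, omega)
    probs = np.exp(w - w.max())  # subtracting the max avoids numerical overflow
    probs /= probs.sum()
    parent1 = pop[rng.choice(N, size=N, p=probs)]
    parent2 = pop[rng.choice(N, size=N, p=probs)]
    # Step 2: with probability rho/N, recombine along a random subset A of loci.
    # Assumption: each locus is in A independently with probability 1/2.
    recombines = rng.random(N) < rho / N
    in_A = rng.random((N, L)) < 0.5
    take_parent1 = np.where(recombines[:, None], in_A, True)
    child = np.where(take_parent1, parent1, parent2)
    # Step 3: -1 -> +1 with probability theta1/N, +1 -> -1 with probability theta2/N.
    flip_prob = np.where(child == -1, theta1 / N, theta2 / N)
    flips = rng.random((N, L)) < flip_prob
    return np.where(flips, -child, child)

rng = np.random.default_rng(0)
pop = rng.choice([-1, 1], size=(200, 50))  # N = 200 organisms, L = 50 loci
for _ in range(100):
    pop = next_generation(pop, omega=1.0, rho=20.0, theta1=1.0, theta2=1.0, rng=rng)
print("mean trait value after 100 generations:", pop.mean())
```

The population is stored exactly as the \(N \times L\) matrix described above, so allele frequencies at each locus can be read off with `(pop == 1).mean(axis=0)`.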

Large genome limit

Our goal is to describe the system when the population size \(N\) and the number of loci \(L\) are both very large.

Imagine the previous system with \(L=1\) fixed and \(N\) large. Then it is a well-known result that we may describe the evolution of the frequency \(p_t\) of the \(+1\) allele with a Wright-Fisher diffusion \[dp_t = sp_t(1-p_t)dt + (\theta_1(1-p_t) - \theta_2 p_t)dt + \sqrt{p_t(1-p_t)}dB_t \tag{3} \] where \(s\) is the selection coefficient and \(B\) is a Brownian motion. You don't need to understand the exact meaning of this equation, only that the three terms correspond respectively to natural selection, mutation and genetic drift.
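If you prefer to see the Wright-Fisher diffusion above as something you can simulate, here is a rough sketch using the Euler-Maruyama scheme (a standard way to discretize such equations in small time steps). The function name `simulate_wright_fisher`, the step size and the parameter values are my own choices for illustration. Clipping the frequency to \([0,1]\) is a crude fix for discretization error, not part of the equation itself.

```python
import numpy as np

def simulate_wright_fisher(p0, s, theta1, theta2, T, dt, rng):
    """Approximate one path of the single-locus Wright-Fisher diffusion on [0, T]."""
    n_steps = int(round(T / dt))
    p = np.empty(n_steps + 1)
    p[0] = p0
    for k in range(n_steps):
        x = p[k]
        selection = s * x * (1 - x)                  # natural selection term
        mutation = theta1 * (1 - x) - theta2 * x     # mutation term
        drift_noise = np.sqrt(x * (1 - x) * dt) * rng.standard_normal()  # genetic drift term
        # Clip to [0, 1]: the discretized path can slightly overshoot the boundaries.
        p[k + 1] = min(max(x + (selection + mutation) * dt + drift_noise, 0.0), 1.0)
    return p

rng = np.random.default_rng(0)
path = simulate_wright_fisher(p0=0.5, s=1.0, theta1=0.5, theta2=0.5, T=2.0, dt=0.001, rng=rng)
print("final frequency of the +1 allele:", path[-1])
```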

It turns out that for a general \(L\gg 1\), we can write a similar equation for the frequency \(p_t^\ell\) of the \(+1\) allele at locus \(\ell\) \[dp_t^\ell = s^\ell_t p_t^\ell(1-p_t^\ell)dt + (\theta_1(1-p_t^\ell) - \theta_2 p_t^\ell)dt + \sqrt{p_t^\ell(1-p_t^\ell)}dB_t^\ell \tag{4} \] where \(s^\ell_t\) is the selection coefficient at locus \(\ell\) at time \(t\) and \(B^\ell\) is a Brownian motion.

Equation (4) looks very similar to equation (3). In particular, recombination does not appear, because recombination does not have a direct effect on allele frequencies. However, recombination does play two hidden roles in equation (4):

  • The Brownian motions \((B^\ell)_{\ell\in[L]}\) are correlated. If two loci are in strong linkage disequilibrium, then the effects of genetic drift on the two loci are connected.
  • The selection coefficient \(s_t^\ell\) on locus \(\ell\) depends on what is happening at the other loci, and on linkage disequilibrium. It can encode direct selection (that is, whether the \(+1\) allele at locus \(\ell\) increases fitness) but also hitch-hiking. As a reminder, the latter effect means that an allele can be selected not because of intrinsic properties, but because it is linked to an allele (or a group of alleles) under selection.

The important message to get from this section is that in a polygenic system, the evolution of an allele frequency looks like a Wright-Fisher diffusion, in which linkage disequilibrium plays a hidden role.

The mean-field approximation

Equation (4) is not very convenient because the selection coefficient \(s_t^\ell\) is ill-behaved. It can fluctuate randomly and depends not just on the frequencies \((p_t^{\ell'})_{\ell'\neq\ell}\) at other loci, but also on linkage disequilibrium. In such a situation, it is common to assume that recombination is strong enough that we may neglect linkage. But people very rarely quantify precisely how strong recombination needs to be. The best example I know of is this one, with a slightly different model from ours (infinitely many loci, selection of order 1, no genetic drift, weaker mutation).

If we can neglect linkage disequilibrium, then when \(L\) is large, \(s_t^\ell\) can be expressed using a mean-field approximation. To explain what a mean-field approximation is, let me make a small detour.

A small detour on mean-field approximation

Imagine we have a large box with agitated particles randomly moving inside without ever colliding. Imagine each particle is weakly pulled to every other. The following picture represents this:

In this picture, the green particle represents the center of mass of all the particles. The red and blue particles are two specific tagged particles. As the number of particles becomes large, we will see two phenomena

  • On one hand, the movement of the green particle will become deterministic. This is a form of "law of large numbers", except the particles are not exactly independent.
  • On the other hand, the interaction between the red and the blue particles will grow weaker, because the blue particle hardly affects the position of the green particle. In particular, in the limit the red and blue particles will be independent.
These two effects form what is called a mean-field approximation (also known as propagation of chaos).
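Both effects can be seen in a toy simulation. The sketch below is a simplified stand-in for the picture: I assume the particles move on a line (no box), that each one is pulled linearly towards the center of mass, and that each receives independent Gaussian kicks. The function name `mean_field_demo` and all parameter values are hypothetical. It returns two numbers measured across many independent replicates: how much the center of mass fluctuates, and the correlation between two tagged particles.

```python
import numpy as np

def mean_field_demo(n_particles, n_replicates, rng, pull=1.0, sigma=1.0, T=1.0, dt=0.01):
    """Simulate weakly attracting particles; return (spread of center of mass, correlation of two tagged particles)."""
    n_steps = int(round(T / dt))
    x = np.zeros((n_replicates, n_particles))  # every particle starts at position 0
    for _ in range(n_steps):
        center = x.mean(axis=1, keepdims=True)            # the "green particle"
        kicks = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + pull * (center - x) * dt + kicks          # pull towards the center, plus noise
    spread = x.mean(axis=1).std()                          # fluctuation of the center of mass
    corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]             # the "red" and "blue" particles
    return spread, corr

rng = np.random.default_rng(0)
for n in (2, 10, 100, 1000):
    spread, corr = mean_field_demo(n, n_replicates=200, rng=rng)
    print(f"{n:5d} particles: center-of-mass spread = {spread:.3f}, tagged correlation = {corr:.3f}")
```

In this toy model the center of mass only feels the average of the kicks, so its spread shrinks like \(1/\sqrt{n}\) as the number \(n\) of particles grows. The correlation estimate is noisier because it is computed from a finite number of replicates.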

Back to the genes

In equation (4), as the number \(L\) of loci goes to infinity, the behavior of \(s_t^\ell\) becomes deterministic (just like the green particle) and any two distinct loci evolve independently.

What our results are

This article is still rather far from practical applications (upcoming work will try to bridge this). Nevertheless I would like to stress the following results:

  • We describe the limit behavior of \(s_t^\ell\) explicitly (in fact, what we obtain is closely related to the breeder's equation).
  • We control how strong recombination (represented by the \(\rho\) parameter in step 2) needs to be in order to guarantee that linkage disequilibrium can be neglected. In a typical setting (such as a uniform crossing-over), we require \[\rho\gg L^2\ln(L)^2\tag{5}\] In particular this requires the population size \(N\) to be quite large (\(N\geq \rho \gg L^2\ln(L)^2\)).

We do not believe (5) to be optimal. In particular, if \(\rho\gg L^2\ln(L)^2\) we guarantee that our description for the limit system will hold, but this does not mean that for a smaller \(\rho\) the description becomes invalid. In fact, physicists have done remarkable work on related questions for which the phase transition seems to be around \(\rho\sim W\).

Additional precisions on the results

Our results are slightly more general than what I have described. In particular, we allow for an arbitrary recombination mechanism. The crucial parameter we need is the probability \(r^{\{\ell_1,\ell_2\}}\) that recombination separates the genes at loci \(\ell_1\) and \(\ell_2\). From this we can estimate the harmonic recombination rate \[\frac{1}{r^{**}} := \frac{1}{L(L-1)}\sum_{\ell_1\neq\ell_2} \frac{1}{r^{\{\ell_1,\ell_2\}}}\] Then the condition (5) becomes \[\rho r^{**} \gg L^2 \ln(\rho)\] It makes sense that the harmonic recombination rate \(r^{**}\) (and not, say, the mean recombination rate) should matter. Intuitively, the smallest values of the recombination rates \((r^{\{\ell_1,\ell_2\}})_{\ell_1\neq\ell_2}\) should determine whether the shuffling force of recombination is sufficiently strong or not to keep the population well-mixed.
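To see how the harmonic recombination rate behaves, here is a small sketch that computes \(r^{**}\) from a matrix of pairwise separation probabilities. The function name `harmonic_recombination_rate` is mine. As a purely hypothetical example, I assume a single crossover point chosen uniformly among the \(L-1\) junctions between neighboring loci, so that two loci are separated with probability \(|\ell_1-\ell_2|/(L-1)\). This is an illustration and not necessarily the recombination mechanism used in the preprint.

```python
import numpy as np

def harmonic_recombination_rate(r):
    """Harmonic mean of r[i, j] over all ordered pairs of distinct loci.

    r is an (L, L) symmetric array; r[i, j] is the probability that
    recombination separates loci i and j. The diagonal is ignored.
    """
    L = r.shape[0]
    distinct = ~np.eye(L, dtype=bool)          # mask selecting the L(L-1) pairs with i != j
    return L * (L - 1) / np.sum(1.0 / r[distinct])

# Hypothetical example: one crossover point, uniform among the L-1 junctions.
L = 100
positions = np.arange(L)
r = np.abs(positions[:, None] - positions[None, :]) / (L - 1)

distinct = ~np.eye(L, dtype=bool)
print("mean recombination rate:    ", r[distinct].mean())
print("harmonic recombination rate:", harmonic_recombination_rate(r))
```

The harmonic rate is always at most the mean rate, and it is pulled down by the tightly linked pairs of neighboring loci, which is the intuition given above.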

We also allow selection of the form \[W(g) = LU(Z(g)) \tag{6}\] where \(Z(g)\) was defined in (1) and \(U\) is any polynomial of order at most 2, with coefficients of order 1.

A biological discussion

The strength of selection

There is one aspect of our model which took us a lot of time to really understand, and that is the fact that (6) corresponds to very weak selection. Indeed, one should think of it this way: the difference in logfitness between the fittest organism possible and the least fit organism possible is of order \(L\). But the population only explores a very tiny proportion of the possible organisms (for instance, in Figure 1 there are only four organisms, but the number of genomes possible is \(2^5=32\)).

In fact, selection is so weak that adaptation occurs on the same timescale as genetic drift and mutation. In particular, if we consider stabilizing selection (2) with mutational bias (\(\theta_1\neq\theta_2\)), then the population can never reach the fitness optimum. This is reminiscent of Michael Lynch's drift-barrier hypothesis. But for usual quantitative traits such as morphological traits, this is questionable. Such traits are usually found to be close to the optimum in wild populations.

This figure illustrates the behavior of the population when there is mutational bias. The red curve represents the fitness landscape, while the blue curve is the population trait distribution. Our regime of selection is so weak that the population stays far from the optimum. At such a distance, stabilizing selection is equivalent to directional selection. Of course, selection acts to bring the population closer to the optimum, but mutation drives the population away.

This leads us to believe that \(W\) should not be of order \(L\), but of order \(L^a\) for some parameter \(a\geq 1\). In fact, our upcoming work will focus on the case \(a\in[1,2]\).

The importance of mutational bias

Mutational bias typically refers to the case where new mutations have a non-zero mean effect. One thing that really surprised us when going through the literature is how few models take this effect into account. Theoretically, it would require some sort of cosmic coincidence for mutations to agree with selection on the optimum. And in practice (as we discuss in the paper), there are empirical grounds for adding this bias in the modelling.

About Me


I'm Philibert Courau, a PhD student at École Normale Supérieure/Wien Universität (Vienna University). I'm working on probabilistic models for the evolution of biological populations, specifically quantitative genetics and polygenic adaptation. My supervisors are Amaury Lambert and Emmanuel Schertzer. You can find my Résumé here.

