I posted my first preprint on arXiv yesterday (18-02-2025)! You can find it here. My goal here is to give a summary for people with a biology background, even though the preprint is very mathematics-oriented (I wrote a different summary for people with a more mathematical background). I am currently working on applying this project in a much more biologist-friendly setting.
We consider a large haploid panmictic population of organisms with non-overlapping generations evolving under the forces of natural selection, genetic drift, mutation and recombination. Below is a representation of this population.
Figure 1.
Each line represents a haploid organism. Each organism has the same number of genes (five, located on five loci), and each gene can be either a \(+1\) gene or a \(-1\) gene (the loci are biallelic). For instance, we can imagine that having a \(+1\) gene makes you a bit taller, while having a \(-1\) gene makes you a bit shorter.
The population can therefore be summed up as an \(N\times L\) matrix (\(N\) organisms, \(L\) loci) in which entry \((i,j)\) indicates whether organism number \(i\) has a \(+1\) or a \(-1\) gene at locus \(j\).
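To make the matrix picture concrete, here is a minimal sketch in plain Python. The function names (`random_population`, `plus_allele_frequencies`) and the choice to draw alleles uniformly at random are illustrative assumptions, not something taken from the preprint.

```python
import random


def random_population(n_organisms, n_loci, rng):
    """Return an N x L list of lists with entries +1 or -1.

    Alleles are drawn uniformly at random; this is only an illustrative
    starting point, not the initial condition used in the preprint.
    """
    return [[rng.choice((1, -1)) for _ in range(n_loci)]
            for _ in range(n_organisms)]


def plus_allele_frequencies(population):
    """Frequency of the +1 allele at each locus (one value per column)."""
    n_organisms = len(population)
    n_loci = len(population[0])
    return [sum(1 for genome in population if genome[j] == 1) / n_organisms
            for j in range(n_loci)]


rng = random.Random(0)
pop = random_population(4, 5, rng)   # four organisms, five loci, as in Figure 1
freqs = plus_allele_frequencies(pop)
for genome in pop:
    print(genome)
print(freqs)
```

Each row of `pop` is one organism and each column is one locus, so the per-locus allele frequencies discussed later are simply column summaries of this matrix.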
Each new generation is produced as follows
Concerning the logfitness function \(W\), we consider that selection acts on a quantitative, fully additive trait. In practice, this means \(W(g)\) is a function of \[Z(g) := \frac{1}{L}\sum_{\ell\in[L]} g^\ell \tag{1}\] For instance, we consider a logfitness of the form \[W(g) = -L\omega\; Z(g)^2 \tag{2}\] When \(\omega>0\), this is stabilizing selection. When \(\omega\lt 0\), this is disruptive selection.
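As a small worked example of equations (1) and (2), the sketch below computes the trait \(Z(g)\) and the logfitness \(W(g)\) for a genome stored as a list of \(\pm 1\) values. The function names and the example genome are hypothetical, chosen only for illustration.

```python
def trait(genome):
    """Additive trait Z(g): the average of the +1/-1 alleles, as in equation (1)."""
    return sum(genome) / len(genome)


def log_fitness(genome, omega):
    """Quadratic logfitness W(g) = -L * omega * Z(g)^2, as in equation (2)."""
    n_loci = len(genome)
    return -n_loci * omega * trait(genome) ** 2


g = [1, 1, -1, 1, -1]          # hypothetical genome with L = 5 loci
print(trait(g))                # three +1 and two -1 alleles: Z = 1/5
print(log_fitness(g, 1.0))     # stabilizing selection (omega > 0)
print(log_fitness(g, -1.0))    # disruptive selection (omega < 0)
```

With \(\omega>0\) the logfitness is largest at \(Z(g)=0\), which is why genomes with as many \(+1\) as \(-1\) alleles are favored under stabilizing selection, while \(\omega<0\) rewards extreme trait values.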
Our goal is to describe the system when the population size \(N\) and the number of loci \(L\) are both very large.
Imagine the previous system with \(L=1\) fixed and \(N\) large. Then it is a well-known result that we may describe the evolution of the frequency \(p_t\) of the \(+1\) allele with a Wright-Fisher diffusion \[dp_t = sp_t(1-p_t)dt + (\theta_1(1-p_t) - \theta_2 p_t)dt + \sqrt{p_t(1-p_t)}dB_t \tag{3} \] where \(s\) is the selection coefficient and \(B\) is a Brownian motion. I don't want you to understand the exact meaning of this equation, only that the three terms correspond respectively to natural selection, mutation and genetic drift.
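For readers who prefer to see the three terms in action, here is a rough simulation sketch of the single-locus Wright-Fisher diffusion above, using a simple Euler-Maruyama discretization. The discretization scheme, the clipping of \(p\) to \([0,1]\), the function name and the parameter values are all assumptions made for illustration; this is not how the preprint analyzes the equation.

```python
import math
import random


def simulate_wright_fisher(p0, s, theta1, theta2, t_max, dt, rng):
    """Euler-Maruyama sketch of the single-locus Wright-Fisher diffusion.

    Returns the list of allele frequencies at times 0, dt, 2*dt, ...
    """
    p = p0
    path = [p]
    n_steps = int(round(t_max / dt))
    for _ in range(n_steps):
        selection = s * p * (1.0 - p)                  # natural selection term
        mutation = theta1 * (1.0 - p) - theta2 * p     # mutation term
        drift_sd = math.sqrt(max(p * (1.0 - p), 0.0))  # genetic drift amplitude
        noise = rng.gauss(0.0, 1.0) * math.sqrt(dt)    # Brownian increment
        p += (selection + mutation) * dt + drift_sd * noise
        # Assumption: clip to [0, 1], since the discretized path can overshoot.
        p = min(1.0, max(0.0, p))
        path.append(p)
    return path


rng = random.Random(42)
path = simulate_wright_fisher(p0=0.5, s=1.0, theta1=0.1, theta2=0.1,
                              t_max=1.0, dt=0.001, rng=rng)
print(path[0], path[-1])
```

Setting `s`, `theta1` and `theta2` to zero leaves only the drift term, which is a quick way to see how each force shapes the trajectory on its own.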
It turns out that for a general \(L\gg 1\), we can write a similar equation for the frequency \(p_t^\ell\) of the \(+1\) allele at locus \(\ell\) \[dp_t^\ell = s^\ell_t p_t^\ell(1-p_t^\ell)dt + (\theta_1(1-p_t^\ell) - \theta_2 p_t^\ell)dt + \sqrt{p_t^\ell(1-p_t^\ell)}dB_t^\ell \tag{4} \] where \(s^\ell_t\) is the selection coefficient at locus \(\ell\) at time \(t\) and \(B^\ell\) is a Brownian motion.
Equation (4) looks very similar to equation (3). In particular, recombination does not appear, because recombination does not have a direct effect on allele frequencies. However, recombination does play two hidden roles in equation (4):
The important message to get from this section is that in a polygenic system, the evolution of an allele frequency looks like a Wright-Fisher diffusion, in which linkage disequilibrium plays a hidden role.
Equation (4) is not very convenient because the selection coefficient \(s_t^\ell\) is ill-behaved. It can fluctuate randomly and depends not just on the frequencies \((p_t^{\ell'})_{\ell'\neq\ell}\) at other loci, but also on linkage disequilibrium. In such a situation, it is common to assume that recombination is strong enough that we may neglect linkage. But it is very rare that people quantify precisely how strong recombination needs to be. The best example I know of is this one, with a slightly different model from ours (infinitely many loci, selection of order 1, no genetic drift, weaker mutation).
If we can neglect linkage disequilibrium, then when \(L\) is large, \(s_t^\ell\) can be expressed using a mean-field approximation. To explain what a mean-field approximation is, let me make a small detour.
Imagine we have a large box with agitated particles randomly moving inside without ever colliding. Imagine each particle is weakly pulled to every other. The following picture represents this:
In this picture, the green particle represents the center of mass of all the particles. The red and blue particles are two specific tagged particles. As the number of particles becomes large, we will see two phenomena
In equation (4), as the number \(L\) of loci goes to infinity, the behavior of \(s_t^\ell\) becomes deterministic (just like the green particle) and any two distinct loci evolve independently.
This article is still rather far from practical applications (upcoming work will try to bridge this). Nevertheless I would like to stress the following results:
We do not believe (5) to be optimal. In particular, if \(\rho\gg L^2\ln(L)^2\) we guarantee that our description for the limit system will hold, but this does not mean that for a smaller \(\rho\) the description becomes invalid. In fact, physicists have done remarkable work on related questions for which the phase transition seems to be around \(\rho\sim W\).
Our results are slightly more general than what I have described. In particular, we allow for an arbitrary recombination mechanism. The crucial parameter we need is the probability \(r^{\{\ell_1,\ell_2\}}\) that recombination separates the genes at loci \(\ell_1\) and \(\ell_2\). From this we can estimate the harmonic recombination rate \[\frac{1}{r^{**}} := \frac{1}{L(L-1)}\sum_{\ell_1\neq\ell_2} \frac{1}{r^{\{\ell_1,\ell_2\}}}\] Then the condition (5) becomes \[\rho r^{**} \gg L^2 \ln(\rho)\] It makes sense that the harmonic recombination rate \(r^{**}\) (and not, say, the mean recombination rate) should matter. Intuitively, the smallest values of the recombination rates \((r^{\{\ell_1,\ell_2\}})_{\ell_1\neq\ell_2}\) should determine whether the shuffling force of recombination is sufficiently strong or not to keep the population well-mixed.
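To illustrate why the harmonic recombination rate is driven by the most tightly linked pairs, here is a small sketch that computes \(r^{**}\) from pairwise recombination probabilities. The dictionary layout, the function name and the numerical values are hypothetical; only the formula comes from the definition above.

```python
from itertools import combinations


def harmonic_recombination_rate(r, n_loci):
    """Harmonic mean r** of the pairwise recombination probabilities.

    `r` maps frozenset({l1, l2}) to the probability that recombination
    separates loci l1 and l2. Since that probability is symmetric in the
    pair, averaging 1/r over unordered pairs equals the average over
    ordered pairs used in the definition.
    """
    pairs = list(combinations(range(n_loci), 2))
    mean_inverse = sum(1.0 / r[frozenset(pair)] for pair in pairs) / len(pairs)
    return 1.0 / mean_inverse


# Hypothetical example with L = 3 loci: loci 0 and 1 are tightly linked,
# while every other pair recombines freely (probability 1/2).
r = {
    frozenset({0, 1}): 0.001,
    frozenset({0, 2}): 0.5,
    frozenset({1, 2}): 0.5,
}
r_harmonic = harmonic_recombination_rate(r, 3)
r_arithmetic = sum(r.values()) / len(r)
print(r_harmonic, r_arithmetic)
```

In this toy example the arithmetic mean stays around one third, while the harmonic rate drops to roughly three times the smallest pairwise value, so a single tightly linked pair is enough to make the condition on \(\rho r^{**}\) much more demanding.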
We also allow selection of the form \[W(g) = LU(Z(g)) \tag{6}\] where \(Z(g)\) was defined in (1) and \(U\) is any polynomial of order at most 2, with coefficients of order 1.
There is one aspect of our model which took us a lot of time to really understand, and that is the fact that (6) corresponds to very weak selection. Indeed, one should think of it this way: the difference in logfitness between the fittest organism possible and the least fit organism possible is of order \(L\). But the population only explores a very tiny proportion of the possible organisms (for instance, in Figure 1 there are only four organisms, but the number of genomes possible is \(2^5=32\)).
In fact, selection is so weak that adaptation occurs on the same timescale as genetic drift and mutation. In particular, if we consider stabilizing selection (2) with mutational bias (\(\theta_1\neq\theta_2\)), then the population can never reach the fitness optimum. This is reminiscent of Michael Lynch's drift-barrier hypothesis. But for usual quantitative traits such as morphological traits, this is questionable. Such traits are usually found to be close to the optimum in wild populations.
This figure illustrates the behavior of the population when there is mutational bias. The red curve represents the fitness landscape, while the blue curve is the population trait distribution. Our regime of selection is so weak that the population stays far from the optimum. At such a distance, stabilizing selection is equivalent to directional selection. Of course, selection acts to bring the population closer to the optimum, but mutation drives the population away.
This leads us to believe that \(W\) should not be of order \(L\), but of order \(L^a\) for some parameter \(a\geq 1\). In fact, our upcoming work will focus on the case \(a\in[1,2]\).
Mutational bias typically refers to the case where new mutations have a non-zero mean effect. One thing that really surprised us when going through the literature is how few models take this effect into account. Theoretically, it would require some sort of cosmic coincidence for mutations to agree with selection on the optimum. And in practice (as we discuss in the paper), there are empirical grounds for adding this bias in the modelling.
I'm Philibert Courau, a PhD student at École Normale Supérieure/Wien Universität (Vienna University). I'm working on probabilistic models for the evolution of biological populations, specifically quantitative genetics and polygenic adaptation. My supervisors are Amaury Lambert and Emmanuel Schertzer. You can find my résumé here.
Not very active, but I do have a Mastodon account