Brownian genomes and co

The gene's eye-view of quantitative genetics

Explainer for mathematicians (19-02-2025)

I posted my first preprint on Arxiv yesterday (18-02-2025) ! You can find it here. My goal here is to make a summary for people with a mathematical background. I wrote a different summary for people with a more biological background.

Definition and notations

A genotype is an element of \(\square_{[L]}:=\{-1,+1\}^L\). If \(\gamma\equiv(\gamma^\ell)_{\ell\in[L]}\in\square_{[L]}\) is a genotype such that \(\gamma^\ell=+1\) (resp -1), we say \(\gamma\) has the +1 allele (resp. -1) at locus \(\ell\). A (haploid) population is a probability distribution on \(\square_{[L]}\). A population represents a large set of biological organisms, each with a certain genotype. The set of all populations is denoted \(\mathbb{X}^{[L]}\). A population \(\mathbf{x}\in\mathbb{X}^{[L]}\) can also be viewed as a vector \((x(\gamma))_{\gamma\in\square_{[L]}}\). A genome of \(\mathbf{x}\) is a random variable \(g\equiv (g^\ell)_{\ell\in[L]}\in\square_{[L]}\) with law \(\mathbf{x}\).

Here is a representation of a population in which only four different genotypes are present.

The base model

Our base system is a Stochastic Differential Equation (SDE) on \(\mathbb{X}^{[L]}\) inspired by populations genetics \[ d\mathbf{X}_t =\Theta(\mathbf{X}_t)dt + \rho R(\mathbf{X}_t)dt + LS(\mathbf{X}_t)dt + \Sigma(\mathbf{X}_t)d\mathbf{B}_t \tag{1} \] There are four terms, which is unusually large for mathematicians. As we will see, all four of them interact in a way that gives a special flavor to this model. Let us summarize what the four terms mean.

The effect of mutation on an organism with genotype \(\gamma\in\square_{[L]}\) is that a -1 (resp. +1) allele becomes a +1 (resp. -1) allele with rate \(\theta_1\) (resp. \(\theta_2\)). This is pretty standard in such models. The role of mutation is to keep the system running. Without it, eventually the population would only consist of a single genotype.

Recombination is a tricky force. Physicists see it as a collision operator, which is how it is here represented. Two organisms "collide" and produce two offsprings, with parts of their genomes exchanged. This is not how a biologist will perceive the effect of recombination, but mathematically it does make things clearer.

Recombination is a shuffling force. If the system only had recombination (and possibly mutation) then it will eventually reach a well-mixed state which biologists call linkage equilibrium (LE). In this state, if \(g\) is a genome sampled in the population, then for any \(\ell_1,\ell_2\), \(g^{\ell_1}\) is independent of \(g^{\ell_2}\) (more generally, the \((g^\ell)_{\ell\in[L]}\) are mutually independent). We can write this independence property \[ \mathbf{Cov}_{\mathbf{x}}[g^{\ell_1},g^{\ell_2}] = 0 \tag{2} \] where \(\mathbf{Cov}_{\mathbf{x}}\) is the covariance associated with the population \(\mathbf{x}\). For any population \(\mathbf{x}\), there is a single population called the LE projection of \(\mathbf{x}\), denoted \(\pi(\mathbf{x})\), such that the system starting from \(\mathbf{x}\) evolving under recombination alone converges to \(\pi(\mathbf{x})\).

The parameter \(\rho\) is the strength of recombination which quantifies the rate of collision.

Natural selection acts as a potential, stabilizing certain genotypes and penalizing others. The way this is encoded is that we have a function \(W: \square_{[L]}\longmapsto \mathbb{R}\) called the logfitness function, which represents how adapted a genotype is. The effect of selection on the frequency of genotype \(\gamma\) is then \[ LS(\mathbf{x})(\gamma) = \mathbf{Cov}_{\mathbf{x}}[W(g),\mathbf{1}_{[\gamma=g]}] \tag{3} \] The factor \(L\) represents the strength of selection. It is the same \(L\) as for the number of loci, for reasons we will explain later. It is not evident from its definition, but natural selection is a collision operator just like recombination, as depicted in the figure above. Under natural selection, two organisms with genomes \(g_1\) and \(g_2\) collide and, with probability proportional to \(W(g_1)-W(g_2)\), a copy of \(g_1\) replaces \(g_2\).

A typical example of logfitness function \(W\) of interest in biology is so-called stabilizing selection \[ W(g) = -\frac{\omega}{2L}\left(\sum_{\ell\in[L]}g^\ell\right)^2 \tag{4} \] where \(\omega\) is the strength of selection. This means we favor genotypes which have as many + as -. This means selection will act against recombination, because it will tend to create negative covariance \[ \mathbf{Cov}_{\mathbf{x}}[g^{\ell_1},g^{\ell_2}]\lt 0 \] as opposed to (2). The general model we study for selection is of the form \[ W(g) = LU(Z(g)) \tag{5} \] where \(U\) is a polynomial of degree 2 and \(Z(g):=\frac{1}{L}\sum_{\ell\in[L]} g^\ell\). We interpret \(Z(g)\) as the value of a trait such as height or BMI for all organisms with genotype \(g\). The field of biology which studies the genetic basis of such traits is called quantitative genetics, and has established that they are highly polygenic. Assuming as we do that an organism's genes impact the value of the trait additively is the basis model of quantitative genetics.

Genetic drift is the source of noise in the evolution of \((\mathbf{X}_t)_{t\in[0,T]}\). We can see it as noise surrounding the action of natural selection, as if for every collision in which natural selection occurs, there were many collisions in which fitness did not matter and any organism can replace any other.

Let me stress one crucial feature of genetic drift, which is that just like natural selection it may oppose recombination, in particular preventing the population of reaching LE described in equation (2).

Putting all four ingredients together we get equation (1). Now, we want to take the polygenic limit \(L\to+\infty\). How to describe the system then ?

This is where we take the gene's eye-view. By this we mean that we use the following projection \[ p^\ell(\mathbf{x}) := \mathbf{x}[\mathbf{1}_{[g^\ell=+1]}] \] and we look at the process \((p^\ell(\mathbf{X}_t))_{t\in[0,T]}\). It turns out we get \[\begin{multline} dp^\ell(\mathbf{X}_t) = (\theta_1 (1-p^\ell(\mathbf{X}_t)) - \theta_2p^\ell(\mathbf{X}_t))dt + \mathbf{Cov}_{\mathbf{X}_t}[W(g),\mathbf{1}_{[g^\ell=+1]}]dt\\ +\sqrt{p^\ell(\mathbf{X}_t)(1-p^\ell(\mathbf{X}_t))}dB_t^\ell \tag{6} \end{multline} \] where \(B^\ell\) is a Brownian motion. This equation is interesting, because it looks just like a well-known SDE from population genetics, the Wright-Fisher diffusion \[ dp_t = (\theta_1 (1-p_t) - \theta_2 p_t)dt + s_t p_t(1-p_t)dt + \sqrt{p_t(1-p_t)}dB_t \tag{7} \] In this equation \(s_t\) represents the selection coefficient at time \(t\) and \(B\) is a Brownian motion. We see that the difference between equations (6) and (7) is in the selection term. If we replace \(\mathbf{X}_t\) by its "well-shuffled" LE projection \(\pi(\mathbf{X}_t)\) (see (2)) we find \[ \mathbf{Cov}_{\pi(\mathbf{X}_t)}[W(g),\mathbf{1}_{[g^\ell=+1]}] \simeq \overline{s}(\mu_{\mathbf{X}_t}) p^\ell(\mathbf{X}_t)(1-p^\ell(\mathbf{X}_t)) \] where \(\overline{s}\) is a function on the space of probability distributions on \([0,1]\) and \(\mu_{\mathbf{X}_t}\) is the empirical distribution of the \((p^\ell(\mathbf{X}_t))_{\ell\in[L]}\) \[\mu_{\mathbf{X}_t}=\frac{1}{L}\sum_{\ell\in[L]} \delta_{p^\ell(\mathbf{X}_t)}\] This means we can expect that if \(\mathbf{X}_t\) is close to \(\pi(\mathbf{X}_t)\) then by a mean-field approximation \[ \mathbf{Cov}_{\mathbf{X}_t}[W(g),\mathbf{1}_{[g^\ell=+1]}] \simeq \overline{s}(\mathcal{L}(p^\ell(\mathbf{X}_t))) p^\ell(\mathbf{X}_t)(1-p^\ell(\mathbf{X}_t)) \tag{8} \] where \(\mathcal{L}(p^\ell(\mathbf{X}_t))\) is the law of \(\mathbf{X}_t\).

The goal of our paper is to prove that if we assume a certain scaling relationship between \(\rho\) and \(L\) to hold as \(L\to+\infty\), then \(p^\ell(\mathbf{X}_t)\) satisfies the equation (7) with selection coefficient at time \(t\) given by (8). Ideally, we would want to find the smallest recombination rate \(\rho\) such that this holds.

This requires to control how far we are from LE. Specifically, we want to control \(\mathbf{X}_t-\pi(\mathbf{X}_t)\). This is where much of the work is. Specifically, we control the marginal \(\mathbf{X}_t^A-\pi(\mathbf{X}_t)^A\) on small subsets \(A\subseteq[L]\). Because \(W\) is of order \(L\) (see (5)), we need \(\mathbf{X}_t^A-\pi(\mathbf{X}_t)^A\) to be of order \(\frac{1}{L}\).

Without going into the details, here is the main result. We find the desired convergence to hold in a general sense if we have the scaling \[ \rho r^{**} \gg L^2\ln(\rho) \tag{9} \] Here, \(r^{**}\) is what we call the harmonic recombination rate. Basically, if you look at the figure for recombination above, you can see that for any distinct loci \(\ell_1,\ell_2\in [L]\) there is a certain probability \(r^{\{\ell_1,\ell_2\}}\) that a recombination event separates the two loci. Then \(r^{**}\) is defined with \[ \frac{1}{r^{**}} = \frac{1}{L(L-1)} \sum_{\ell_1\neq\ell_2} \frac{1}{r^{\{\ell_1,\ell_2\}}} \]

Let's give a heuristic explanation for (9). Consider an Ornstein-Uhlenbeck process \[ dY_t = -\alpha Y_t dt + \sigma dB_t \] Such a process at equilibrium has a typical distance to zero \(\sqrt{\frac{\sigma^2}{2\alpha}}\). It turns out \(||\mathbf{X}_t^A-\pi(\mathbf{X})^A||_2\) can be dominated by such an equation, with \(\alpha=\min_{\ell_1,\ell_2\in A}\rho r^{\{\ell_1,\ell_2\}}\) and \(\sigma\) being of order 1. So, in order to get that for typical \(A\), \(\mathbf{X}_t^A-\pi(\mathbf{X}_t)^A\) is of order \(\frac1L\), we need \(\sqrt{\rho r^{**}}\) to be of order \(L\). This explains (9), except for the additionnal \(\ln(\rho)\) which comes from the need to control exceptional events.

It makes sense why the harmonic recombination rate \(r^{**}\) (and not, say, the mean recombination rate) should matter. Intuitively, the smallest values of the recombination rates \((r^{\{\ell_1,\ell_2\}})_{\ell_1\neq\ell_2}\) should determine whether the shuffling force of recombination is sufficiently strong or not to keep the population well-mixed.

We believe (9) is not optimal, and I would be extremely interested if anyone wanted to help me identify and characterize the phase transition towards the so-called non-random coexistence phase.

About Me

A photo of me

I'm Philibert Courau, a PhD student in École Normale Supérieure/Wien Universität (Vienna University). I'm working on probabilistic models for the evolution of biological populations, specifically quantitative genetics and polygenic adaptation. My supervisors are Amaury Lambert and Emmanuel Schertzer. You can find my Résumé here.

Popular Post

Eventually

Eventually

Eventually

Follow Me

Not very active, but I do have a Mastodon account

Websites I like

Here is a list of websites I like