In recent years, there has been an effort to extend the classical
notion of phylogenetic balance, originally defined in the context of
trees, to networks. One of the most natural ways to do this is with the
so-called B2 index. In this paper, we study the
B2 index for a prominent class of phylogenetic
networks: galled trees. We show that the B2 index of
a uniform leaf-labeled galled tree converges in distribution as the
network becomes large. We characterize the corresponding limiting
distribution, and show that its expected value is 2.707911858984...
This is the first time that a balance index has been
studied to this level of detail for a random phylogenetic network.
A distinctive feature of this work is that we use two different and independent
approaches, each with its advantages: analytic combinatorics, and local
limits. The analytic combinatorics approach is more direct, as it relies
on standard tools; but it involves slightly more complex calculations.
Because it has not previously been used to study such questions, the
local limit approach requires developing an extensive framework
beforehand; however, this framework is interesting in itself and can be
used to tackle other similar problems.
In the last two decades, lineage-based models of diversification,
where species are viewed as particles that can divide (speciate) or die
(become extinct) at rates depending on some evolving trait, have been
very popular tools to study macroevolutionary processes. Here, we argue
that this approach cannot be used to break down the inner workings of
species diversification and that “opening the species box” is necessary
to understand the causes of macroevolution.
We set up a general framework for individual-based models of neutral
speciation (i.e. no selection forces other than those acting against
hybrids) that rely on a minimal number of mechanistic principles:
(i) reproductive isolation is caused by excessive dissimilarity
between pheno/genotypes; (ii) dissimilarity results from a balance
between differentiation processes and homogenization processes; and
(iii) dissimilarity can feed back on these processes by decelerating
homogenization.
We classify such models according to the main process responsible for
homogenization: (1) clonal evolution models (ecological drift), (2)
models of genetic isolation (gene flow) and (3) models of isolation by
distance (spatial drift). We review these models and their specific
predictions on macroscopic variables such as species abundances,
speciation rates, interfertility relationships, phylogenetic tree
structure…
We propose new avenues of research by displaying conceptual questions
remaining to be solved and new models to address them: the failure of
speciation at secondary contact, the feedback of dissimilarity on
homogenization, the emergence in space of reproductive barriers.
In a recent paper, the question of determining the fraction of binary trees that contain a fixed pattern known as the snowflake was posed. We show that this fraction goes to 1, providing two very different proofs: a purely combinatorial one that is quantitative and specific to this problem; and a proof using branching process techniques that is less explicit, but also much more general, as it applies to any fixed pattern and can be extended to other trees and networks. In particular, it follows immediately from our second proof that the fraction of d-ary trees (resp. level-k networks) that contain a fixed d-ary tree (resp. level-k network) tends to 1 as the number of leaves grows.
We introduce a biologically natural, mathematically tractable model of random phylogenetic networks to describe evolution in the presence of hybridization. One of the features of this model is that the hybridization rate of the lineages correlates negatively with their phylogenetic distance. We give formulas or characterizations for quantities of biological interest that make them straightforward to compute in practice. We show that the appropriately rescaled network, seen as a metric space, converges to the Brownian continuum random tree, and that the uniformly rooted network has a local weak limit, which we describe explicitly.
Cultural Transmission of Reproductive Success (CTRS) has been observed in many human populations as well as other animals. It consists of a positive correlation of non-genetic origin between the progeny size of parents and children. This correlation can result from various factors, such as the social influence of parents on their children, the increase of children's survival through allocare from uncles and aunts, or the transmission of resources. Here, we study the evolution of genomic diversity through time under CTRS. We show that CTRS has a double impact on population genetics: (1) effective population size decreases when CTRS starts, mimicking a population contraction, and increases back to its original value when CTRS stops; (2) coalescent tree topologies are distorted under CTRS, with higher imbalance and a higher number of polytomies. Under long-lasting CTRS, effective population size stabilises but the distortion of tree topology remains, which yields U-shaped Site Frequency Spectra (SFS) under constant population size. We show that this effect of CTRS biases SFS-based demographic inference. Considering that CTRS was detected in numerous human and animal populations worldwide, one should keep in mind that inferences of past population histories from genomic data can be biased by this cultural process.
Tree-child networks are a recently described class of directed acyclic graphs that have risen to prominence in phylogenetics (the study of evolutionary trees and networks). Although these networks have a number of attractive mathematical properties, many combinatorial questions concerning them remain intractable. In this paper, we show that endowing these networks with a biologically relevant ranking structure yields mathematically tractable objects, which we term ranked tree-child networks (RTCNs). We explain how to derive exact and explicit combinatorial results concerning the enumeration and generation of these networks. We also explore probabilistic questions concerning the properties of RTCNs when they are sampled uniformly at random. These questions include the lengths of random walks between the root and leaves (both from the root to the leaves and from a leaf to the root); the distribution of the number of cherries in the network; and sampling RTCNs conditional on displaying a given tree. We also formulate a conjecture regarding the scaling limit of the process that counts the number of lineages in the ancestry of a leaf. The main idea in this paper, namely using ranking as a way to achieve combinatorial tractability, may also extend to other classes of networks.
The familial structure of a population and the relatedness of its individuals are determined by its demography. There is, however, no general method to infer kinship directly from the life-cycle of a structured population. Yet this question is central to fields such as ecology, evolution and conservation, especially in contexts where there is a strong interdependence between familial structure and population dynamics. Here, we give a general formula to compute, from any matrix population model, the expected number of arbitrary kin (sisters, nieces, cousins, etc.) of a focal individual ego, structured by the class of ego and of its kin. Central to our approach are classic but little-used tools known as genealogical matrices, which we combine in a new way. Our method can be used to obtain both individual-based and population-wide metrics of kinship, as we illustrate. It also makes it possible to analyze the sensitivity of the kinship structure to the traits implemented in the model.
Measures of phylogenetic balance, such as the Colless and Sackin indices, play an important role in phylogenetics. Unfortunately, these indices are specifically designed for phylogenetic trees, and do not extend naturally to phylogenetic networks (which are increasingly used to describe reticulate evolution). This led us to consider a lesser-known balance index, whose definition is based on a probabilistic interpretation that is equally applicable to trees and to networks. This index, known as the B2 index, was first proposed by Shao and Sokal in 1990. Surprisingly, it does not seem to have been studied mathematically since. Likewise, it is used only sporadically in the biological literature, where it tends to be viewed as arcane and not very useful in practice – even though the evidence for this is scarce. In this paper, we study mathematical properties of B2 such as its distribution under the most common models of random trees and its range over various classes of phylogenetic networks. We also assess its relevance in biological applications, and find it to be comparable to that of the Colless and Sackin indices. Altogether, our results call for a reevaluation of the status of this somewhat forgotten measure of phylogenetic balance.
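The probabilistic interpretation mentioned above can be illustrated with a short sketch. The abstract does not spell out the definition, so the code below assumes the usual one: B2 is the Shannon entropy (base 2) of the leaf reached by a random walk that starts at the root and moves to a uniformly chosen child at each step. The function names and the example networks are hypothetical and serve only as illustration.

```python
from math import log2

def topological_order(children, root):
    """Reverse DFS post-order: every node appears before its descendants."""
    seen, post = set(), []
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for c in children.get(v, []):
            visit(c)
        post.append(v)
    visit(root)
    return post[::-1]

def b2_index(children, root):
    """Entropy (in bits) of the leaf hit by a uniform downward random walk.

    `children` maps each node to the list of its children; leaves may be
    absent from the mapping. Works for rooted trees and rooted DAGs alike.
    """
    order = topological_order(children, root)
    prob = {v: 0.0 for v in order}
    prob[root] = 1.0
    for v in order:
        kids = children.get(v, [])
        for c in kids:
            # The walk at v picks each child with equal probability.
            prob[c] += prob[v] / len(kids)
    leaves = [v for v in order if not children.get(v)]
    return -sum(prob[x] * log2(prob[x]) for x in leaves if prob[x] > 0)

# Balanced binary tree with 4 leaves: every leaf has probability 1/4.
balanced = {"r": ["a", "b"], "a": ["1", "2"], "b": ["3", "4"]}
# Caterpillar with 4 leaves: probabilities 1/2, 1/4, 1/8, 1/8.
caterpillar = {"r": ["1", "a"], "a": ["2", "b"], "b": ["3", "4"]}
# Small network with one reticulation h (two parents, a and b).
network = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"], "h": ["z"]}

for name, net in [("balanced", balanced), ("caterpillar", caterpillar),
                  ("network", network)]:
    print(name, b2_index(net, "r"))
```

Because the walk only needs a list of children per node, the same function applies unchanged to trees and to networks, which is the property that motivates the index.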
Starting from any graph on {1, … , n}, consider the Markov chain where at each time-step a uniformly chosen vertex is disconnected from all of its neighbors and reconnected to another uniformly chosen vertex. This Markov chain has a stationary distribution whose support is the set of non-empty forests on {1, … , n}. The random forest corresponding to this stationary distribution has interesting connections with the uniform rooted labeled tree and the uniform attachment tree. We fully characterize its degree distribution, the distribution of its number of trees, and the limit distribution of the size of a tree sampled uniformly. We also show that the size of the largest tree is asymptotically α log n, where α = (1 - log(e - 1))⁻¹ ≈ 2.18, and that the degree of the most connected vertex is asymptotically log n / log log n.
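The dynamics described above are simple enough to simulate directly. The following is a minimal sketch, not the authors' code: it assumes that "another uniformly chosen vertex" means a vertex different from the one being rewired, it labels vertices 0, …, n-1 rather than 1, …, n, and it starts from the empty graph for convenience. All function names are hypothetical.

```python
import random
from math import e, log

def rewire_step(adj, rng):
    """One step of the chain: isolate a uniform vertex, then attach it
    to a uniformly chosen other vertex (assumed distinct from it)."""
    n = len(adj)
    v = rng.randrange(n)
    for u in adj[v]:
        adj[u].discard(v)
    adj[v] = set()
    u = rng.randrange(n - 1)
    if u >= v:
        u += 1  # skip v so that u is uniform over the other n - 1 vertices
    adj[v].add(u)
    adj[u].add(v)

def simulate(n, steps, seed=None):
    """Run the chain from the empty graph; return adjacency sets."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    for _ in range(steps):
        rewire_step(adj, rng)
    return adj

def count_trees(adj):
    """Number of connected components, found by depth-first search."""
    seen, trees = set(), 0
    for s in range(len(adj)):
        if s in seen:
            continue
        trees += 1
        seen.add(s)
        stack = [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return trees

# Constant appearing in the asymptotics of the largest tree.
ALPHA = 1 / (1 - log(e - 1))

forest = simulate(200, 5000, seed=1)
print("trees:", count_trees(forest), "alpha:", round(ALPHA, 4))
```

Started from the empty graph, every state visited is a forest, since each step removes a vertex from its tree and re-attaches it as a leaf. The number of edges therefore always equals n minus the number of trees.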
Consider any fixed graph whose edges have been randomly and independently oriented, and write {S ⇝ i} to indicate that there is an oriented path going from a vertex s ∈ S to vertex i. Narayanan (2016) proved that for any set S and any two vertices i and j, {S ⇝ i} and {S ⇝ j} are positively correlated. His proof relies on the Ahlswede-Daykin inequality, a rather advanced tool of probabilistic combinatorics. In this short note, I give an elementary proof of the following, stronger result: writing V for the vertex set of the graph, for any source set S, the events {S ⇝ i}, i ∈ V, are positively associated – meaning that the expectation of the product of increasing functionals of the family {S ⇝ i} for i ∈ V is at least the product of their expectations.
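On small graphs, the pairwise positive correlation can be checked by exhaustive enumeration. The sketch below is only a numerical illustration of the statement, not a proof, and it assumes that each edge is oriented in either direction with probability 1/2. The function name and the example graphs are hypothetical.

```python
from fractions import Fraction
from itertools import product

def reach_probabilities(n, edges, sources):
    """Exact P(S ~> i) and P(S ~> i and S ~> j) over all 2^|E| orientations.

    Vertices are 0..n-1, `edges` is a list of pairs (u, v), and each edge is
    assumed to point either way with probability 1/2, independently.
    """
    total = 2 ** len(edges)
    single = [0] * n
    joint = [[0] * n for _ in range(n)]
    for bits in product((0, 1), repeat=len(edges)):
        # Build the out-neighbour lists of this orientation.
        out = [[] for _ in range(n)]
        for (u, v), b in zip(edges, bits):
            if b:
                out[u].append(v)
            else:
                out[v].append(u)
        # Depth-first search from the source set S.
        reached = set(sources)
        stack = list(sources)
        while stack:
            x = stack.pop()
            for y in out[x]:
                if y not in reached:
                    reached.add(y)
                    stack.append(y)
        for i in reached:
            single[i] += 1
            for j in reached:
                joint[i][j] += 1
    p = [Fraction(c, total) for c in single]
    q = [[Fraction(c, total) for c in row] for row in joint]
    return p, q

# A 4-cycle with one chord, with source set S = {0}.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
p, q = reach_probabilities(4, cycle_edges, {0})
print(all(q[i][j] >= p[i] * p[j] for i in range(4) for j in range(4)))
```

Exact rational arithmetic avoids any floating-point ambiguity in the comparison. The enumeration is exponential in the number of edges, so it is only practical for very small graphs.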
The mean age at which parents give birth is an important notion in demography, ecology, and evolution, where it is used as a measure of generation time. A standard way to quantify it is to compute the mean age of the parents of all offspring produced by a cohort, and the resulting measure is thought to represent the mean age at which a typical parent produces offspring. In this note, I explain why this interpretation is problematic. I also introduce a new measure of the mean age at reproduction and show that it can be very different from the mean age of parents of offspring of a cohort. In particular, the mean age of parents of offspring of a cohort systematically overestimates the mean age at reproduction and can even be greater than the expected life span of parents.
We introduce a new random graph model motivated by biological questions relating to speciation. This random graph is defined as the stationary distribution of a Markov chain on the space of graphs on {1, …, n}. The dynamics of this Markov chain is governed by two types of events: vertex duplication, where at constant rate a pair of vertices is sampled uniformly and one of these vertices loses its incident edges and is rewired to the other vertex and its neighbors; and edge removal, where each edge disappears at constant rate. Besides the number of vertices n, the model has a single parameter r_n. Using a coalescent approach, we obtain explicit formulas for the first moments of several graph invariants such as the number of edges or the number of complete subgraphs of order k. These are then used to identify five non-trivial regimes depending on the asymptotics of the parameter r_n. We derive an explicit expression for the degree distribution, and show that under appropriate rescaling it converges to classical distributions when the number of vertices goes to infinity. Finally, we give asymptotic bounds for the number of connected components, and show that in the sparse regime the number of edges is Poissonian.
Matrix projection models are a central tool in many areas of population biology. In most applications, one starts from the projection matrix to quantify the asymptotic growth rate of the population (the dominant eigenvalue), the stable stage distribution, and the reproductive values (the dominant right and left eigenvectors, respectively). Any primitive projection matrix also has an associated ergodic Markov chain that contains information about the genealogy of the population. In this paper, we show that these facts can be used to specify any matrix population model as a triple consisting of the ergodic Markov matrix, the dominant eigenvalue and one of the corresponding eigenvectors. This decomposition of the projection matrix separates properties associated with lineages from those associated with individuals. It also clarifies the relationships between many quantities commonly used to describe such models, including the relationship between eigenvalue sensitivities and elasticities. We illustrate the utility of such a decomposition by introducing a new method for aggregating classes in a matrix population model to produce a simpler model with a smaller number of classes. Unlike the standard method, our method has the advantage of preserving reproductive values and elasticities. It also has conceptually satisfying properties such as commuting with changes of units.
The generation time is commonly defined as the mean age of mothers at birth. In matrix population models, a general formula is available to compute this quantity. However, it is complex and hard to interpret. Here, we present a new approach where the generation time is envisioned as a return time in an appropriate Markov chain. This yields surprisingly simple results, such as the fact that the generation time is the inverse of the sum of the elasticities of the growth rate to changes in the fertilities. This result sheds new light on the interpretation of elasticities (which as we show correspond to the frequency of events in the ancestral lineage of the population), and we use it to generalize a result known as Lebreton's formula. Finally, we also show that the generation time can be seen as a random variable, and we give a general expression for its distribution.
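The identity between the generation time and the inverse of the summed fertility elasticities can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical 3-class Leslie matrix, computes the dominant eigenvalue and eigenvectors by plain power iteration, and compares the result with the classical mean age of mothers at birth in the stable population, counting the first class as age 1. All names and numerical values are made up for the example.

```python
def dominant_eigen(A, iterations=1000):
    """Power iteration: dominant eigenvalue and right eigenvector
    (normalised to sum to 1) of a primitive non-negative matrix."""
    n = len(A)
    w = [1.0 / n] * n
    lam = 1.0
    for _ in range(iterations):
        Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(Aw)  # equals the eigenvalue once w has converged
        w = [x / lam for x in Aw]
    return lam, w

def transpose(A):
    return [list(row) for row in zip(*A)]

def generation_time(A, F):
    """Inverse of the sum of elasticities of lambda to the fertilities.

    A is the projection matrix and F its fertility part; the elasticity
    to entry (i, j) is a_ij * v_i * w_j / (lambda * <v, w>).
    """
    n = len(A)
    lam, w = dominant_eigen(A)             # stable class distribution
    _, v = dominant_eigen(transpose(A))    # reproductive values
    vw = sum(vi * wi for vi, wi in zip(v, w))
    fert_elasticity = sum(F[i][j] * v[i] * w[j]
                          for i in range(n) for j in range(n)) / (lam * vw)
    return 1.0 / fert_elasticity

def mean_age_of_mothers(fertilities, survivals, lam):
    """Classical formula: sum over classes of age * f * l * lambda^(-age)."""
    T, l = 0.0, 1.0  # l is the survivorship to the current class
    for j, f in enumerate(fertilities):
        T += (j + 1) * f * l * lam ** -(j + 1)
        if j < len(survivals):
            l *= survivals[j]
    return T

# Hypothetical Leslie matrix: fertilities in the first row,
# survival probabilities on the subdiagonal.
fertilities = [0.0, 1.5, 2.0]
survivals = [0.5, 0.8]
F = [fertilities, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
A = [fertilities, [0.5, 0.0, 0.0], [0.0, 0.8, 0.0]]

lam, _ = dominant_eigen(A)
print(generation_time(A, F), mean_age_of_mothers(fertilities, survivals, lam))
```

The two printed values should agree up to numerical error. Power iteration is used here only to keep the example free of dependencies; a linear algebra library would be the natural choice for larger models.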