Guillaume Louvel

Evolutionary genomics & bioinformatics

Software for computing different types of summary trees from MCMC samples

Bayesian phylogenetic tree inference is widely used, but publications usually show a single tree topology. I did not really question how that tree is built until I needed to make one. There are several ways to do it, each resting on different statistical grounds.

Here are the main concepts and which programs implement them.

Basic concepts

We want to find a single tree that best “represents” a distribution of trees, as produced by an MCMC sample. All trees are labelled with the same leaf names.

What “represents” means here depends on your aim: do you want a tree that is fully bifurcating but includes low-probability branches, or do you prefer an unresolved but cautious tree?

In order to decompose the trees and build the consensus, each tree must be seen as a collection of “splits”, or bipartitions of the set of leaves. Each split corresponds to a branch joining the two complementary subsets. The frequency of each split, i.e. its branch support, is easily measured from the sample.
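To make the idea of splits concrete, here is a minimal sketch that is not taken from any of the programs listed below. It assumes trees are written as nested tuples of leaf names (a representation chosen only for this illustration), and the function names are hypothetical. Each split is stored as the side of the bipartition that does not contain the alphabetically first leaf, so that the same branch always gets the same key.

```python
from collections import Counter

def leaf_set(tree):
    """Return the frozenset of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Return the non-trivial bipartitions of a tree.

    Each split is represented by the side that lacks the reference leaf
    (the alphabetically first one)."""
    leaves = leaf_set(tree)
    reference = min(leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found

def split_frequencies(trees):
    """Fraction of sampled trees that contain each split."""
    counts = Counter(s for t in trees for s in splits(t))
    return {s: n / len(trees) for s, n in counts.items()}

# Toy sample of three trees on four leaves (illustrative data only).
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
freqs = split_frequencies(sample)
for split, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(sorted(split), round(freq, 2))
```

With four leaves each tree has a single internal branch, so the frequencies here are also the topology frequencies; with more leaves the two diverge.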

Majority Rule tree (Extended)

Here the majority rule applies to the split frequencies, meaning we build the consensus tree by combining the most frequent splits.

Variations:

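  • Strict: output a partially resolved tree containing only the splits with >50% support (this is the plain majority rule, not the strict consensus, which keeps only splits found in all trees).
  • Extended: also include the remaining splits in order of decreasing support, as long as they are compatible with the splits already included (Nelson 1979, Page 1989).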
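The two variations can be sketched as a greedy selection over split supports. This is only an illustration under stated assumptions: splits are given as frozensets holding the side of the bipartition that lacks a common reference leaf (here "A"), the support values are made up, and the function names are hypothetical. Under that representation, two splits fit in the same tree when their sides are nested or disjoint.

```python
def compatible(a, b):
    """Two splits (sides lacking the same reference leaf) can coexist in one
    tree if one side contains the other or if the sides do not overlap."""
    return a <= b or b <= a or not (a & b)

def majority_rule(support, extended=False):
    """Select the splits of the consensus from a {split: frequency} mapping."""
    ranked = sorted(support.items(), key=lambda item: -item[1])
    # Splits with more than 50% support are always mutually compatible.
    chosen = [s for s, f in ranked if f > 0.5]
    if extended:
        # Add the remaining splits by decreasing support when they fit.
        for s, f in ranked:
            if f <= 0.5 and all(compatible(s, c) for c in chosen):
                chosen.append(s)
    return chosen

# Illustrative supports for leaves A-E, reference leaf "A" (made-up values).
support = {
    frozenset("DE"): 0.90,
    frozenset("CDE"): 0.45,
    frozenset("BC"): 0.40,
    frozenset("BE"): 0.15,
}
print([sorted(s) for s in majority_rule(support)])
print([sorted(s) for s in majority_rule(support, extended=True)])
```

In this toy case the plain majority rule keeps one split and leaves the rest of the tree unresolved, while the extended rule adds the best-supported compatible split and rejects the two that conflict with it.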

This is probably the most robust method, as the majority-rule tree is the median tree with respect to the Robinson-Foulds distance (the tree whose average Robinson-Foulds distance to all sampled trees is smallest; Barthélemy & McMorris, 1986).

Software

With branch length estimates:
  • Phylobayes bpcomp. Extended: -c 0. Branch lengths are the mean over the trees having this branch.
  • DendroPy SumTrees (in Ubuntu repositories and PyPI). Extended: -f 0. DendroPy is very slow.
  • I implemented strict MR with branch lengths in Genarium's command MRtree.
Without branch lengths:
  • Phylip consense. Its menu offers several consensus types, including the strict consensus (only splits found in all trees) and the majority rule, plain or extended.
  • R ape::consensus
  • R phangorn::allCompat
  • IQ-TREE:

    iqtree -con -t <input set of trees> -minsup 0.5 -bi <burnin>
    
  • RAxML:

    -z <input set of trees> -J MR
    

or -J MRE (extended). Alternatively, use -L for IC supports. Note: RAxML requires the argument -m MODEL, e.g. -m PROTGAMMALG, even if the model is unused.

  • MrBayes, with the allcompat consensus type (sumt contype=allcompat).

I wrapped the R functions above in a script, so that I can simply call, for example, ./summarytree.R MRE treelist for the extended majority rule.

Maximum A Posteriori

The most frequently sampled tree topology (counting trees with the same topology together, whatever their branch lengths).
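As a minimal sketch, finding the most sampled topology amounts to counting topologies after putting each tree in a canonical form. The nested-tuple representation of rooted trees and the function names below are assumptions made for this illustration only.

```python
from collections import Counter

def canonical(tree):
    """Return a hashable form of a rooted topology that ignores child order."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def map_topology(trees):
    """Return the most frequent topology and its sample frequency."""
    counts = Counter(canonical(t) for t in trees)
    topology, n = counts.most_common(1)[0]
    return topology, n / len(trees)

# Toy sample (illustrative data only).
sample = [
    (("A", "B"), ("C", "D")),
    (("D", "C"), ("B", "A")),  # same topology, children written in another order
    (("A", "C"), ("B", "D")),
]
best, frequency = map_topology(sample)
print(best, round(frequency, 2))
```

When the posterior is diffuse, every topology may be sampled only once, in which case this estimate is not informative.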

Maximum Clade Credibility

The tree with the maximum product of posterior clade probabilities. Note that split probabilities are not independent, so this product is not a true posterior probability.
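A minimal sketch of this score, restricted to choosing among the sampled trees, could look as follows. Rooted trees are assumed to be nested tuples of leaf names and the function names are hypothetical; the product is computed as a sum of logarithms to avoid underflow on large trees.

```python
import math
from collections import Counter

def clades(tree):
    """Return the clades (frozensets of leaves) of a rooted nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        found.add(clade)
        return clade

    visit(tree)
    return found

def max_clade_credibility(trees):
    """Return the sampled tree with the highest product of clade frequencies."""
    counts = Counter(c for t in trees for c in clades(t))

    def log_score(tree):
        return sum(math.log(counts[c] / len(trees)) for c in clades(tree))

    return max(trees, key=log_score)

# Toy sample (illustrative data only).
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = (("A", "B"), ("C", "D"))
sample = [t1, t1, t2, t3]
print(max_clade_credibility(sample))
```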

95% Credible Set

The smallest set of tree topologies that accounts for 95% of the posterior probability. Useful for very well resolved datasets.
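Given topology counts from the sample, the credible set can be sketched as a cumulative sum over topologies ranked by frequency. The counts below are made up for illustration and the function name is hypothetical.

```python
def credible_set(topology_counts, level=0.95):
    """Return the most frequent topologies whose cumulative frequency
    reaches the requested level."""
    total = sum(topology_counts.values())
    ranked = sorted(topology_counts.items(), key=lambda item: -item[1])
    chosen, cumulative = [], 0
    for topology, n in ranked:
        if cumulative >= level * total:
            break
        chosen.append(topology)
        cumulative += n
    return chosen

# Illustrative counts of four topologies among 1000 sampled trees.
counts = {
    "((A,B),(C,D))": 700,
    "((A,C),(B,D))": 200,
    "((A,D),(B,C))": 60,
    "(((A,B),C),D)": 40,
}
print(credible_set(counts))
```

If the sample contains thousands of distinct topologies, the set becomes too large to be a useful summary, which is why this only suits well resolved datasets.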

Maximum CCP

CCP approximation: the posterior probability of a tree is approximated by the product of its Conditional Clade Probabilities, which are measured from the sample of trees.
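The following sketch shows how such a product can be computed for one candidate tree. It is not the Genarium or ALE implementation: rooted trees are assumed to be nested tuples of leaf names, the function names are hypothetical, and the conditional probability of a clade subdivision is taken to be the number of times that subdivision was sampled divided by the number of times the parent clade was sampled. It only scores a given tree; searching for the tree with the maximum CCP would need an additional recursion over clades.

```python
from collections import Counter

def clade_splits(tree):
    """List (parent clade, set of child clades) for every internal node."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        children = frozenset(visit(child) for child in node)
        parent = frozenset().union(*children)
        found.append((parent, children))
        return parent

    visit(tree)
    return found

def ccp_table(trees):
    """Count how often each clade and each clade subdivision is sampled."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in trees:
        for parent, children in clade_splits(tree):
            clade_counts[parent] += 1
            split_counts[parent, children] += 1
    return clade_counts, split_counts

def ccp_probability(tree, clade_counts, split_counts):
    """Product of conditional clade probabilities of a tree."""
    p = 1.0
    for parent, children in clade_splits(tree):
        if clade_counts[parent] == 0:
            return 0.0  # clade never sampled
        p *= split_counts[parent, children] / clade_counts[parent]
    return p

# Toy sample (illustrative data only).
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
t3 = (("A", "B"), ("C", "D"))
clade_counts, split_counts = ccp_table([t1, t1, t2, t3])
for tree in (t1, t2, t3):
    print(tree, ccp_probability(tree, clade_counts, split_counts))
```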

Observation: if the tree is very unresolved, the result will be very sensitive to sampling (even keeping only every second tree changed the topology in my trials). Also, Heled & Bouckaert (2013) [1] surprisingly find that the CCD method is quite bad at finding topologies.

Software:

  • I implemented it in Genarium (command CCP);
  • in theory, EcceTERA with the proper weights (but my attempt failed).
  • ALE. At first I thought it would be possible to score a single tree with: CCPscore alefile treefile (my attempts crashed). To obtain the optimal tree, the following patch of ALEobserve can be used: https://github.com/ssolo/ALE/issues/39

Median tree / Minimum distance tree

A distance metric in tree space must first be defined. Possible tree distances, as used in Heled & Bouckaert 2013 [1]:

  • Robinson-Foulds. As noted above, the median tree under this distance is the Majority Rule tree.
  • Rooted Branch Score (RBS) and Squared RBS.
  • Height Score
  • Rooted Agreement Score (RAS)
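As an illustration of the first distance in this list, here is a minimal sketch of the Robinson-Foulds distance, counted as the number of splits present in one tree but not in the other. It assumes trees are nested tuples of leaf names over the same leaf set, and the function names are hypothetical. The second function restricts the search for a median to the sampled trees themselves, which is a simplification: the true median tree may not be in the sample.

```python
def splits(tree):
    """Non-trivial bipartitions of a tree, each stored as the side that
    lacks the alphabetically first leaf."""
    def leaves_of(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves_of(child) for child in node))

    leaves = leaves_of(tree)
    reference = min(leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = leaves - clade if reference in clade else clade
        if 1 < len(side) < len(leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found

def robinson_foulds(tree1, tree2):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree1) ^ splits(tree2))

def sample_median(trees):
    """Sampled tree with the smallest total distance to the whole sample."""
    return min(trees, key=lambda t: sum(robinson_foulds(t, u) for u in trees))

# Toy sample (illustrative data only).
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = ((("A", "B"), "D"), ("C", "E"))
print(robinson_foulds(t1, t2), robinson_foulds(t2, t3))
print(sample_median([t1, t2, t3]))
```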

Software:

  • R phytools::averageTree

Total Clade Branch (TCB)

Tree that maximizes the total length of matching branches in the posterior. In addition to the frequency, using the length is justified by the idea that longer branches are more likely to represent real branches.

Agreement subtrees

  • MAST: Maximum Agreement Subtree (Finden & Gordon 1985): remove taxa until all trees agree.
  • Generalized to include all taxa: Cranston & Rannala 2007

Software:

  • PAUP

  • RogueNaRok, Aberer et al. 2013. I think it might now be included in RAxML.

Quartets

Quartet-based methods can build not only consensus trees but also, more generally, supertrees, i.e. summaries of sets of trees with different taxa.

Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology.
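To illustrate what comparing trees through subtrees of four leaves means, here is a minimal sketch that counts the quartets on which two trees disagree. It is not taken from the reference above: trees are assumed to be nested tuples over the same leaf names, and the function names are hypothetical. A quartet is resolved as ab|cd when some branch of the tree has exactly two of the four leaves on one side.

```python
from itertools import combinations

def clades(tree):
    """Return the clades (frozensets of leaves) of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        found.add(clade)
        return clade

    visit(tree)
    return found

def quartet_topology(tree_clades, quartet):
    """Return the pairing induced on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for clade in tree_clades:
        inside = q & clade
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None

def quartet_distance(tree1, tree2):
    """Number of four-leaf subsets resolved differently by the two trees."""
    c1, c2 = clades(tree1), clades(tree2)
    leaves = sorted(max(c1, key=len))  # the largest clade holds all leaves
    return sum(
        quartet_topology(c1, q) != quartet_topology(c2, q)
        for q in combinations(leaves, 4)
    )

# Toy trees (illustrative data only).
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(quartet_distance(t1, t2))
```

Because each quartet only involves four leaves, the same comparison still makes sense for trees that share only part of their taxa, which is what makes the supertree generalization possible.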

References


  1. Heled & Bouckaert 2013, Looking for trees in the forest: summary tree from posterior samples. This study evaluates the accuracy of different summary strategies.