Software for computing different types of summary trees from MCMC samples
Bayesian phylogenetic tree inference is widely used, but publications usually show a single tree topology, and I did not really question how it is built until I needed to make one. There are several ways to do it, each resting on different statistical grounds.
Here are the main concepts and which programs implement them.
- Basic concepts
- Majority Rule tree (Extended)
- Maximum A Posteriori
- Maximum Clade Credibility
- Median tree / Minimum distance tree
- Total Clade Branch (TCB)
- Agreement subtrees
- Quartets
- References
Basic concepts
We want to find a single tree that best “represents” a distribution of trees as produced by an MCMC sample. All trees are labelled with the same leaf names.
What “represents” means here depends on your aim: do you want a tree that is fully bifurcating but includes low probability branches, or on the contrary do you prefer an unresolved but cautious tree?
In order to decompose all trees and build the consensus, a tree must be seen as a collection of “splits”, or bipartitions of the set of leaves. Each “split” corresponds to a branch joining both complementary subsets. The frequencies of each split, i.e. the branch supports, are easily measured from the sample.
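To make the decomposition concrete, here is a minimal Python sketch. It assumes trees are given as nested tuples of leaf names with no branch lengths, which is a representation chosen for illustration and not the format of any program listed in this post. The function names `leaves`, `splits` and `split_frequencies` are likewise hypothetical.

```python
from collections import Counter

def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Return the non-trivial splits of the unrooted view of `tree`.

    Each split is stored as the side that does NOT contain the
    alphabetically first leaf, so both orientations share one key.
    """
    everything = leaves(tree)
    ref = min(everything)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaves(node)
        if ref in side:
            side = everything - side
        # Skip trivial splits (empty side, or one leaf versus the rest).
        if 1 < len(side) < len(everything) - 1:
            found.add(side)
        stack.extend(node)
    return found

def split_frequencies(trees):
    """Fraction of sampled trees that contain each split."""
    counts = Counter(s for t in trees for s in splits(t))
    return {s: n / len(trees) for s, n in counts.items()}

# Toy sample of three trees on five leaves.
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
freqs = split_frequencies([t1, t1, t2])
for split, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(sorted(split), round(f, 2))
```

Storing only the side without the reference leaf is a design choice that makes a split a single hashable key, so counting reduces to a `Counter`.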
Majority Rule tree (Extended)
Here the majority rule applies to the split frequencies, meaning we build the consensus tree by combining the most frequent splits.
Variations:
- Strict: output a partially resolved tree with all splits having >50% support
- Extended: also include other splits in order of decreasing support, if compatible with previously included splits (Nelson 1979, Page 1989).
This is probably the most robust method, as it is the median tree with respect to the Robinson-Foulds distance (the tree whose average Robinson-Foulds distance to all sampled trees is smallest; Barthélemy & McMorris, 1986).
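The two variations can be sketched in a few lines of Python. This is only an illustration under assumptions: splits are represented as frozensets holding the side that does not contain a fixed reference leaf, the frequencies are made-up numbers, and `compatible` and `majority_rule` are hypothetical names, not functions from any package mentioned here.

```python
def compatible(a, b):
    """Two splits (each the side without the reference leaf) can coexist
    in one tree if they are nested or disjoint."""
    return a.isdisjoint(b) or a <= b or b <= a

def majority_rule(freqs, extended=False):
    """Select splits with >50% support; if `extended`, greedily add the
    remaining splits in order of decreasing support when compatible."""
    chosen = []
    # Ties are broken by leaf names so the result is deterministic.
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for split, f in ordered:
        if f > 0.5:
            chosen.append(split)
        elif extended and all(compatible(split, c) for c in chosen):
            chosen.append(split)
    return chosen

# Made-up split frequencies on leaves A..E, with A as the reference leaf.
example = {
    frozenset("DE"): 0.90,
    frozenset("CDE"): 0.40,
    frozenset("BDE"): 0.35,
    frozenset("BC"): 0.10,
}
print([sorted(s) for s in majority_rule(example)])
print([sorted(s) for s in majority_rule(example, extended=True)])
```

Splits above 50% never conflict with each other, which is why the compatibility check is only needed in the extended pass.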
Software
With branch length estimates:
- Phylobayes `bpcomp`. Extended: `-c 0`. Branch lengths are the mean over the trees having this branch.
- DendroPy `SumTrees` (in Ubuntu repositories and PyPI). Extended: `-f 0`. DendroPy is very slow.
- I implemented strict MR with branch lengths in Genarium's command `MRtree`.
Without branch lengths:
- Phylip `consense`. It is actually a strict consensus tree, i.e. with splits found in all trees.
- R `ape::consensus`
- R `phangorn::allCompat`
- `iqtree -con -t <input set of trees> -minsup 0.5 -bi <burnin>`
- RAxML: `-z <input set of trees> -J MR` or `-J MRE` (extended). Alternatively use `-L` for IC supports. Note: RAxML requires the argument `-m MODEL`, like `-m PROTGAMMALG`, even if unused.
- MrBayes `allcompat`.
I wrapped the R functions above in a script, so that I just have to call, for example, `./summarytree.R MRE treelist` for majority-rule extended.
Maximum A Posteriori
The most sampled tree topology (averaged over all branch lengths).
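Counting topologies only requires a canonical form that ignores child order and branch lengths. Here is a minimal sketch, assuming rooted trees stored as nested tuples of leaf names (an illustrative representation) and using the hypothetical helper names `canonical` and `map_topology`.

```python
from collections import Counter

def canonical(tree):
    """Order-independent key for a rooted topology given as nested tuples."""
    if isinstance(tree, str):
        return tree
    # Sort children by their repr, since strings and tuples do not compare.
    return tuple(sorted((canonical(c) for c in tree), key=repr))

def map_topology(trees):
    """Return the most frequently sampled topology and its sample frequency."""
    counts = Counter(canonical(t) for t in trees)
    topology, n = counts.most_common(1)[0]
    return topology, n / len(trees)

# The first two trees are the same topology written in a different order.
sample = [(("A", "B"), "C"), ("C", ("B", "A")), (("A", "C"), "B")]
best, freq = map_topology(sample)
print(best, round(freq, 2))
```

With many taxa, most topologies tend to be sampled only once, so this frequency estimate is only meaningful when the posterior is concentrated on few trees.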
Maximum Clade Credibility
The tree with the maximum product of posterior clade probabilities. Note that split probabilities are not independent, so this product is not a true posterior probability of the tree.
- DendroPy `SumTrees` (in Ubuntu repositories and PyPI). Summarizes branch lengths.
- MrBayes `Sumt`
- R `phangorn::maxCladeCred`
- TreeAnnotator (BEAST). However, the selected tree is one that was found in the sample.
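As a rough sketch of the scoring, the snippet below computes clade frequencies from a sample of rooted trees and returns the sampled tree with the highest product. It assumes nested tuples of leaf names and restricts candidates to sampled trees, in line with the TreeAnnotator remark above. It is not the code of any of these programs, and `clades` and `max_clade_credibility` are hypothetical names.

```python
import math
from collections import Counter

def clades(tree, out=None):
    """Return (leaf set of `tree`, set of clades of all its internal nodes)."""
    if out is None:
        out = set()
    if isinstance(tree, str):
        return frozenset([tree]), out
    members = frozenset().union(*(clades(c, out)[0] for c in tree))
    out.add(members)
    return members, out

def max_clade_credibility(trees):
    """Pick the sampled tree maximizing the product of clade frequencies."""
    clade_sets = [clades(t)[1] for t in trees]
    counts = Counter(c for cs in clade_sets for c in cs)
    n = len(trees)

    def log_score(cs):
        # Sum of logs avoids underflow when there are many clades.
        return sum(math.log(counts[c] / n) for c in cs)

    best = max(range(n), key=lambda i: log_score(clade_sets[i]))
    return trees[best], math.exp(log_score(clade_sets[best]))

t1 = ((("A", "B"), "C"), "D")
t2 = (("A", "B"), ("C", "D"))
t3 = ((("A", "C"), "B"), "D")
mcc_tree, mcc_score = max_clade_credibility([t1, t2, t3])
print(mcc_tree, round(mcc_score, 3))
```

In this toy sample, clades AB and ABC each appear in two of three trees, so `t1` scores (2/3) × (2/3) while the other two score (2/3) × (1/3).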
95% Credible Set
Smallest set of tree topologies that accounts for 95% of the posterior probability. Only useful for very well resolved datasets.
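The set can be built by ranking topologies by sample frequency and accumulating until the threshold is reached. Here is a minimal sketch, again assuming rooted nested-tuple trees without branch lengths; `canonical` and `credible_set` are hypothetical names.

```python
from collections import Counter

def canonical(tree):
    """Order-independent key for a rooted topology given as nested tuples."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(c) for c in tree), key=repr))

def credible_set(trees, level=0.95):
    """Most frequent topologies whose cumulative frequency reaches `level`."""
    counts = Counter(canonical(t) for t in trees)
    total = len(trees)
    selected, accumulated = [], 0
    for topology, n in counts.most_common():
        selected.append(topology)
        accumulated += n  # integer counts avoid float accumulation error
        if accumulated >= level * total:
            break
    return selected, accumulated / total

x = (("A", "B"), "C")
y = (("A", "C"), "B")
z = (("B", "C"), "A")
sample = [x] * 6 + [y] * 3 + [z]
topologies, coverage = credible_set(sample)
print(len(topologies), coverage)
```

If the set ends up containing nearly as many topologies as there are samples, the posterior is too diffuse for this summary to be informative.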
Maximum CCP
CCP approximation: an approximation of the posterior probability of a tree as the product of the Conditional Clade Probabilities, which are measured from the sample of trees.
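To illustrate the product, the sketch below estimates conditional clade probabilities from a sample of rooted trees and scores a given tree. It assumes nested tuples of leaf names and only scores a tree; it does not search for the optimal one. The names `clade_partitions`, `ccp_model` and `ccp_probability` are hypothetical and unrelated to the Genarium or ALE implementations.

```python
import math
from collections import Counter

def clade_partitions(tree, out):
    """Append (parent clade, set of child clades) for every internal node.

    Returns the leaf set of `tree`.
    """
    if isinstance(tree, str):
        return frozenset([tree])
    children = frozenset(clade_partitions(c, out) for c in tree)
    parent = frozenset().union(*children)
    out.append((parent, children))
    return parent

def ccp_model(trees):
    """Count clades and the ways each clade is partitioned in the sample."""
    clade_counts, partition_counts = Counter(), Counter()
    for t in trees:
        parts = []
        clade_partitions(t, parts)
        for parent, children in parts:
            clade_counts[parent] += 1
            partition_counts[(parent, children)] += 1
    return clade_counts, partition_counts

def ccp_probability(tree, clade_counts, partition_counts):
    """Product over internal nodes of count(partition) / count(parent clade)."""
    parts = []
    clade_partitions(tree, parts)
    log_p = 0.0
    for parent, children in parts:
        seen = partition_counts.get((parent, children), 0)
        if seen == 0:
            return 0.0  # this resolution of the clade was never sampled
        log_p += math.log(seen / clade_counts[parent])
    return math.exp(log_p)

t1 = ((("A", "B"), "C"), "D")
t2 = (("A", "B"), ("C", "D"))
t3 = ((("A", "C"), "B"), "D")
clade_counts, partition_counts = ccp_model([t1, t2, t3])
print(ccp_probability(t1, clade_counts, partition_counts))
```

Because each factor is conditional on the parent clade, a tree that was never sampled as a whole can still receive a non-zero score when all of its local resolutions were seen somewhere in the sample.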
Observation: if the tree is very unresolved, the result will be very sensitive to sampling (in my trials, even keeping only every second tree changed the topology). Also, Heled & Bouckaert (2013) surprisingly find that the CCD method is quite bad at finding topologies.
Software:
- I implemented it in Genarium (command `CCP`);
- in theory, EcceTERA with the proper weights (but my attempt failed).
- ALE. At first I thought it would be possible to score a single tree with `CCPscore alefile treefile` (my attempts crashed). To obtain the optimal tree, the following patch of `ALEobserve` can be used: https://github.com/ssolo/ALE/issues/39
Median tree / Minimum distance tree
A distance metric in tree space must first be defined. Possible tree distances, as used in Heled & Bouckaert (2013):
- Robinson-Foulds. As noted above, the median tree for this distance is the Majority Rule tree.
- Rooted Branch Score (RBS) and Squared RBS.
- Height Score
- Rooted Agreement Score (RAS)
Software:
- R `phytools::averageTree`
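For the Robinson-Foulds case, a minimum-distance tree can be sketched directly from split sets. The example below assumes nested-tuple trees without branch lengths and, as a simplification, only considers sampled trees as candidates. It is not how `phytools::averageTree` works, and `leaves`, `splits`, `rf_distance` and `median_tree` are hypothetical names.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(c) for c in tree))

def splits(tree):
    """Non-trivial splits, each stored as the side without the first leaf."""
    everything = leaves(tree)
    ref = min(everything)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaves(node)
        if ref in side:
            side = everything - side
        if 1 < len(side) < len(everything) - 1:
            found.add(side)
        stack.extend(node)
    return found

def rf_distance(t1, t2):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits(t1) ^ splits(t2))

def median_tree(trees):
    """Sampled tree with the smallest mean RF distance to the whole sample."""
    split_sets = [splits(t) for t in trees]

    def total(i):
        return sum(len(split_sets[i] ^ s) for s in split_sets)

    best = min(range(len(trees)), key=total)
    return trees[best], total(best) / len(trees)

t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
tree, mean_rf = median_tree([t1, t1, t2])
print(tree, round(mean_rf, 2))
```

Swapping `rf_distance` for another metric from the list above would give the corresponding minimum-distance tree, at a cost quadratic in the number of sampled trees.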
Total Clade Branch (TCB)
Tree that maximizes the total length of matching branches in the posterior. In addition to the frequency, using the length is justified by the idea that longer branches are more likely to represent real branches.
Agreement subtrees
- MAST: Maximum Agreement Subtree (Finden & Gordon 1985): remove taxa until all trees agree.
- Generalized to include all taxa: Cranston & Rannala 2007
- PAUP
- RogueNaRok, Aberer et al. 2013. I think it might now be included in RAxML.
Quartets
Quartet-based methods produce not only consensus trees, but also generalize to sets of trees with different taxa, i.e. supertrees.
Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology
References
- https://beast.community/summarizing_trees
- Bryant 2003, A classification of consensus methods for phylogenetics
- Heled & Bouckaert 2013, Looking for trees in the forest: summary tree from posterior samples. This study evaluates the accuracy of different summary strategies.