In this series of posts, I want to put some of my or my collaborators’ results in combinatorics of trees in a concrete form. I want to show how mathematical theorems sometimes can be translated almost verbatim into Haskell code. In the first part, I will discuss the counting results about coalescent histories.
For the necessary background to this post, please check out a small introduction into phylogenetic trees. In that post, I have defined species and gene trees as focal objects of modern phylogenetics. The essence of it is that a species tree describes the natural history of speciation events for a group of species, and a gene tree represents the relationship between genetic lineages sampled from these species. So, the two trees are closely related, but at the same time they can have different shapes (topologies).
A scientist would immediately pose an inference problem: “Can we learn about the species tree given a set of gene trees build from different chunks of DNA?”. That is a very valid question, and dozens of people are still working on in. A mathematician like me, however, wants to ask a different question: “Can we characterize the relationship between the two trees?”
In my opinion, that’s an amazing question because it’s very easy to formalize and prove theorems about. So, let me give you one way to think rigorously about this situation. What is the thing we care about in gene trees? It’s the coalescences, i.e. the nodes of the tree, since these are the events we can “observe” from our data (sample). They are discrete events that happen in the timespan of one generation. In short, in gene trees, we care about the nodes.
No one really observes real speciation events, except for some small-scale experiments. An event is usually declared a speciation only after the fact, when the two new species have sufficiently diverged from each other. So, in species trees, we care about the edges.
Now, our collective brain suddenly clicks: coalescences of the gene tree happen inside branches of the species tree. This looks like a function! This leads us to the main definition of the post.
Definition. Consider a pair of trees \((G,S)\) that are binary, rooted, and leaf-labeled, with the labels in bijective correspondence. Let \(G\) be the gene tree and \(S\) be the species tree. A coalescent history is a function \(\alpha\) from the set of internal nodes of \(G\) to the set of internal edges of \(S\), satisfying two conditions:
For each internal node \(v\) in \(G\), all labels for leaves descended from \(v\) in \(G\) label leaves descended from edge \(\alpha(V)\) in \(S\).
For each pair of internal nodes \(v_1\) and \(v_2\) in \(G\), if \(v_2\) is descended from \(v_1\), then \(\alpha(v_2)\) is descended from \(\alpha(v_1)\) in \(S\).
The conditions in this definition simply ensure that the tree “structure” is preserved: for example, ancestors cannot be mapped to an edge below their descendants. It is worth to take a minute to wrap your head around this definition. Here is a picture to help with it.
Matching case
How many coalescent histories are there? We are going to answer this counting question. Thankfully, all math has already been done for us by Professor Rosenberg. Let’s see what he says in his paper titled “Counting coalescent histories”.
Theorem. (Theorem 3.1, Rosenberg (2007)) For \(m \geq 1\) , the number of valid \(m\)-extended coalescent histories for a tree \(S\) with \(n \geq 2\) taxa is
\[A_{S,m} = \sum_{k=1}^m A_{S_L, k+1} A_{S_R,k+1},\]
where \(S_L\) and \(S_R\) respectively denote the left and right subtrees of \(S\). If \(S\) has \(n = 1\) taxon, then \(A_{S,m} = 1\) for all \(m\).
Okay, let’s figure it all out: what are \(m\)-extended coalescent histories? These are just the regular coalescent histories, but we subdivide the root of the species \(S\) artificially into \(m\) different edges. You can actually see this in Figure 1 above.
Okay, but why do we only have one tree here? This is actually the result for the case when the species tree and gene tree do have the same topology. I will cover the general result below.
Okay, okay, so the answer is just a recursive function? Yes, exactly! Trees are by their nature recursive structures, so it’s no surprise we got this answer. The base case is when the tree has a single leaf edge, no matter how many branches we have subdivided it into, and the big equation defines the recursive case.
Great, so what are the values of this function? Let’s find that out! We are finally ready to code this function.
The final result of this section is available in the companion
repository: github.com/EgorLappo/tree_combinatorics_functions.
It includes a small parser, allowing conversion between Haskell trees
and string representations with parseString
, and a
rudimentary pretty-printing of trees in the form of a custom
Show
instance.
Let’s begin. Since I am using Haskell, I will start with defining our types. Here it is:
data Tree a = Leaf a | Node a (Tree a) (Tree a) deriving Eq
There are many helper functions that we may need in the future, and all of them are located in this file. I won’t reproduce all of them here, but here is an example function:
-- get all leaf labels
leafLabels :: Tree a -> [a]
Leaf a) = [a]
leafLabels (Node _ l r) = leafLabels l ++ leafLabels r leafLabels (
The theorem I am implementing is actually talking about extended trees that have a subdivided root edge. Here is the corresponding definition.
-- data type for m-extended trees
-- an extra `Int` records the subdivision of the branch
data ExtTree a = Leaf Int a | Node Int a (ExtTree a) (ExtTree a)
fromTree :: Tree a -> ExtTree a
T.Leaf a) = Leaf 1 a -- kind of unnecessary in practice...
fromTree (T.Node a l r) = Node 1 a (fromTree l) (fromTree r)
fromTree (
toTree :: ExtTree a -> Tree a
Leaf _ a) = T.Leaf a
toTree (Node _ a l r) = T.Node a (toTree l) (toTree r)
toTree (
labelExt :: ExtTree a -> a
Leaf _ a) = a
labelExt (Node _ a _ _) = a
labelExt (
relabelExt :: Int -> ExtTree a -> ExtTree a
Leaf _ a) = Leaf k a
relabelExt k (Node _ a l r) = Node k a l r relabelExt k (
I have defined a new type of trees, ExtTree
, that keep
information about branch subdivisions as an integer. There is also a
pair of natural transformations between the regular
Tree
s and ExtTree
s, as well as very important
helper functions that work with numberings and labels of the tree
root.
With this small bit of preparation, the theorem takes just a few lines:
-- helper to match the type
countMatchingCoalescentHistories :: Tree a -> Int
=
countMatchingCoalescentHistories . fromTree
countMatchingCoalescentHistories'
-- Rosenberg (2007), Thm 3.1
countMatchingCoalescentHistories' :: ExtTree a -> Int
Leaf _ _) = 1
countMatchingCoalescentHistories' (Node m _ l r) = (sum . zipWith (*)) aL aR
countMatchingCoalescentHistories' (where sLs = [relabelExt (k+1) l | k <- [1..m]]
= [relabelExt (k+1) r | k <- [1..m]]
sRs = map countMatchingCoalescentHistories' sLs
aL = map countMatchingCoalescentHistories' sRs aR
If you know Haskell syntax, this function should be barely different
from the plain-text statement of the theorem. If the tree is a
Leaf
(with any label and subdivision), the result is just
\(1\). If we have a node instead, we
compute the recursive cases and combine them. The
sum . zipWith (*)
function is responsible for the
aggregation of the values computed on subtrees. The values recursively
computes values are aR
and aL
. They are
computed by applying our function to left and right subtrees that have
been extended by k+1
, just as in the statement of
the theorem. Note that by using fromTree
, we make sure that
we count \(1\)-extended coalescent
histories when we call countMatchingCoalescentHistories
,
which is exactly what we were after.
Ok, great! This was simple and easy. Let’s use it to generate some numbers!
> let trees = map parseString ["(a,b)", "((a,b),c)", "(((a,b),c),d)", "((((a,b),c),d),e)"]
ghci
> map countMatchingCoalescentHistories trees
ghci1,2,5,14] [
The numbers 1, 2, 5, 14 are the very famous Catalan numbers! They appear naturally as the number of coalescent histories for “catepillar” trees.
Nonmatching case
Now let’s do the harder case in which the shape of the gene tree is different from the species tree. We have to consult the paper by Dr. Rosenberg once again.
Definition (Def. 4.1, Rosenberg (2007)). For a tree \(S\) with \(n \geq 2\) taxa and a tree \(G\) whose taxa are a subset of those of \(S\), let \(T(G,S)\) denote the minimal displayed tree of \(S\) that is induced by the taxa of \(G\).
Definition (Def. 4.2, Rosenberg (2007)). For a tree \(S\) with \(n \geq 2\) taxa and a tree \(G\) whose taxa are a subset of those of \(S\), let \(d(G,S)\) denote the number of branches of \(S\) that separate the root of \(S\) from the root of \(T(G,S)\). For a given tree \(S\), considering all possible trees \(G\) whose taxa form a subset of those of \(S\), the value of \(d(G,S)\) ranges from \(0\) to the largest number of branches separating a leaf from the root of \(S\).
Theorem (Thm. 4.3, Rosenberg (2007)) For \(m \geq 1\), the number of valid \(m\)-extended coalescent histories for a species tree labeled topology \(S\) with \(n \geq 2\) taxa and a gene tree labeled topology \(G\) whose taxa are a subset of those of \(S\) is
\[ B_{G,S,m} = \sum_{k=1}^m B_{G_L, T(G_L,S),k+d(G_L,S)} B_{G_R, T(G_R,S),k+d(G_R,S)}\]
where \(G_L\) and \(G_R\) respectively denote the left and right subtrees of \(G\). If \(n = 1\) or \(G\) has only one taxon, then \(B_{G,S,m} = 1\) for all \(m\).
Wow, this no longer looks like a simple recursion! Let’s first understand all the new terms. The function \(T(G,S)\) intuitively corresponds to the following operation: find all the leaves in \(G\) among the leaf labels of \(S\), and then return the tree “below” the most recent common ancestor of these leaves in \(S\). The point here is that \(T(G,S)\) may contain other leaves that are not part of \(G\). For example, take \(G=(a,c)\) and \(S=((a,b),c)\). Then, \(T(G,S) = S\) (think through this!).
The function \(d(G,S)\) is pretty self explanatory. It determines how many extra subdivisions of branches we need to introduce to account for nonmatching topologies.
It turns out that, despite having two different trees here, we are still recursing on the gene tree \(G\). The species tree plays an auxiliary function, letting us compute \(T(G,S)\) and \(d(G,S)\). So, it must be possible to elegantly express this with Haskell.
countNonmatchingCoalescentHistories :: Eq a => Tree a -> Tree a-> Int
countNonmatchingCoalescentHistories g s = countNonmatchingCoalescentHistories' (fromTree g) (fromTree s)
countNonmatchingCoalescentHistories' :: Eq a => ExtTree a -> ExtTree a -> Int
Leaf _ _) _ = 1
countNonmatchingCoalescentHistories' (Leaf _ _) = 1
countNonmatchingCoalescentHistories' _ (Node _ _ gl gr) s@(Node m _ _ _)
countNonmatchingCoalescentHistories' (= sum $ zipWith (*) bL bR
where
= (fromTree . mrcaNode (toTree s) . leafLabels . toTree) gl
tl = (fromTree . mrcaNode (toTree s) . leafLabels . toTree) gr
tr = length $ ancestorNodes (labelExt tl) (toTree s)
dl = length $ ancestorNodes (labelExt tr) (toTree s)
dr = [countNonmatchingCoalescentHistories' gl (relabelExt (k + dl) tl) | k <- [1..m]]
bL = [countNonmatchingCoalescentHistories' gr (relabelExt (k + dr) tr) | k <- [1..m]] bR
This is quite a bit more involved than the matching case, but we can still get through it! The first thing to notice is that now we have two “terminating” cases for the recursion: one for \(G\) and one for \(S\).
The second thing is that the structure of the function is the same:
there is the “aggregation pattern” sum . zipWith (*)
and
recursively computed values in lists bL
and
bR
.
To compute \(T(G,S)\), I compose two functions:
= mrcaNode s (leafLabels g) t g s
First, get the leaf labels of g
, and then find their
most recent common ancestor in s
– done! I just need to
wrap this with tree transformations to make the helper functions work
with the extended tree type. To compute \(d(G,S)\) I compute the length of the chain
of ancestors from the root node of \(T(G,S)\) to the root of \(S\), also using a helper function.
The function leafLabels
, mrcaNode
, and
others can be found in this
file.
You can try downloading my repository and running the code yourself! Try to find some more interesting number sequences using the nonmatching function. I hope this little tutorial was useful!