I am Michael Walker, a Ph.D student at the University of York working on concurrency testing. This post is about one of my favourite talks from the Haskell Symposium this year, Ode on a Random Urn, a functional pearl by Leonidas Lampropoulos, Antal Spector-Zabusky, and Kenneth Foner. It’s about a good data structure for discrete probability distributions.
A common representation for discrete probability distributions is just a list of weights, like so:
type Distribution a = [(Int, a)]
We can sample it in linear time, by generating an integer in the range
[0, sum of weights) and indexing into the distribution:
index :: Distribution a -> Int -> a index ((weight, a):as) i | i <= weight = a | otherwise = index as (i - weight) index  _ = error "index out of range" total :: Distribution a -> Int total = sum . map snd sample :: RandomGen g => Distribution a -> g -> (a, g) sample distrib g = let (i, g') = randomR (0, total distrib - 1) g in (index distrib i, g')
This is linear time, which is probably good enough for small distributions.
Another way of representing discrete probability distributions is to use a binary tree:
data Distribution a = Leaf Int a | Branch Int (Distribution a) (Distribution a)
Indexing just descends the tree, going left or right as appropriate:
index :: Distribution a -> Int -> a index (Leaf _ a) _ = a index (Branch weight left right) i | i <= weight = index left i | otherwise = index right (i - weight) total :: Distribution a -> Int total (Leaf weight _) = weight total (Branch weight _ _) = weight sample :: RandomGen g => Distribution a -> g -> (a, g) sample distrib g = let (i, g') = randomR (0, total distrib - 1) g in (index distrib i, g')
If we can ensure that the tree is balanced, this is log time. But how do we construct this balanced binary tree?
We can’t just insert (or remove) elements anywhere in the tree, as that can destroy the balance. Fortunately, we have a bit of leeway here due to the problem domain: we don’t care about the order of the values in the tree. This is the key insight.
If we keep track of the size of the tree, we can interpret the binary representation of that size as a path: 0 for left, 1 for right, starting from the least-significant bit. Using these paths when inserting ensures that new elements are evenly distributed through the tree, keeping it balanced:
data Urn a = Urn Word (Distribution a) insert :: Int -> a -> Urn a -> Urn a insert w' a' (Urn size tree) = Urn (size + 1) (go size tree) where go _ l@(Leaf w a) = Branch (w + w') l (Leaf w' a') go path (Node w l r) = let path' = path `shiftR` 1 in if path `testBit` 0 then Node (w + w') l (go path' r) else Node (w + w') (go path' l) r |
The final piece of the puzzle is deletion while keeping the tree balanced, but I’ll let you look up the (very clear!) explanation in the paper.
Things look pretty good compared to lists:
|Operation||Worst Case (List)||Worst Case (Urn)|
And things are pretty good in practice, too. The paper has a few examples:
- An alternative to QuickCheck’s
frequencycombinator, which samples a discrete distribution
- An efficient backtracking combinator, used in generating lambda terms of a given type
- Some comparisons with existing Haskell random libraries
There you have it. I particularly liked this talk because it’s a simple idea explained clearly, and it works very well. A true pearl.