Home >>
ICFP 2017
Sun 3 - Sat 9 September 2017 Oxford, United Kingdom

Log-time Sampling and Updating for Discrete DistributionsSat 9 Sep 2017

I am Michael Walker, a Ph.D student at the University of York working on concurrency testing. This post is about one of my favourite talks from the Haskell Symposium this year, Ode on a Random Urn, a functional pearl by Leonidas Lampropoulos, Antal Spector-Zabusky, and Kenneth Foner. It’s about a good data structure for discrete probability distributions.

Lists of Weights

A common representation for discrete probability distributions is just a list of weights, like so:

type Distribution a = [(Int, a)]

We can sample it in linear time, by generating an integer in the range [0, sum of weights) and indexing into the distribution:

index :: Distribution a -> Int -> a
index ((weight, a):as) i
  | i <= weight = a
  | otherwise   = index as (i - weight)
index [] _ = error "index out of range"

total :: Distribution a -> Int
total = sum . map snd

sample :: RandomGen g => Distribution a -> g -> (a, g)
sample distrib g =
  let (i, g') = randomR (0, total distrib - 1) g
  in (index distrib i, g')

This is linear time, which is probably good enough for small distributions.

Balanced Trees

Another way of representing discrete probability distributions is to use a binary tree:

data Distribution a
  = Leaf Int a
  | Branch Int (Distribution a) (Distribution a)

Indexing just descends the tree, going left or right as appropriate:

index :: Distribution a -> Int -> a
index (Leaf _ a) _ = a
index (Branch weight left right) i
  | i <= weight = index left i
  | otherwise = index right (i - weight)

total :: Distribution a -> Int
total (Leaf   weight _)   = weight
total (Branch weight _ _) = weight

sample :: RandomGen g => Distribution a -> g -> (a, g)
  sample distrib g =
  let (i, g') = randomR (0, total distrib - 1) g
  in (index distrib i, g')

If we can ensure that the tree is balanced, this is log time. But how do we construct this balanced binary tree?

Urns

We can’t just insert (or remove) elements anywhere in the tree, as that can destroy the balance. Fortunately, we have a bit of leeway here due to the problem domain: we don’t care about the order of the values in the tree. This is the key insight.

If we keep track of the size of the tree, we can interpret the binary representation of that size as a path: 0 for left, 1 for right, starting from the least-significant bit. Using these paths when inserting ensures that new elements are evenly distributed through the tree, keeping it balanced:

data Urn a = Urn Word (Distribution a)

insert :: Int -> a -> Urn a -> Urn a
insert w' a' (Urn size tree) = Urn (size + 1) (go size tree) where
  go _ l@(Leaf w a) = Branch (w + w') l (Leaf w' a')
  go path (Node w l r) =
    let path' = path `shiftR` 1
    in if path `testBit` 0
       then Node (w + w') l (go path' r)
       else Node (w + w') (go path' l) r   |

The final piece of the puzzle is deletion while keeping the tree balanced, but I’ll let you look up the (very clear!) explanation in the paper.

Things look pretty good compared to lists:

Operation Worst Case (List) Worst Case (Urn)
Insert O(1) O(log n)
Sample O(n) O(log n)
Update O(n) O(log n)

And things are pretty good in practice, too. The paper has a few examples:

  • An alternative to QuickCheck’s frequency combinator, which samples a discrete distribution
  • An efficient backtracking combinator, used in generating lambda terms of a given type
  • Some comparisons with existing Haskell random libraries

There you have it. I particularly liked this talk because it’s a simple idea explained clearly, and it works very well. A true pearl.