April 2004 archive

Latin Hypercube Sampling

Introduction

Latin hypercube sampling (LHS) is a form of stratified sampling that can be applied to multiple variables. The method commonly used to reduce the number or runs necessary for a Monte Carlo simulation to achieve a reasonably accurate random distribution. LHS can be incorporated into an existing Monte Carlo model fairly easily, and work with variables following any analytical probability distribution.

Monte-Carlo simulations provide statistical answers to problems by performing many calculations with randomized variables, and analyzing the trends in the output data. There are many resources available describing Monte-Carlo (history, examples, software).

The concept behind LHS is not overly complex. Variables are sampled using a even sampling method, and then randomly combined sets of those variables are used for one calculation of the target function.

The sampling algorithm ensures that the distribution function is sampled evenly, but still with the same probability trend. Figure 1 and figure 2 demonstrate the difference between a pure random sampling and a stratified sampling of a log-normal distribution. (These figures were generated using different versions of the same software. Differences within the plot, such as the left axis label and the black lines, are due to ongoing development of the software application and are not related to the issue being demonstrated.)

stratified-cfp.png

Figure 1. A cumulative frequency plot of "recovery factor", which was log-normally distributed with a mean of 60% and a standard deviation of 5%. 500 samples were taken using the stratified sampling method described here, which generated a very smooth curve.

random-cfp.png

Figure 2. A cumulative frequency plot of "recovery factor", which was log-normally distributed with a mean of 60% and a standard deviation of 5%. 500 random samples were taken.

Process

Sampling

To perform the stratified sampling, the cumulative probability (100%) is divided into segments, one for each iteration of the Monte Carlo simulation. A probability is randomly picked within each segment using a uniform distribution, and then mapped to the correct representative value in of the variable’s actual distribution.

A simulation with 500 iterations would split the probability into 500 segments, each representing 0.2% of the total distribution. For the first segment, a number would be chosen between 0.0% and 0.2%. For the second segment, a number would be chosen between 0.2% and 0.4%. This number would be used to calculate the actual variable value based upon its distribution.

P(X <= x) = n is solved for x, where n is the random point in the segment. How this is done is different for every distribution, but it’s generally just a matter of reversing the process of the probability function.

Grouping

Once each variable has been sampled using this method, a random grouping of variables is selected for each Monte Carlo calculation. Independant uniform selection is done on each of the variable’s generated values. Each value must only be used once.

Example

Average Joe has been offered free wood scraps from the local lumber yard. They tend to cut pieces up for customers and always have leftovers that nobody is interested in because the ends are damaged, or other contrived reasons. Joe wants to build U shaped picture frames out of these pieces, and he’s interested in knowing how probable it is that he can build a picture frame big enough for some strange purpose.

Joe’s odd U shaped picture frames require three pieces of wood. A base, and two sides. He’s going to make the base out of a chunk of 2-by-4, and the sides out of baseboards. The 2-by-4 (our first variable) comes in pieces 4cm to 40cm long, and the baseboards come in pieces 2cm to 60cm long. All possibilities are equally likely (uniform distribution).

We’re going to run this simulation with 10 iterations. (A common practice for Monte Carlo simulations is to start with about a thousand iterations, and work upwards depending upon the complexity of the problem.) We’re also going to use Python to do our simulation.

Sampling:


from __future__ import division
import random
iterations = 10
segmentSize = 1 / iterations
for i in range(iterations):
    segmentMin = i * segmentSize
    #segmentMax = (i+1) * segmentSize
    point = segmentMin + (random.random() * segmentSize)
    pointValue = (point * (variableMax - variableMin)) + variableMin

For our 2-by-4s, this might provide values like this:

Sample Segment Min Segment Max Segment Point Value
0 0% 10% 2.132 % 4.767
1 10% 20% 13.837 % 8.981
2 20% 30% 24.227 % 12.722
3 30% 40% 33.716 % 16.137
4 40% 50% 47.659 % 21.157
5 50% 60% 53.636 % 23.308
6 60% 70% 69.006 % 28.842
7 70% 80% 73.106 % 30.318
8 80% 90% 88.888 % 35.999
9 90% 100% 90.812 % 36.692

Once Joe has all three runs of the sampling for each variable, he’ll have a set of output like this:

Sample Variable 1 Variable 2 Variable 3
0 4.767 7.483 4.719
1 8.981 7.731 9.417
2 12.722 13.150 11.852
3 16.137 18.347 15.136
4 21.157 21.233 21.786
5 23.308 23.382 22.630
6 28.842 26.008 26.180
7 30.318 32.452 31.977
8 35.999 36.129 34.198
9 36.692 37.482 39.968

As you can see, even though both variable 2 and variable 3 have the same distribution (they’re the baseboards), the stratified sampling does not cause them to have the same numbers. It just causes a more evenly distributed set of numbers, eliminating the possibility of all 10 samples being the exact same number. In Monte Carlo simulation, we want the entire distribution to be used evenly. We usually use a large number of samples to reduce actual randomness, but the latin hypercube sampling permits us to get the ideal randomness without so much of a calculation.

The only step left is to shuffle these variables around to create some calculations. Joe’s software might randomly pick iteration 9 for variable 1, iteration 3 for variable 2, and iteration 5 for variable 3. This results in a set of (35.999, 13.150, 21.786) for the sizes of the 2-by-4s and the left and right baseboard. (Joe’s got a lopsized U! Not entirely unexpected. He should have been the guy buying the lumber first, the cheap bastard.)

Each number should be discarded after it is used, and others selected for the next calculation.

An alternative at this point is to run one case with each of the possible variable sets, creating 1000 calculations for these simple 10 variables. For complex cases involving many variables and many samplings, this can balloon to unreasonable numbers of calculations quickly. This eliminates all of the benefits that Monte Carlo simulation provides, while still providing approximately the same answer.

Conclusion

Latin hypercube sampling is capable of reducing the number of runs necessary to stablize a Monte Carlo simulation by a huge factor. Some simulations may take up to 30% fewer calculations to create a smooth distribution of outputs. The process is quick, simple, and easy to implement. It helps ensure that the Monte Carlo simulation is run over the entire length of the variable distributions, taking even unlikely extremities into account as one would desire.

The MutableNumber Class Revisited

Ian Bicking’s comment suggesting I wasn’t far away from getting the MutableNumber class to work was quite correct. I had never considered using a descriptor, which made it difficult to make this work with a metaclass. The following code, though, does work:


class Operator(object):
    def __init__(self, instance, op):
        self.instance = instance
        self.op = op
    def __call__(self, rhs):
        try:
            return self.op(self.instance.number, rhs.number)
        except AttributeError:
            return self.op(self.instance.number, rhs)

class OperatorDescriptor(object):
    def __init__(self, op):
        self.op = op
    def __get__(self, obj, type=None):
        return Operator(obj, self.op)
    def __set__(self, obj, value):
        raise TypeError
    def __delete__(self, obj):
        raise TypeError

class NumberDelegateMetaclass(type):
    def __init__(cls, name, bases, dct):
        for attr in cls.__operatormap__:
            setattr(cls, attr, OperatorDescriptor(getattr(operator, attr)))

class MutableNumber(object):
    number = None

    __metaclass__ = NumberDelegateMetaclass
    __operatormap__ = "__add__", "__div__", "__eq__", "__floordiv__", "__ge__", \
        "__gt__", "__inv__", "__invert__", "__le__", "__lshift__", "__lt__", \
        "__mod__", "__mul__", "__ne__", "__neg__", "__pos__", "__pow__", \
        "__rshift__", "__sub__", "__truediv__"
    def __init__(self, value=None):
        self.number = value

It’s a miracle.

Overriding Operators in New-Style Classes

Occasionally, even Python violates the principal of ‘least surprise’ – that is, software should do what you expect it to do. Overriding mathematics operators in new-style classes led me to find one such violation, and investigate the reason behind it.

I setup myself a make-work project a few days ago when I probably should have done a dozen other things. After reading a posting to python-dev entitled Python is faster than C, I was motivated to create a pure-python implementation of a matrix class. I was interested in building it primarily as a refresher of matrix mathematics, and also to experiment with psyco‘s x86 optimization capabilities.

I ran into a quirk of new-style classes. I envisioned the Matrix class having two hidden data fields, one to hold a tuple of Vector objects representing each row, and a second one to hold a tuple of Vector objects representing each column. This would allow for easy access to the data based upon rows or columns, whichever was more convenient, with no speed penalty. Each tuple of Vectors ought to point to the same numbers, though. How does one do this with Python?

I’d like self.__cdata[0][0] and self.__rdata[0][0] to always be the same number, even if I only change one. I could manage this with accessor methods that make sure they’re the same, but I decided to wrap each number in an object called a MutableNumber. The class was setup like so:


class MutableNumber(object):
    def __init__(self, value = None):
        self.number = value

Every mathematical operator should have then been overridden like this:


...
    def __mul__(self, rhs):
        try:
            return MutableNumber(self.number * rhs.number)
        except AttributeError:
            return MutableNumber(self.number * rhs)
...

What a pain to type and maintain, though. What if I wanted to add another possibility in the future, or change the operator to not return a new MutableNumber object wrapping the number? There are twenty-two operators I wanted to override (eg. __mul__, __eq__, __div__, __floordiv__…), and I didn’t want to have to type them, even once.

I originally setup a meta-class to delegate every access to a property in the class’s __delegateattributes__ member to self.number. This had some issues:


>>> MutableNumber(2) * 3
6
>>> MutableNumber(2) * MutableNumber(3)
Traceback (most recent call last):
  ...
TypeError: unsupported operand types for *: 'int' and 'MutableNumber'

I had to go a more complex route, rather than just pure delegation. I eventually replaced all the operators with custom instances of an Operator class, which held an operator and a reference to the instance of the class:


class MutableNumber(object):
    __operatormap__ = "__add__", "__div__", "__eq__", "__floordiv__", "__ge__", \
        "__gt__", "__inv__", "__invert__", "__le__", "__lshift__", "__lt__", \
        "__mod__", "__mul__", "__ne__", "__neg__", "__pos__", "__pow__", \
        "__rshift__", "__sub__", "__truediv__"
    class Operator(object):
        def __init__(self, instance, op):
            self.instance = instance
            self.op = op
        def __call__(self, rhs):
            op = getattr(operator, self.op)
            try:
                return op(self.instance.number, rhs.number)
            except AttributeError:
                return op(self.instance.number, rhs)

    def __init__(self, value=None):
        for op in self.__operatormap__:
            setattr(self, op, MutableNumber.Operator(self, op))
        self.number = value

So, we finally reach the part where it gets interesting. What happens? Does this work properly? The code seems good, but it fails to function:


>>> MutableNumber(3) * 2
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for *: 'MutableNumber' and 'int'

Unsupported operand for types? It looks like it’s totally ignoring the fact that the MutableNumber instance has a __mul__ property set. Hm… stranger still:


>>> MutableNumber(3).__mul__(2)
6
>>> MutableNumber(3).__class__.__mul__(2)
Traceback (most recent call last):
  ...
AttributeError: type object 'MutableNumber' has no attribute '__mul__'

Well, it seems that A * B does not call A.__mul__(B). It searches for A.__class__.__mul__, but not A.__mul__. It works fine with an old-style class, but not a new-style class.

It turns out this behaviour has been discovered before. In fact, Jan Decaluwe was trying to do the same thing I am, create an instance object that acts like a mutable number. Alex Martelli wrote:

The idea is that operators should rely on special
methods defined by the TYPE (class) of the object they’re working
on, NOT on special methods defined just by the OBJECT itself and
not by its class. Doing otherwise (as old-style classes did, and
still do for compatibility) is untenable in the general case, for
example whenever you consider a class and its metaclass (given that
the metaclass IS the class object’s type, no more and no less).

Well, too bad. I guess maybe I’ll just backtrack all the way to using accessor methods to make sure the column and row data is both kept constant… or just store one of them.