Say you developed a program which uses randomness. How can you test that the randomness is actually uniformly distributed? An example is testing that a mutation function for a genetic algorithm mutates each bit with equal probability. Another example is testing that a random shuffle of a list places each item in any position in the list with equal probability.

Creating the test

The way to test for uniform distribution is to run the random process many times, recording the output each time, and count the number of times each output occurs. After many trials, the counts should be similar. For example, to test whether a coin flip is uniformly distributed, count the number of heads and the number of tails; the two counts should be similar. In the case of shuffling a list, make a table with a row for each position in the list and a column for each item in the list, and after each shuffle add one to the cell matching each item's final position. The counts across the whole table should be similar.
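The shuffle table described above can be sketched as follows. This is only an illustration: the function name `shuffle_position_counts` is made up for this post, Python's `random.shuffle` stands in for whatever shuffle you are actually testing, and the fixed seed is there only to make the sketch repeatable.

```python
import random

def shuffle_position_counts(items, trials, rng):
    """Return counts[position][item_index]: how often each item landed in each position."""
    n = len(items)
    counts = [[0] * n for _ in range(n)]
    for _ in range(trials):
        # Shuffle the indices rather than the items, so items need not be hashable.
        order = list(range(n))
        rng.shuffle(order)
        for position, item_index in enumerate(order):
            counts[position][item_index] += 1
    return counts

rng = random.Random(0)  # assumption: a fixed seed, only for repeatability
counts = shuffle_position_counts(["a", "b", "c"], trials=30000, rng=rng)
for row in counts:
    print(row)  # every cell should be close to 30000 / 3
```

Every row and every column of the table adds up to the number of trials, so a uniform shuffle should give cells close to the number of trials divided by the length of the list.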

Here is an example unit test which tests if a dice simulator is uniformly distributed. Assume that there is a function called "similarity" which tells you how similar to each other a bunch of frequencies are.

    import unittest

    class TestDice(unittest.TestCase):

        def test_uniform(self):
            trials = 1000000

            best_similarity = similarity({ 1 : trials//6, 2 : trials//6, 3 : trials//6, 4 : trials//6, 5 : trials//6, 6 : trials//6 }.values())
            threshold = 0.9*best_similarity

            frequencies = { 1 : 0, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0 }
            for i in range(trials):
                outcome = roll_dice()
                frequencies[outcome] += 1

            self.assertTrue(similarity(frequencies.values()) >= threshold)

This program starts by creating a threshold against which to compare the similarity of the frequency list. This threshold is 90% of the similarity measure when the frequencies are all the same. The percentage can be tweaked of course. The similarity between the actual frequencies must be at least as high as this threshold for the test case to pass.

It could be the case that instead of a similarity function there is a "distance" function, which is 0 when all frequencies are the same and greater otherwise. In this case the code should be a little different.

    import unittest

    class TestDice(unittest.TestCase):

        def test_uniform(self):
            trials = 1000000

            # Worst case: a single outcome receives every trial.
            worst_distance = distance({ 1 : trials, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0 }.values())
            threshold = 0.1*worst_distance

            frequencies = { 1 : 0, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0 }
            for i in range(trials):
                outcome = roll_dice()
                frequencies[outcome] += 1

            self.assertTrue(distance(frequencies.values()) <= threshold)

In this case, instead of comparing against the best similarity, we compare against the worst distance, which occurs when only one of the outcomes gets all of the frequency. The threshold would then be 10% of this worst measure. The distance between the actual frequencies must not exceed this threshold for the test case to pass.

Measuring the distance/similarity

The problem is how to measure the similarity among a list of frequencies in such a way that it is quantified as a single number, which a program (a unit test) can then check against some threshold. Here are three methods to quantify the similarity among a list of frequencies.

Standard Deviation

The average of a bunch of frequencies is the number which is, overall, nearest to all of the frequencies (it minimises the total squared difference from them). But an average is only half of the story; it is also important to know how close the frequencies are to this average. To measure how close all the frequencies are to their average you need to find their standard deviation, which is basically the typical difference between each frequency and the average (precisely, the square root of the average squared difference). It is found in the following way:

distance(xs) = sqrt( average( (x - average(xs))^2 for x in xs ) )

The closer the frequencies are to their average, the more similar they are. So the smaller the standard deviation, the more similar the frequencies are; hence, it is a distance measure rather than a similarity measure.
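As a rough sketch, the formula above could be written in Python like this. The name `stddev_distance` is just an illustrative choice; it could be passed in wherever the unit test expects a "distance" function. Python's built-in `statistics.pstdev` computes the same quantity.

```python
import math

def stddev_distance(xs):
    """Population standard deviation of the frequencies (0 when all are equal)."""
    xs = list(xs)  # accept dict views such as frequencies.values()
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

print(stddev_distance([10, 10, 10, 10]))          # 0.0
print(stddev_distance([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```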

Chi-squared test

Let's say that you're conducting an experiment to measure the effects of smoking cigarettes on a person's health. You take two groups of people, one group consisting of smokers and another group consisting of non-smokers, and you measure their collective health. I have no idea how to measure collective health but let's say that you have two numbers which measure how healthy each group is. How do you quantify how similar the two numbers are in order to determine how significant the effect of smoking is on a person's health?

A common way to do this in science is by using Pearson's Chi-squared test, which measures how far a list of observed counts is from the counts that were expected. In order to apply this to similarity between frequencies, we treat the average as the expected count for every outcome and measure how far each frequency is from it.

distance(xs) = sum( (x - average(xs))^2 / average(xs) for x in xs )

This number will be smaller the more similar the frequencies are, making this a distance measure rather than a similarity measure.
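A minimal Python sketch of this formula follows, under the assumption that at least one trial has been recorded (otherwise the average is zero and the division fails). The name `chi_squared_distance` is made up for illustration.

```python
def chi_squared_distance(xs):
    """Chi-squared statistic of the frequencies against their own average."""
    xs = list(xs)  # accept dict views such as frequencies.values()
    expected = sum(xs) / len(xs)  # assumes sum(xs) > 0, i.e. at least one trial
    return sum((x - expected) ** 2 / expected for x in xs)

print(chi_squared_distance([10, 10, 10]))  # 0.0
print(chi_squared_distance([8, 12, 10]))   # 0.8
```

With this measure, the worst case used in the unit test (one outcome receiving every trial) works out to the number of trials times one less than the number of outcomes, so for 600 rolls of a six-sided die it is 3000.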

Min max ratio

Of course there's a less fancy way to measure how similar a bunch of frequencies are, and that is by measuring the difference between the largest and smallest frequency. If the largest and smallest numbers are the same then all the numbers are the same. If they are different then that is the largest difference between any two numbers.

The difference between maximum and minimum isn't a useful measure of similarity since the similarity between 10 and 5 is not the same as the similarity between 1000 and 995. Instead it is more useful to divide the minimum by the maximum. But this will result in a value of zero whenever one of the frequencies is zero (and hence the minimum is zero), which is not measuring anything useful either. A solution for this is to add one to both the minimum and maximum frequency in order to avoid having zeros but still have a maximum similarity of 1.

similarity(xs) = (min(xs) + 1)/(max(xs) + 1)

Apart from this function giving a larger number the more similar the frequencies are, the difference between this measure and the others is that if all the frequencies are the same except for one, this method will identify the frequencies as different, whereas the others will be affected less. This might be considered an advantage when checking if a random process is uniformly distributed.
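Here is a small Python sketch of the min max ratio, with the illustrative name `min_max_similarity`. The last call shows the behaviour described above: five equal frequencies and one that is half as large already pull the similarity down to roughly one half.

```python
def min_max_similarity(xs):
    """Ratio of smallest to largest frequency, with 1 added to each to avoid zeros."""
    xs = list(xs)  # accept dict views such as frequencies.values()
    return (min(xs) + 1) / (max(xs) + 1)

print(min_max_similarity([10, 10, 10]))  # 1.0
print(min_max_similarity([0, 0, 9]))     # 0.1
print(min_max_similarity([1000, 1000, 1000, 1000, 1000, 500]))  # about 0.5
```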