
Wednesday, January 27, 2021

The secretary problem

Say you're dating several partners and want to get into a serious relationship eventually. After how many partners can you have a good idea of what a good partner is like? This is called the secretary problem, and it can be formalised as follows:

Let $n$ be the total number of partners you can date. Let $A_i$ be the fitness of the $i$th partner, where the order of the partners is random and we are interested in the partner with the maximum fitness. Let $r$ be the number of partners you date before you start looking for a serious relationship. Once a partner has been evaluated, you cannot go back to them after you try the next partner. What you're doing is going through the first $r$ partners, taking $\max_{i=1}^{r}(A_i)$ as a reference of what a good partner looks like, then continuing through the remaining partners $r+1$ to $n$ and stopping as soon as some $A_i$ is greater than the reference. If none of the remaining partners are better than the best of the first $r$, then you settle for the last partner. The question is: what should $r$ be such that the probability of picking the partner with the maximum $A_i$ is maximised?

The choice of $r$ actually makes a significant difference, as we can show in the following simulation in Python 3 (in Python everything needs to be 0-based such that the first item is at position 0 rather than 1):

import random

def select(A, r):
    ref = max(A[:r]) # first r items
    for i in range(r, len(A)): # rest of the items
        if A[i] > ref:
            return i
    return len(A)-1 # return last item if search failed

n = 20
for r in range(1, n): # first item not included
    successes = 0
    for _ in range(10000):
        A = [random.random() for _ in range(n)]
        best = max(range(n), key=lambda i:A[i])
        choice = select(A, r)
        if choice == best:
            successes += 1
    print(r+1, successes/10000) # r+1 is the 1-based position at which the serious search starts
2 0.1746
3 0.2532
4 0.3099
5 0.3459
6 0.3581
7 0.3856
8 0.3927
9 0.3866
10 0.3717
11 0.366
12 0.3382
13 0.3227
14 0.2918
15 0.263
16 0.2228
17 0.1908
18 0.149
19 0.0969
20 0.0474

These are the fractions of times the best partner was selected out of 10000 simulations. The first number on each line is $r+1$, the 1-based position at which the serious search starts, so the simulated reference sizes $r$ run from 1 to 19. You will notice that the fraction steadily increases up to a peak at the line labelled 8 (that is, $r = 7$) and then steadily decreases until the end. It's interesting that the peak is not in the middle of the list but closer to the beginning. How can we find the best $r$ for any $n$?

First, we need to find the probability of choosing the best partner for a given $r$ and $n$. This probability can be expressed as two events happening together: you stopping at element $i+1$ and the maximum element being element $i+1$ (we're using $i+1$ instead of just $i$ because it will be mathematically convenient later). These two events are independent and so we can find the probability of each one and then multiply them together.

The probability of the maximum element from all $n$ elements being the $i+1$th element is $\frac{1}{n}$.

The other probability is a little trickier. To stop at element $i+1$, all the elements prior to $i+1$ have to be less than the reference we got from the first $r$ elements. If a larger element is found sooner, we would have stopped there. Therefore, we are interested in the probability that the maximum element from the first $i$ elements is among the first $r$ elements, which is $\frac{r}{i}$. This is assuming that $i+1 \geq r+1$, since we cannot choose any partner before the $r+1$th partner.

Multiplying these two probabilities, we get $\frac{r}{n \times i}$. But this is the probability for a particular $i+1$, whereas we want the probability for any $i+1$. The events for different values of $i+1$ are mutually exclusive, so we can add up their probabilities. Given that $i+1$ must be at least $r+1$ and at most $n$ by definition,

$$P(r, n) = \sum_{i=r}^{n-1}\frac{r}{n \times i} = \frac{r}{n}\sum_{i=r}^{n-1}\frac{1}{i}$$

Great, so now we have the probability of making a successful selection given $r$ and $n$. Now how do we find at what $r$ this is maximised? Normally we would find the derivative of this expression with respect to $r$ and see what $r$ makes the derivative 0 (the turning point), but we can't differentiate this expression because $r$ is discrete. We also cannot treat $r$ as continuous because it is used as the starting point of the summation. But if we focus on finding $\frac{r}{n}$ rather than $r$, then we can do so in a roundabout way.

Start by multiplying each term in the summation by $\frac{n}{n}$, like this:

$$\frac{r}{n}\sum_{i=r}^{n-1}\frac{n}{i} \frac{1}{n}$$

We can do this since $n$ cannot be zero. Now what will this expression look like if we let $n$ go to infinity?

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n-1}\frac{n}{i} \frac{1}{n}$$

Let $x = \lim_{n \to \infty} \frac{r}{n}$ and $t = \lim_{n \to \infty} \frac{i}{n}$. In this case, the summation looks like a Riemann sum, that is, it is adding together the areas of equally wide rectangular strips in a fixed region under a graph. As the width of the strips goes to zero, the total area of all the strips equals the area under the graph (in the fixed region), making it equal to the definite integral of the graph. Here, the graph is $f(t) = \frac{1}{t}$, the width of the strips is $\frac{1}{n}$, and the region is between $\frac{r}{n}$ and $\frac{n}{n}$. This means the following:

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n-1}\frac{n}{i} \frac{1}{n} = x\int_x^1 \frac{1}{t} dt$$

Great. So now we need to find the turning point of $x\int_x^1 \frac{1}{t} dt$ with respect to $x$ in order to know what $\frac{r}{n}$ gives the most probable success. This comes as follows:

$$ \begin{align} \frac{d}{dx} (x\int_x^1 \frac{1}{t} dt) &= 0 \\ \frac{d}{dx} x(\ln(1) - \ln(x)) &= 0 \\ \frac{d}{dx} -x\ln(x) &= 0 \\ -\ln(x) - 1 &= 0 \\ x &= \frac{1}{e} \\ \end{align} $$

Which says that the number of partners you should date before looking for a serious relationship is the total number of partners you will date multiplied by $\frac{1}{e}$, or about 37% of the number of people you'll go out with.
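As a quick sanity check, we can also evaluate the formula $P(r, n) = \frac{r}{n}\sum_{i=r}^{n-1}\frac{1}{i}$ directly for every $r$ and compare the best $r$ against $\frac{n}{e}$. Here is a minimal sketch (the function name is just for illustration):

import math

def success_probability(r, n):
    # P(r, n) = (r/n) * sum of 1/i for i from r to n-1
    return (r / n) * sum(1 / i for i in range(r, n))

n = 20
best_r = max(range(1, n), key=lambda r: success_probability(r, n))
print(best_r, round(success_probability(best_r, n), 4), round(n / math.e, 2))

For $n = 20$ this picks $r = 7$, which agrees with both the simulation above and $\frac{20}{e} \approx 7.4$.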

Sunday, March 17, 2019

The Gambler's Fallacy: Why after tossing 100 heads in a row you can still get heads again

The Gambler's fallacy is a deceptively intuitive fallacy about how randomness works. Imagine you're tossing a fair coin repeatedly. You're astonished to see that every toss is resulting in heads. Heads, heads, heads, heads, over and over. As the number of consecutive heads keeps going up, does the probability of getting tails start increasing? It seems reasonable to conclude this, as the number of heads and tails should balance each other out given that the coin is fair, as stated by the Law of Large Numbers. But the idea that there is some kind of memory in the coin that prevents the fair coin from acting as if it is unfair is wrong.

First, let's show that the Gambler's Fallacy is actually a fallacy by simulating a bunch of coin tosses in a program. We'll keep track of all the times a streak of 10 heads was obtained and the following toss after it and see if the number of heads following the streak is significantly less than the number of tails. We do not consider 11 heads in a row as two streaks of 10 heads but instead start searching for a completely new streak after every streak. Here is a Python 3 program to do that (we also count the number of heads and tails obtained in general to show that the simulated coin is not biased).

import random

seed = 0
num_trials = 1000000
head_streak_length_limit = 10

random.seed(seed)

num_heads = 0
num_tails = 0
heads_streak_length = 0
next_was_head = 0
next_was_tail = 0
for trial in range(num_trials):
    result = random.choice(['head', 'tail'])

    if result == 'head':
        num_heads += 1
    else:
        num_tails += 1

    if heads_streak_length < head_streak_length_limit:
        if result == 'head':
            heads_streak_length += 1
        else:
            heads_streak_length = 0
    else:
        if result == 'head':
            next_was_head += 1
        else:
            next_was_tail += 1

        heads_streak_length = 0

print('heads / tails:', round(num_heads/num_tails, 4))
print('after heads streak was head / after heads streak was tail:', round(next_was_head/next_was_tail, 4))

And here are the results:
heads / tails: 1.002
after heads streak was head / after heads streak was tail: 1.0579

Both the ratio of heads and tails in general and the ratio of heads and tails following a streak of 10 heads in a row are close to 1. You can try changing the streak length limit to see how making it shorter does not significantly change this ratio.

So why does this happen? It's a simple matter of probability theory. Both the probability of getting a head and of getting a tail is 0.5, and each toss is independent of the previous ones. Therefore the probability of getting 10 heads followed by another head is $0.5^{10} \times 0.5 = 0.5^{11}$ and the probability of getting 10 heads followed by a tail is also $0.5^{10} \times 0.5 = 0.5^{11}$. It does not make a difference what the previous tosses were; the next toss being heads will always have a probability of 0.5, so both heads and tails are equally likely.

So then the more interesting question is why do we think otherwise? Why do we think that as the chain of consecutive heads gets longer, the probability of getting another head goes down? It could be because we expect the coin to be fair and a coin that never gives tails is a biased coin, but that doesn't really mean anything in terms of predicting what the next toss will be. It is entirely possible that the fairest possible coin will give a million heads in a row. There is nothing stopping this from happening. Not only that, but a million heads in a row is just as likely to happen as any other random-looking sequence resulting from a million coin tosses. Let's write down a sequence of heads and tails as a string of 'H's and 'T's. The sequence 'HHHHHH' is just as likely to come up as the sequence 'HHHTTH', yet we think of the first sequence as being unlikely whilst of the second sequence as... what? Is the second sequence more likely to come up than the first? Would you bet more money on the second sequence than the first? This is what I think is the root cause of the Gambler's Fallacy.

What I think happens in our mind is that we don't really predict whether the next toss will give a heads or a tails but whether the whole sequence will turn out to be an 'interesting' looking sequence or not. We love patterns, and when we see that a random sequence generator (coin tosses in this case) is giving out a sequence that looks like a simple pattern, we become suspicious of the randomness behind the sequence generator and assume that it isn't really random. We're pretty terrible at generating random sequences manually because we assume that a random process will not produce things like a streak of heads. But it's not really just a streak of heads that would make us suspicious, is it? It's also, for example, a sequence of alternating heads and tails, or a long sequence of heads followed by a long sequence of tails. I think that the source of the Gambler's Fallacy is that we categorise entire sequences into interesting and uninteresting ones. As we watch a sequence being generated one toss at a time, we expect that interesting sequences of that length are much less likely to happen than uninteresting ones, which is true. This makes us expect the next toss to break the interestingness of the current sequence, which, in the case of a streak of heads, means the next toss turning out to be tails. Of course, although interesting sequences become less likely as the length grows, the probability of the interesting pattern becoming uninteresting on the next toss is equal to the probability of it remaining interesting, so the expectation is still incorrect.

So what is the probability of an interesting sequence? This is a subjective question but we can still try to investigate it. Ideally, a survey would be conducted on a population of people to check what is considered interesting on average. But that would be weird, so I'll just conduct the survey on myself. I have generated all the possible outcomes of 4 coin tosses and annotated each sequence according to what I find interesting in it. I also did the same for 6 coin tosses to see how the ratio of interesting to uninteresting sequences changes with sequence length.

Here is what I find interesting in sequences of 4 coin tosses.
TTTT: same
TTTH: not interesting
TTHT: not interesting
TTHH: switched
THTT: not interesting
THTH: repeated
THHT: mirrored
THHH: not interesting
HTTT: not interesting
HTTH: mirrored
HTHT: repeated
HTHH: not interesting
HHTT: switched
HHTH: not interesting
HHHT: not interesting
HHHH: same

And here is what I find interesting in sequences of 6 coin tosses (organised into two columns to reduce vertical space).
TTTTTT: same                TTTTTH: not interesting
TTTTHT: not interesting     TTTTHH: not interesting
TTTHTT: not interesting     TTTHTH: not interesting
TTTHHT: not interesting     TTTHHH: switched
TTHTTT: not interesting     TTHTTH: doubled
TTHTHT: not interesting     TTHTHH: not interesting
TTHHTT: mirrored            TTHHTH: not interesting
TTHHHT: not interesting     TTHHHH: not interesting
THTTTT: not interesting     THTTTH: not interesting
THTTHT: mirrored            THTTHH: not interesting
THTHTT: not interesting     THTHTH: alternated
THTHHT: not interesting     THTHHH: not interesting
THHTTT: not interesting     THHTTH: not interesting
THHTHT: not interesting     THHTHH: doubled
THHHTT: not interesting     THHHTH: not interesting
THHHHT: mirrored            THHHHH: not interesting
HTTTTT: not interesting     HTTTTH: mirrored
HTTTHT: not interesting     HTTTHH: not interesting
HTTHTT: doubled             HTTHTH: not interesting
HTTHHT: don't know          HTTHHH: not interesting
HTHTTT: not interesting     HTHTTH: not interesting
HTHTHT: alternated          HTHTHH: not interesting
HTHHTT: not interesting     HTHHTH: mirrored
HTHHHT: not interesting     HTHHHH: not interesting
HHTTTT: not interesting     HHTTTH: not interesting
HHTTHT: not interesting     HHTTHH: mirrored
HHTHTT: not interesting     HHTHTH: not interesting
HHTHHT: doubled             HHTHHH: not interesting
HHHTTT: switched            HHHTTH: not interesting
HHHTHT: not interesting     HHHTHH: not interesting
HHHHTT: not interesting     HHHHTH: not interesting
HHHHHT: not interesting     HHHHHH: same

Among the 4 coin toss sequences, 8 out of the 16 sequences were considered interesting, meaning that you have a 50% chance of generating an interesting sequence. But among the 6 coin toss sequences, only 16 out of the 64 sequences were considered interesting, meaning that you have a 25% chance of generating an interesting sequence. Now I don't know if the patterns I identified in the tables above are the only interesting patterns for any sequence length, since maybe some patterns only emerge in very long sequences, but it is safe to say that the longer the sequence, the greater the proportion of uninteresting and random looking sequences will be. Therefore, the longer the sequence, the less likely that a sequence will be interesting. This might be what is going on when we think that the probability of getting a tails goes up as the streak of heads gets longer; we'd really be comparing how likely it is for a sequence of a given length to be uninteresting, which becomes less likely as the length grows.
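We can also turn the pattern categories used in the tables above into rules and count how the proportion of interesting sequences shrinks as the sequences get longer. Here is a minimal sketch (the is_interesting rules are only my rough encoding of the hand annotations above, not a definitive list):

from itertools import product

def is_interesting(seq):
    half = len(seq)//2
    return (
        len(set(seq)) == 1 # same
        or seq == seq[::-1] # mirrored
        or seq[:half] == seq[half:] # repeated/doubled
        or (len(set(seq[:half])) == 1 and len(set(seq[half:])) == 1 and seq[0] != seq[-1]) # switched
        or all(a != b for (a, b) in zip(seq, seq[1:])) # alternated
    )

for length in [4, 6, 8, 10]:
    seqs = [''.join(s) for s in product('HT', repeat=length)]
    num_interesting = sum(is_interesting(s) for s in seqs)
    print(length, num_interesting, len(seqs), num_interesting/len(seqs))

These rules reproduce the 8 out of 16 and 16 out of 64 counts above, and the proportion keeps dropping as the length increases.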

Interesting patterns are interesting
I have to admit that whilst annotating interesting sequences I was more interested in being consistent in my reasoning than in saying what I genuinely found interesting. This is because I didn't actually feel that, for example, all "repeated" patterns were interesting, just some of them. But I wanted to avoid explaining myself too much based on what I felt. This means that there might be more going on in feeling that a sequence is aesthetically pleasing than simple mathematical patterns. It would be interesting if someone made an experiment where people are subjected to an EEG to see what is really considered interesting and what is not. Who knows, maybe it might even turn out to be a kind of Rorschach test where different patterns of interest indicate different personalities. And maybe it might even be an interesting machine learning challenge to try to make a program predict which sequences would be considered interesting by humans in order to quantify how interesting a given sequence is.

Thursday, December 27, 2018

Language models and the unknown token: Why perplexity gets worse as the corpus size increases

Unless you're making use of a character-based language model somehow, you're going to have a finite vocabulary of words that you can handle and so you need a way to handle out-of-vocabulary words. One common way to handle such words is by replacing them all with the unknown token, a pseudo word that replaces all out-of-vocabulary words. The unknown token has several advantages: Infrequent words in the training corpus will probably either not be learned by the neural network or not be used when generating sentences. Ignoring them will save a lot in model size as the embedding layer and softmax take a lot of parameters. Plus all the infrequent words put together will occur very frequently so the language model will definitely learn to use the unknown token. It also makes it possible to handle new words at test time (that are not found even once in the training corpus) as they will be replaced by the unknown token.

The problem with this approach is what happens when measuring the probability or perplexity of a sentence based on the probabilities of individual words. If you're comparing language models to see which ones make the best predictions, you usually use them all on a corpus to see how well they predict the words in the corpus. The higher the probabilities assigned to those sentences, the better the language model. This is usually measured using language model perplexity. But see what happens when you vary the vocabulary size. You will find that smaller vocabulary sizes lead to better language models, even though this makes no sense.

It turns out that if you just multiply all the probabilities of individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Let's say that you have a vocabulary size of 0, that is, you're considering all the words to be out-of-vocabulary and hence all of them will be replaced by the unknown token. This means that during training, the language model will learn that after every unknown token there is (probably) another unknown token, unless it's the end of the sentence. This will make the language model give very high probabilities for the unknown token. High word probabilities mean high sentence probabilities which mean good perplexities.

Now if we add another word to the vocabulary then we'll be introducing some uncertainty into the language model, as now it has to decide between using the unknown token or the known word. Even in a perfect language model, the same prefix of words can be followed by either of the two words, so there is no way to correctly assign 100% of the probability to one or the other. This means that the probabilities will be split between the two words, leading to an overall decrease in probabilities, and hence a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and correctly predict the right word.

We can't do away with the unknown token but we can strip away the unknown token's power. Assuming that all the language models are being evaluated on the same corpus, then different vocabularies will have different words being turned into the unknown token. Let's say that your language model considers 1000 different words in its vocabulary but the corpus you're evaluating it on has 500 different words that are out-of-vocabulary. So in reality your language model is predicting one of 1500 different words; it's just that 500 of those words are assumed to be a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words and not just one. If we avoid merging all those probabilities into one, then all the language models will have a fair comparison as they will all have the same vocabulary and the same amount of uncertainty about which word should come next. The question is how to distribute that single unknown token probability between the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and just give each word the same slice of probability from the whole. So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.

Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.

Let's say that we want to find the probability of the following sentence:
the loud dog barked at the loud man

and let's say that the language model we're using to do that has the following vocabulary:
the at dog man

this means that the sentence is now changed to:
the UNK dog UNK at the UNK man

Now the naive way to get the probability of the sentence is as follows:

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{<start>}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{<end>}|\text{the UNK dog UNK at the UNK man})
$$

But now with the new way we'll divide the unknown token's probabilities by 2, the number of different out of vocabulary words ('loud' and 'barked'):

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{<start>}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{<end>}|\text{the UNK dog UNK at the UNK man})
$$

Of course we can leave the re-weighting until the end of the equation by dividing the first equation by the number of different out-of-vocabulary words once for every unknown token, like this:

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{<start>}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{<end>}|\text{the UNK dog UNK at the UNK man})/ 2^3
$$

Now the sentence probability goes up as the vocabulary size increases!
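In log space this adjustment is just a subtraction of the log of the number of out-of-vocabulary word types for every unknown token encountered. Here is a minimal sketch of the idea (the function and argument names are made up for illustration, and natural-log probabilities are assumed):

import math

def adjusted_log_prob(tokens, token_log_probs, num_oov_types):
    # tokens: the sentence after out-of-vocabulary words were replaced with 'UNK'
    # token_log_probs: the log probability the language model assigned to each token
    # num_oov_types: the number of different out-of-vocabulary words in the evaluation corpus
    total = 0.0
    for (token, log_prob) in zip(tokens, token_log_probs):
        total += log_prob
        if token == 'UNK':
            total -= math.log(num_oov_types) # split the UNK probability uniformly among the out-of-vocabulary words
    return total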

Monday, April 3, 2017

Getting the top n most probable sentences using beam search

This is a continuation from a previous blog post on single sentence beam search.

Sometimes it is not enough to just generate the most probable sentence using a language model. Sometimes you want to generate the top 3 most probable sentences instead. In that case we need to modify our beam search a bit. We will make the function a generator that returns a sequence of sentences in order of probability instead of just returning a single most probable sentence. Here are the changes we need to make:

In the single sentence version, we were getting the most probable prefix in the current beam and checking if it is complete. If it is, then we return it and stop there. Instead, we will now not stop until the current beam is empty (or until the caller stops requesting more sentences). After returning the most probable prefix we will check the second most probable prefix and keep on returning complete prefixes until we either find one which is not complete or we have returned the whole beam. In the case that we return the whole beam, the algorithm stops there as there is nothing left with which to generate new prefixes. This means that the beam width gives a limit on the number of sentences that can be returned. If we do not return the whole beam then we continue generating prefixes with the remainder.

In the case that some complete sentences were returned, they also need to be removed from the beam before we continue generating. Since the beam is implemented as a min-first heap queue (min-first because we want to pop the least probable prefix quickly when the beam becomes bigger than the beam width), we cannot also quickly remove the most probable complete sentence. In order to do this, we first turn the heap into a list which is sorted by probability and then start popping out the items at the end if they are complete sentences. Following this, we heapify the list back into a min-first heap queue and continue as normal. This sorting and reheapifying should not impact performance too much if the beam width is relatively small.

If the clip length is reached then the whole beam is immediately returned in order of probability. This is because as soon as one prefix reaches the maximum allowed length, the entire beam consists of
  1. incomplete sentences that are also as long as the allowed maximum (since all the prefixes grow together)
  2. complete sentences that were found before but which do not have the maximum probability
After returning all the incomplete sentences (they have to be returned since they cannot be extended further), the complete sentences will also be returned as they will become the most probable sentences of what is left in the beam. The assumption is that if a complete sentence were nonsensical it wouldn't have remained in the beam, so we might as well return it rather than lose it.

Here is the modified Python 3 code:

import heapq

class Beam(object):

    def __init__(self, beam_width, init_beam=None):
        if init_beam is None:
            self.heap = list()
        else:
            self.heap = init_beam
            heapq.heapify(self.heap) #make the list a heap
        self.beam_width = beam_width

    def add(self, prob, complete, prefix):
        heapq.heappush(self.heap, (prob, complete, prefix))
        if len(self.heap) > self.beam_width:
            heapq.heappop(self.heap)

    def __iter__(self):
        return iter(self.heap)


def beamsearch(probabilities_function, beam_width=10, clip_len=-1):
    prev_beam = Beam(beam_width)
    prev_beam.add(1.0, False, [ '<start>' ])

    while True:
        curr_beam = Beam(beam_width)

        #Add complete sentences that do not yet have the best probability to the current beam; the rest prepare to add more words to them.
        for (prefix_prob, complete, prefix) in prev_beam:
            if complete == True:
                curr_beam.add(prefix_prob, True, prefix)
            else:
                #Get probability of each possible next word for the incomplete prefix.
                for (next_prob, next_word) in probabilities_function(prefix):
                    if next_word == '<end>': #if next word is the end token then mark prefix as complete and leave out the end token
                        curr_beam.add(prefix_prob*next_prob, True, prefix)
                    else: #if next word is a non-end token then mark prefix as incomplete
                        curr_beam.add(prefix_prob*next_prob, False, prefix+[next_word])

        sorted_beam = sorted(curr_beam) #get all prefixes in current beam sorted by probability
        any_removals = False
        while True:
            (best_prob, best_complete, best_prefix) = sorted_beam[-1] #get highest probability prefix
            if best_complete == True or len(best_prefix)-1 == clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then yield it
                yield (best_prefix[1:], best_prob) #yield best sentence without the start token and together with its probability
                sorted_beam.pop() #remove the yielded sentence and check the next highest probability prefix
                any_removals = True
                if len(sorted_beam) == 0: #if there are no more sentences in the beam then stop checking
                    break
            else:
                break

        if any_removals == True: #if there were any changes in the current beam then...
            if len(sorted_beam) == 0: #if there are no more prefixes in the current beam (due to clip length being reached) then end the beam search
                break
            else: #otherwise set the previous beam to the modified current beam
                prev_beam = Beam(beam_width, sorted_beam)
        else: #if the current beam was left unmodified then assign it to the previous beam as is
            prev_beam = curr_beam
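Here is an example of how the generator could be used with a toy probabilities function (a hypothetical hard-coded bigram 'language model' over a tiny vocabulary, just to exercise the code):

toy_model = {
    '<start>': [ (0.6, 'the'), (0.4, 'a') ],
    'the': [ (0.5, 'dog'), (0.3, 'cat'), (0.2, '<end>') ],
    'a': [ (0.7, 'cat'), (0.3, '<end>') ],
    'dog': [ (1.0, '<end>') ],
    'cat': [ (1.0, '<end>') ],
}

def probabilities_function(prefix):
    return toy_model[prefix[-1]] #only the last word in the prefix matters in this toy model

for (sentence, prob) in beamsearch(probabilities_function, beam_width=3, clip_len=5):
    print(prob, ' '.join(sentence))

This should yield 'the dog', 'a cat' and 'the cat' in order of probability.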

Tuesday, December 27, 2016

Bernoulli vs Binomial vs Multinoulli vs Multinomial distributions

Bernoulli distribution
Say you are flipping a coin that has a probability of 0.4 of turning up heads and 0.6 of turning up tails. The simplest probability distribution there is is the Bernoulli distribution which just asks for the probability of the coin turning up heads after one toss, that is, 0.4. That's all. The Bernoulli distribution is only about the probability of a random event with two possible outcomes occurring after one trial. It's used for success/fail type random variables where you want to reason about the probability that a random variable will succeed (such as a coin turning up heads).

The probability mass function for the Bernoulli distribution is:
f(x;p) = p^x * (1 - p)^(1-x)
where "*" is multiplication and "^" is power, "p" is the probability of a success and "x" is the number of successes desired, which can be either 1 or 0. When "x" is 1 then you find the probability of a success whilst when it is 0 then you find the probability of a fail. Since there are only two outcomes, if "p" is the probability of a success then the probability of a non-success is "1-p". Notice that this function is quite silly really as it is just a closed form expression for choosing either "p" or "1-p" depending on what "x" is. The number you're interested in is raised to the power of 1 whilst the number you're not interested in is raised to the power of 0, turning it into 1, and then the two results are multiplied together.

Binomial distribution
One generalisation we can make of the Bernoulli distribution is to consider multiple trials instead of just one and then ask about the probability that a given number of successes will occur. This is called the binomial distribution. For example, what is the probability that the coin described above results in exactly 2 heads after tossing it 3 times? There are several ways this result can come about. You can get [head, head, tail], or [head, tail, head], or [tail, head, head]. For this reason we need to consider the number of different ways the required total can be achieved, which is found by using the combination formula.

The probability mass function for the Binomial distribution is:
f(x;n,p) = n!/(x! * (n-x)!) * p^x * (1 - p)^(n-x)
where "*" is multiplication and "^" is power, "p" is the probability of a success, "n" is the number of trials to try out, and "x" is the number of desired successes out of the "n" trials.

Multinoulli distribution
Whereas the binomial distribution generalises the Bernoulli distribution across the number of trials, the multinoulli distribution generalises it across the number of outcomes, that is, rolling a dice instead of tossing a coin. The multinoulli distribution is also called the categorical distribution.

The probability mass function for the multinoulli distribution is:
f(xs;ps) = ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k
where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, which can be either 1 or 0 and where there can only be exactly a single "1" in "xs".

Multinomial distribution
Finally, the most general generalisation of the Bernoulli distribution is across both the number of trials and the number of outcomes, called the multinomial distribution. In order to be more general, this distribution also allows specifying the number of successes desired for each individual outcome, rather than just of one outcome like the multinoulli distribution. This lets us calculate the probability of rolling a dice 6 times and getting 3 "2"s, 2 "3"s and a "5". Since we want it to be compatible with the binomial distribution, these outcomes can arrive in any order, that is, it doesn't matter whether you get [5,3,2,3,2,2] or [2,2,2,3,3,5]. For this reason we need to use the combination function again.

The probability mass function for the multinomial distribution is:
f(xs;n,ps) = n!/(xs_1! * xs_2! * ... * xs_k!) * ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k
where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, the total of which must be "n".

Sunday, April 17, 2016

The Kullback–Leibler divergence

In a text file, each character is represented by a number of bits. Usually each character consists of 8 bits. But in data compression and information theory we learn that the number of bits need not be the same for each character. The most frequent characters can be given the fewest number of bits whilst the least frequent characters can be given many bits. According to Shannon's information theory, the smallest number of bits needed to represent a character that occurs with probability "p" is
-log2(p)

This means that if a character occurs half the time in a document, then you need -log2(0.5) bits which equals 1. On the other hand a character that occurs an eighth of the time should be represented with -log2(0.125) bits which equals 3. The rarer the character, the more bits should be assigned to it in order to be able to use less bits for the more frequent characters. This will result in compressing the whole document.

The entropy of a document is the expected number of bits per character, that is, the average number of bits in a document. If the number of bits is minimum, as described above, the expected number of bits is
-sum(p log2(p) for all character probabilities p)

Now that we know what entropy is, we can understand what the Kullback-Leibler divergence is. Assume that the number of bits for each character is calculated based on a different document than the one being compressed. Clearly this is a recipe for disaster, but you might want to compute an average probability for each character once, based on a representative corpus, and then always use these probabilities in all documents, which saves time. How many extra bits per character will this cost us on average? This is what the Kullback-Leibler divergence measures.

What we'll do is we'll calculate the difference in the average number of bits per character when using the wrong probabilities instead of the correct ones. This is done as follows:
-sum(p log2(q)) - -sum(p log2(p))

Notice how the correct probability for a particular character is "p" but in the first term we're using a different probability "q". This can now be simplified as follows:
-sum(p log2(q)) - -sum(p log2(p))
sum(p log2(p)) - sum(p log2(q))
sum(p log2(p) - p log2(q))
sum(p(log2(p) - log2(q)))
sum(p log2(p/q))

And this is the Kullback-Leibler divergence between the probabilities that "p" comes from and the probabilities that "q" comes from. It is usually used to measure the difference between two probability distributions.
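As a short example, here is the final formula written as a Python function over two character distributions (the distributions below are made up for illustration):

import math

def kl_divergence(p, q):
    # sum(p log2(p/q)) over all characters
    return sum(p[c]*math.log2(p[c]/q[c]) for c in p if p[c] > 0)

p = { 'a': 0.5, 'b': 0.25, 'c': 0.25 } #actual character probabilities in the document
q = { 'a': 0.25, 'b': 0.25, 'c': 0.5 } #probabilities used to assign the number of bits
print(kl_divergence(p, q)) #the average number of extra bits per character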

Friday, November 6, 2015

Naive Bayes Classification

The previous post was about Bayes' theorem, so now we'll talk about a use for it in machine learning: Naive Bayes Classification.

Let's say that you're making a program which given the content of a book, will tell you how likely it is that you will like it. In order to do so, it needs to know the contents of books that you like and books that you don't like. Parsing and understanding the content of a book is crazy hard, so you opt for a simpler strategy: You base the decision on the words used in the book. Books that you like use certain words that you like whilst books that you don't like use words that you don't like.

So you come up with a vocabulary of words (perhaps only a small set of words needs to be considered) and, for each word, you count the percentage of books you like and of books you don't like that contain it. Let's say you end up with a table like this:

% of books that include word

Word\Class | Like book | Hate book
magic      | 100%      | 0%
fairy      | 90%       | 10%
car        | 5%        | 95%
gun        | 0%        | 100%

This means that 90% of the books that you like contain the word "fairy", that is, P(fairy=yes | Like book) = 90%. Note that this is not the same as saying that a book containing the word "fairy" has a 90% chance of being a book you like; that is the kind of probability we're about to calculate.

Now we have a new book and we want to know if we're likely to like it or not. So we check which words it contains and find the following:
Word  | Contained?
magic | yes
fairy | yes
car   | no
gun   | yes

The probability that you'll like the book given that it contains these words is found by calculating
P(Like book | magic=yes, fairy=yes, car=no, gun=yes)

Naive Bayes Classification works by first using Bayes' theorem on the above conditional probability:
P(magic=yes, fairy=yes, car=no, gun=yes | Like book) P(Like book) / P(magic=yes, fairy=yes, car=no, gun=yes)

Now that the list of AND conditions (has magic and fairy and...) is at the front of the conditional, we can use the Naive Bayes Assumption and assume that the occurrence of each term is independent from all the other terms. If we assume this, we can simplify the probability by decomposing the ANDed conditions into separate probabilities multiplied together as follows:
P(magic=yes|Like book) P(fairy=yes|Like book) P(car=no|Like book) P(gun=yes|Like book) P(Like book) / (P(magic=yes) P(fairy=yes) P(car=no) P(gun=yes))

Now we can use the table at the top to find P(word|Like book), the probability P(Like book) is the percentage of books that you like (from those used to construct the table), and P(word) is the probability that a book contains the given word (from the books used to construct the table). These percentages are easy to obtain.

The problem is that one of our percentages is a zero, P(gun=yes | Like book). Because of this, when it is multiplied by the other probabilities, the result will be zero. The solution is to disallow zero probabilities by assuming that just because a word does not occur in the books you like, doesn't mean that it will never occur. It might be that there is a very tiny probability that it will occur, but that you don't have enough books to find it. In these situations, we need to smooth our probabilities using Laplace Smoothing by adding 1 to every count.

Naive Bayes Classification can be used to find the most likely class a list of yes/no answers belongs to (such as whether the book contains the given words), but this is just the simplest type of Naive Bayes Classification, known as Bernoulli Naive Bayes, so called because it assumes a Bernoulli distribution in the probabilities (a Bernoulli distribution is when there are only 2 possible outcomes from an event, with one outcome having a probability of "p" and the other "1-p"). It can also be used on a list of frequencies of the terms by using Multinomial Naive Bayes, or on a list of numbers with a decimal point (such as the weight of the book) using Gaussian Naive Bayes.
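To make this concrete, here is a minimal sketch of a Bernoulli Naive Bayes calculation with Laplace smoothing. The counts are made up to roughly match the table above, and the P(word) denominators are handled by normalising over the two classes instead of being computed separately:

liked_counts = { 'magic': 10, 'fairy': 9, 'car': 1, 'gun': 0 } #number of liked books containing each word (made up)
hated_counts = { 'magic': 0, 'fairy': 1, 'car': 9, 'gun': 10 } #number of hated books containing each word (made up)
num_liked = 10
num_hated = 10

def word_prob(count, num_books):
    return (count + 1)/(num_books + 2) #Laplace smoothing: add 1 to each of the two possible outcomes

def class_score(word_counts, num_books, book_words):
    score = num_books/(num_liked + num_hated) #P(class)
    for (word, contained) in book_words.items():
        p = word_prob(word_counts[word], num_books)
        score *= p if contained else (1 - p)
    return score

book_words = { 'magic': True, 'fairy': True, 'car': False, 'gun': True }
like_score = class_score(liked_counts, num_liked, book_words)
hate_score = class_score(hated_counts, num_hated, book_words)
print(like_score/(like_score + hate_score)) #P(Like book | words in the book)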

Saturday, October 3, 2015

Conditional probabilities and Bayes' theorem

So we all know that when a sports fan asks "What chance does our team have of winning?", the speaker is asking for a probability, but when that same person later asks "What chance does our team have of winning given that John will not be playing?", the speaker is now asking for a conditional probability. In short, a conditional probability is a probability that is changed due to the addition of new information. Let's see an example.

Conditional probabilities

Let's say that we have the following set of numbers, one of which is to be picked at random with equal probability:


The probability of each number being chosen is 1/7. But probabilities are usually based on subsets. So what is the probability of randomly choosing a square number from the above set?


The probability is, of course, 2/7. Now comes the interesting part. Let's say that the number is still chosen at random, but you have the extra information that the number that will be chosen is going to be an even number. In other words, although you don't know which number will be chosen, you do know that it will be an even number. What is the probability that the chosen number will be a square number?


Clearly the added information requires us to change the original probability of choosing a square number. We now have a smaller set of possible choices, only 2 (the red set). From these, there is only 1 square number (the intersection of the red and blue sets). So now the probability of choosing a square number is 1/2.

This is called a conditional probability. Whereas the first non-conditional probability is expressed as follows in mathematical notation:
P(number is square)
the second probability is a conditioned one and is expressed as follows:
P(number is square | number is even)
which is read as "probability that the number is square given that the number is even".

In general,
P(A|B) = P(A,B)/P(B)
where P(A|B) is the probability that event A occurs given that event B has occurred, P(A,B) is the probability that both events occur together (called the joint probability), and P(B) is the probability that event B occurred.


From this, we can derive some pretty interesting equations.

Bayes' theorem

First, by symmetry it is straightforward to define P(B|A) by simply dividing the joint probability by P(A):
P(B|A) = P(A,B)/P(A)

This means that:
P(B|A) P(A) = P(A,B)
and from the other formula, that:
P(A|B) P(B) = P(A,B)
which together mean that:
P(A|B) P(B) = P(B|A) P(A)
and
P(A|B) = P(B|A) P(A)/P(B)

This last equation is known as Bayes' theorem which is something that you'll encounter all the time in probability and artificial intelligence.

In many cases, the probability P(B) is difficult to find, but we can decompose it further by noticing that the probability of selecting from set B depends on whether or not a selection was made from set A. Specifically:
P(B) = P(A) P(B|A) + P(NOT A) P(B|NOT A)
This is saying that the probability of selecting from set B is equal to the probability of one of the following events occurring:
  • A selection is made from set A and it happens to also be an element in set B: P(A) P(B|A)
  • A selection is not made from set A but the selected element is in set B: P(NOT A) P(B|NOT A)

Thus Bayes' theorem can be rewritten as
P(A|B) = P(A) P(B|A) / ( P(A) P(B|A) + P(NOT A) P(B|NOT A) )

This is a more practical version of the formula. Let's see a practical example of it.

Bayes' theorem in action

Let's say that you have a robot that is trying to recognise objects in front of a camera. It needs to be able to recognise you when it sees you in order to greet you and fetch you your slippers. The robot sometimes makes mistakes. It sometimes thinks that it saw you when it did not (a false positive) and it sometimes sees you and doesn't realise it (a false negative). We need to calculate how accurate it is. Let's look at the following probability tree:


This tree is showing the following data:
P(you are there) = 0.1
P(you are not there) = 0.9
P(robot detects you | you are there) = 0.85
P(robot detects you | you are not there) = 0.15
P(robot does not detect you | you are there) = 0.15
P(robot does not detect you | you are not there) = 0.85

What is the probability that the robot detects you when you're there?
P(robot detects you AND you are there) =
P(robot detects you, you are there) =
P(you are there) P(robot detects you | you are there) =
0.1 x 0.85 = 0.085

Notice how we could have used the probability tree to calculate this (multiply the probabilities along a branch to AND them).

If the robot detects you, what is the probability that it is correct?
P(you are there | robot detects you) =
P(you are there) P(robot detects you | you are there) / ( P(you are there) P(robot detects you | you are there) + P(you are not there) P(robot detects you | you are not there) ) =
0.1 x 0.85 / ( 0.1 x 0.85 + 0.9 x 0.15 ) = 0.39

This is a small number, even though the robot correctly detects you 85% of the time when you are there. The reason is that you are in front of it only 10% of the time, which means that most of the time that it is trying to detect you, you are not there, so the 15% false detection rate piles up. One way to increase the accuracy is to limit the number of times an attempted detection is made in such a way that the probability that you are actually there is increased.
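Plugging the numbers above into the formula, as a quick check in Python:

p_there = 0.1
p_detect_given_there = 0.85
p_detect_given_not_there = 0.15

p_detect = p_there*p_detect_given_there + (1 - p_there)*p_detect_given_not_there
p_there_given_detect = p_there*p_detect_given_there/p_detect
print(round(p_there_given_detect, 2)) #0.39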

Bayesian inference

There is more to Bayes' theorem than using it to measure the accuracy of a robot's vision. It has interesting philosophical implications in epistemology. This is because it can be used to model the acquisition of knowledge. When used in this way we say that we are performing Bayesian inference. Let's say that you're a detective collecting clues on who committed a murder. You have a suspect in mind that you believe is the murderer with a certain probability. You find a clue which you believe is evidence that incriminates the suspect. This evidence should now increase your probability that the suspect is the murderer. But how do you find the new probability? Enter Bayes' theorem.

The probability you assigned to the suspect before the new evidence is P(H), the probability of the hypothesis, also known as the prior probability.
The new probability that you should assign to the suspect after discovering the evidence is P(H|E), also known as the posterior probability.
Now we use Bayesian inference to calculate the posterior probability as follows:

P(H|E) = P(H)P(E | H) / ( P(H)P(E | H) + P(NOT H)P(E | NOT H) )

The interpretation of this makes sense. The new probability given the evidence depends on two things:
  • The prior probability that the suspect was the murderer. The smaller this is, the stronger the evidence needs to be to make the hypothesis likely. This is described exactly by the quote "Extraordinary claims require extraordinary evidence".
  • The probability that the evidence would exist given that the suspect was not the murderer. It could be that the evidence actually supports the null-hypothesis, that is, that the suspect is actually not the murderer. This is determined by comparing the probability of the hypothesis with the probability of the null-hypothesis.

Finally, notice also that if you have multiple hypotheses and want to see which is the most likely given new evidence, we are essentially trying to find the maximum posterior probability of each hypothesis given the same evidence. Given the multiple competing hypotheses H_1, H_2, H_3, etc., the most likely H_i is found by:
argmax_i ( P(H_i)P(E | H_i) / ( P(H_i)P(E | H_i) + P(NOT H_i)P(E | NOT H_i) ) )
But we can simplify this by remembering that the denominator is P(E):
argmax_i ( P(H_i)P(E | H_i) / P(E) )
And of course since P(E) is a constant for each hypothesis, it will not affect which hypothesis will give the maximum posterior probability, so we can leave it out, giving:

argmax_i P(H_i)P(E | H_i)

Wednesday, July 8, 2015

Expected number of uniformly distributed collisions (birthday problem)

Here's an interesting mathematical problem. If you have "n" objects to be inserted into "m" available slots using a uniformly distributed random placement, how many collisions with already occupied slots should we expect to happen? This is useful for hashtables and other data structures where duplicates are not allowed.

Here is a Python 3 program that simulates inserting objects into random positions in an array and counting the average number of collisions.

import random

def collisions(n, m):
    trials = 10000
    total_collisions = 0
    for _ in range(trials):
        slot_is_occupied = [ False for _ in range(m) ]
        for _ in range(n):
            slot = random.randint(0, m-1) #uniformly random slot
            if slot_is_occupied[slot]:
                total_collisions += 1
            else:
                slot_is_occupied[slot] = True
    return total_collisions/trials

Here is a sample of the average number of collisions given by the above function for different values of "n" and "m":
n\m |   1      2       3       4       5       6       7       8       9       10
 1  |  0.0    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
 2  |  1.0    0.4947  0.3334  0.2515  0.2054  0.163   0.1437  0.1297  0.1118  0.1053
 3  |  2.0    1.2447  0.8861  0.685   0.557   0.4633  0.4023  0.3537  0.325   0.2819
 4  |  3.0    2.1291  1.5812  1.2588  1.0443  0.9054  0.7754  0.6924  0.6176  0.5459
 5  |  4.0    3.0617  2.4     1.944   1.6457  1.4204  1.2364  1.1123  0.9984  0.9019
 6  |  5.0    4.034   3.265   2.7078  2.304   2.0218  1.7604  1.6004  1.4349  1.318
 7  |  6.0    5.0168  4.1742  3.5406  3.0498  2.6716  2.3973  2.1499  1.9417  1.7744
 8  |  7.0    6.0075  5.1206  4.403   3.8363  3.3905  3.0304  2.7378  2.5151  2.3219
 9  |  8.0    7.0035  6.0765  5.3052  4.6788  4.1652  3.738   3.4168  3.1205  2.8816
10  |  9.0    8.0016  7.0526  6.2233  5.5401  4.9632  4.493   4.0913  3.7721  3.5016

Basically the answer is the number of objects "n" minus the number of occupied slots. This will give us the number of objects excluding the ones which were inserted without collision, that is, in an empty slot. For example, if I insert 5 objects into an array but at the end there are only 3 occupied slots, then that must mean that 2 of those objects were inserted in the same slot as some other objects (they collided with them).

The question is how to predict the expected number of occupied slots.

Expected number of occupied slots

What is the average number of slots ending up being occupied by at least one object? This previous blog post explains that you basically just need to multiply the probability of a given slot being occupied at the end by the number of slots. So what is the probability of a slot being occupied?

Probability of a slot being occupied

What is the probability that an object is inserted into a particular slot out of "m" slots?
1/m

Therefore the probability that the slot remains empty is
1 - 1/m

What is the probability that the slot is still empty after another placement? It's the probability that the first object did not land on the slot AND that the second object did not land on the slot too. These two probabilities are independent of each other, so
(1 - 1/m)(1 - 1/m)

In general, after "n" objects have been placed, the probability that the slot is still empty is
(1 - 1/m)^n

Notice that this makes sense for n = 0 because if no objects were placed, then the probability that the slot is empty is 1.

Which means that after "n" objects have been placed, the probability that the slot is occupied is
1 - (1 - 1/m)^n

Therefore...

Therefore, the expected number of occupied slots among "m" slots after "n" objects have been inserted with uniform probability is
m(1 - (1 - 1/m)^n)

Which means that the expected number of collisions is
n - m(1 - (1 - 1/m)^n)
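This formula is simple enough to check directly against the simulated values; for example (a quick sketch):

def expected_collisions(n, m):
    # n - m(1 - (1 - 1/m)^n)
    return n - m*(1 - (1 - 1/m)**n)

print(round(expected_collisions(10, 3), 4)) #7.052, compared to the simulated 7.0526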

Here is the same table as the one at the top showing the corresponding predicted number of collisions:

n\m |   1      2       3       4       5       6       7       8       9       10
 1  |  0.0    0.0     0.0     0.0     0.0     0.0     0.0     0.0    -0.0     0.0
 2  |  1.0    0.5     0.3333  0.25    0.2     0.1667  0.1429  0.125   0.1111  0.1
 3  |  2.0    1.25    0.8889  0.6875  0.56    0.4722  0.4082  0.3594  0.321   0.29
 4  |  3.0    2.125   1.5926  1.2656  1.048   0.8935  0.7784  0.6895  0.6187  0.561
 5  |  4.0    3.0625  2.3951  1.9492  1.6384  1.4113  1.2387  1.1033  0.9944  0.9049
 6  |  5.0    4.0313  3.2634  2.7119  2.3107  2.0094  1.776   1.5904  1.4394  1.3144
 7  |  6.0    5.0156  4.1756  3.5339  3.0486  2.6745  2.3794  2.1416  1.9462  1.783
 8  |  7.0    6.0078  5.1171  4.4005  3.8389  3.3954  3.0395  2.7489  2.5077  2.3047
 9  |  8.0    7.0039  6.078   5.3003  4.6711  4.1628  3.7481  3.4053  3.118   2.8742
10  |  9.0    8.002   7.052   6.2253  5.5369  4.969   4.4984  4.1046  3.7715  3.4868

The maximum absolute error between the two tables is 0.0179.

Probabilities are average proportions (expected value)

Intuitively, if a coin flip has a probability of 1/2 of turning out heads, and we flipped the coin 100 times, we expect that 1/2 of those 100 flips will be heads. What is meant by "expect" is that if we repeat this 100 coin flip experiment many times, count the number of heads in each 100 flip trial, and take the average of these counts, the average will be close to 1/2 of 100. Furthermore, the more 100 flip trials we include in our average, the closer the average will be to 1/2 of 100.

If this is the case, then a probability can be treated as an average proportion, because if the probability of something happening is, say, 1/100, then after 1000 attempts we should find that, on average, 1/100 of those 1000 attempts result in the thing happening. In general, if the probability of an outcome is "p", and "n" attempts are made, then we should have "pn" positive outcomes. The probability is acting as the average proportion of attempts that result in a positive outcome. In fact, semantically speaking, the phrase "This outcome occurs with probability 1/100" and the phrase "This outcome occurs once every 100 times" are identical.

A simple proof of this is in the way we estimate the probability of an outcome. We attempt to produce the outcome (such as a coin flip resulting in heads) a number of times "n", count the number of times "x" the outcome is positive (heads), and then just find x/n. But in order for this probability to be reliable, the quotient must remain constant for different values of "n" (the value "x" will change according to "n" to keep x/n constant). Given this, if we know a reliable probability x/n, and have performed the experiment "m" times, then the number of positive outcomes "y" can be predicted as follows:
For x/n to be reliable, x/n = y/m
Therefore, y = m(x/n)
That is, since x/n is known and "m" is known, "y" can be found using those two values only.

Of course this is not a rigorous proof. To get a rigorous proof we need to turn to a field of probability called expected value. The expected value of a random variable (such as a coin flip) is the average of the values (assumed to be numerical) of the outcomes after a large number of trials. It is defined as the sum of each outcome multiplied by its probability. For example, the expected value of the value on a die is
1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
because for each outcome from 1 to 6, the probability is 1/6.

In general, if the probability of outcome "o_i" is "p_i", then the expected outcome is
sum(o_i*p_i for all i)

But this isn't useful for proving the statement in the title. The proof is in this Math Exchange answer which explains that the expected number of positive outcomes out of "n" attempts, given that the probability of each outcome each time is "p", is "pn". It goes like this:

Let the random variable "U_i" be the outcome of the "i"th attempt (heads or tails). If the outcome is positive (heads), "U_i" is 1, otherwise it is 0. Given "n" attempts, the number of positive outcomes is
U_1 + U_2 + U_3 + ... + U_n

Call this actual number of positive outcomes "X", that is
X = U_1 + U_2 + U_3 + ... + U_n

The expected value of "X", written as E(X) is
E(X) = E(U_1 + U_2 + U_3 + ... + U_n)

Since the expected value is a linear operator,
E(X) = E(U_1) + E(U_2) + E(U_3) + ... + E(U_n)

Now, given the above definition of what an expected value is,
E(U_i) = 1*(probability of U_i = 1) + 0*(probability of U_i = 0)

If the probability of "U_i" being 1 is "p_i", then
E(U_i) = p_i

But for all "i", the probability of "U_i" is the same. That is
E(U_i) = p

So that means that
E(X) = p + p + p + ... + p
E(X) = pn

And there we have it, the expected number of positive outcomes out of "n" attempts, each of which has a probability of "p", is "pn", which means that the probability "p" can be treated exactly as if it was the proportion of positive outcomes out of a number of trials.
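A quick simulation confirms this (a minimal sketch):

import random

p = 0.3 #probability of a positive outcome on each attempt
n = 1000 #number of attempts per experiment
repetitions = 10000
average_positives = sum(
    sum(random.random() < p for _ in range(n))
    for _ in range(repetitions)
)/repetitions
print(average_positives) #should be close to pn = 300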

Wednesday, June 24, 2015

Compressed frequencies: Representing frequencies with less bits

In a cache memory you usually store the most frequently used data that currently lives in a larger but slower memory. For example, you keep your most frequently accessed files cached in RAM rather than on your hard drive. Since you can't fit all the contents of your hard disk in RAM, you keep only the most frequently used files that can fit. You will still need to go to your hard disk once in a while for your less frequently used files, but the average file access time will be greatly reduced.

The problem is how to keep count of the number of times each file is used in order to know which is the most frequently used. The obvious solution is to associate each file with a number and increment that number each time the file is used. But numbers take space as well, and sometimes this becomes a significant problem. You might not be able to afford wasting 4 or 8 bytes of memory on a frequency integer for every item. Is there a way to bring down the number of bytes used by the frequency integers without losing their usefulness?

Here is an example of a 4 byte integer (an int) in memory representing the number 45723:
00000000 00000000 10110010 10011011

The most obvious thing you can do is to tear away the most significant bits (the ones on the left, which have a larger value) by using smaller range number types such as the 2 byte short, which gives us 10110010 10011011. If the frequency is sometimes, but rarely, larger than the range provided by these smaller types, then you can just cap it off by stopping incrementation once the maximum number is reached. For example, if we're using a two byte short, then the maximum number this can be is 65535. Once the frequency reaches this number, it gets frozen and never incremented again. Many frequencies follow a Zipfian distribution, meaning that the vast majority of items will have a small frequency, followed by a handful of very frequent items. An example of this is words in a document, where most words will only occur once and only a few words such as "the" and "of" will occur frequently. If this is the case then you will be fine with capping off your frequencies, since only a few items will have a large frequency and it might not be important to order these high frequency items among themselves.
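
As a minimal sketch, capping with a two byte short boils down to something like the following (the function name is just for illustration):

def capped_increment(frequency):
    #stop incrementing once the maximum value of an unsigned 2 byte short is reached
    return frequency + 1 if frequency < 65535 else frequency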

It might seem more useful to instead tear away the least significant bits (the ones on the right, which have a smaller value), since losing them changes the count by only a small amount. The way you do this is to divide the frequency by a constant and keep only the whole number part. For example, if we divide the above number by 256, we'd be shifting the bits by one byte to the right, which gives us 00000000 00000000 00000000 10110010. The least significant byte has been removed, which means that we can use fewer bytes to store the frequency. But in order to do that you need to first have the actual frequency, which defeats the purpose. So what we can do is simulate the division by incrementing the frequency only once every 256 times. If we do that then the resulting number will always be a 256th of the actual frequency (rounded down), which is the frequency without its least significant byte. But how do you know when to increment the frequency next? If you keep a separate counter which counts to 256 in order to know when to increment next, then you lose the space you would have saved. Instead we can do it stochastically using random numbers. Increment the frequency with a probability of 1 in 256 and the frequency will be approximately a 256th of the actual frequency.

By combining these two ideas together we can reduce an 8 byte frequency to a single byte, and that byte will approximate one of the original 8 bytes of the actual frequency. Here is a Python function that increments a compressed frequency that is a certain number of bytes long and has a certain number of least significant bytes torn off.

import random

def compressed_increment(frequency, bytes_length, bytes_torn):
    if frequency < 256**bytes_length - 1: #cap the frequency at the maximum number that can be contained in bytes_length bytes
        if random.randint(1, 256**bytes_torn) == 1: #increment with a probability of 1 in 256^bytes_torn (the constant the actual frequency is being divided by)
            return frequency + 1
    return frequency #otherwise leave the frequency unchanged
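
As a rough illustration of how close the approximation gets, we can feed the function a stream of increments and compare the compressed count to the actual count divided by 256 (here using 2 byte counters with 1 least significant byte torn off):

actual = 100000
compressed = 0
for _ in range(actual):
    compressed = compressed_increment(compressed, 2, 1) #2 byte counter with 1 byte torn off
print(compressed, actual // 256) #compressed should be roughly a 256th of the actual count (about 390)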

Of course this is a lossy compression. Information is lost. This means that the compressed frequency is not useful in certain situations, such as when you want to also decrement the frequency or when approximate frequencies are inadequate.

Thursday, May 22, 2014

Boolean Algebra and Probability

Boolean Algebra is a field of algebra which formalizes the operations performed in logic, that is, on propositional statements (statements that are either true or false). This requires operations such as "AND", "OR" and "NOT" in order to compose complex statements from simpler ones, such as "it is raining" and "the road is wet" being composed into something like "NOT(it is raining) AND the road is wet". These operators work in the following way:

The statement "Math is awesome AND One Direction suck" is true because each of the simpler statements being joined together is true. If at least one of the simpler statements is false, then the whole thing becomes false.

The nice thing about this algebra is that it can be naturally extended from normal numeric algebra by representing a true statement with 1 and a false statement with 0. So then we can turn the statement "Math is awesome AND One Direction suck" into "1 AND 1" which is equal to "1". Why does it equal 1? Because for "X AND Y" to be true, both X and Y have to be true individually. For example the statement "Math is awesome AND One Direction make great music" is false because one of those substatements is false. With the OR operator, as long as at least one of the substatements is true, the whole thing becomes true. For example "Math is awesome OR One Direction make great music" is true because one of the substatements is true. The NOT operator takes one substatement and inverts it, that is, if it was true then it becomes false and if it was false then it becomes true. For example "NOT Math is awesome" is false and "NOT One Direction make great music" is true. Numerically speaking, the following list summarizes these rules:

0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1

0 OR 0 = 0
0 OR 1 = 1
1 OR 0 = 1
1 OR 1 = 1

NOT 0 = 1
NOT 1 = 0

Can there be some normal arithmetic equations which will give us these outputs?

NOT is easy. What arithmetic equation returns 1 when given 0 and returns 0 when given 1?
NOT X = 1 - X
1 - 0 gives 1 and 1 - 1 gives 0.

AND is also easy. Just multiply the two operands together.
X AND Y = X × Y

OR is a little more tricky. This is because there is no single arithmetic operation that gives the same outputs as OR. However, we can easily derive an equation by using De Morgan's laws, which express an OR using AND and NOT only. The statement "in order to apply for the job you need a degree or five years' experience" can be expressed as "do not apply for the job if you do not have a degree and have less than five years' experience". Formally this is written as

X OR Y = NOT(NOT X AND NOT Y)

which numerically becomes

X OR Y = 1 - ((1 - X)(1 - Y))
X OR Y = 1 - (1 - Y - X + XY)
X OR Y = 1 - 1 + Y + X - XY
X OR Y = Y + X - XY

This makes sense. The OR operator is the addition of the two substatements minus one in case they are both true. This will return 1 whenever at least one of the two substatements is a 1.
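
We can check that these arithmetic definitions reproduce the truth tables above with a few lines of Python:

def NOT(x): return 1 - x
def AND(x, y): return x * y
def OR(x, y): return x + y - x*y

for x in (0, 1):
    for y in (0, 1):
        print(x, "AND", y, "=", AND(x, y), "|", x, "OR", y, "=", OR(x, y), "|", "NOT", x, "=", NOT(x))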

But it also makes sense on a higher scale. If you think about it, these rules are the exact same rules for probability. Probability is a generalization of Boolean algebra. Rather than just considering truth and falsity like Boolean algebra does, probability considers values between those two cases, such that 1 is certainty, 0 is impossibility and all the numbers in between are a degree of uncertainty, where the closer the number is to 1 the more certain the statement is.

In fact NOT is defined in exactly the same way in probability, and AND becomes a multiplication whenever the two events are independent. But isn't OR defined simply as X + Y? That is in fact only the case when X and Y are events which are mutually exclusive, that is, cannot happen together. In that case X AND Y is always zero and hence can be dropped from OR's equation. The general equation is useful for overlapping events as well: you subtract the probability of both events happening together (which equals the product X × Y only when the events are independent). Take, for example, the probability of "a fair die resulting in an even number or a number greater than 3". In this case you can't just add the probability of the die being even (3/6) and the probability of the die being greater than 3 (3/6), as otherwise you get 3/6 + 3/6 = 1, which would mean that you are certain this will happen. The problem is that it is possible for the two substatements to be both true, that is, the die being both even and greater than 3 (namely 4 and 6). To fix this, just subtract the probability of this special case occurring (2/6), which gives 3/6 + 3/6 - 2/6 = 2/3.
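
We can confirm the 2/3 figure by enumerating the six faces of the die in Python:

faces = range(1, 7)
even = sum(1 for f in faces if f % 2 == 0) / 6 # 3/6
greater = sum(1 for f in faces if f > 3) / 6 # 3/6
both = sum(1 for f in faces if f % 2 == 0 and f > 3) / 6 # 2/6 (the faces 4 and 6)
print(even + greater - both) # 0.666... which is 2/3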

Let's illustrate it with a Venn diagram:


The Venn diagram above shows two events as a red and a blue circle. They have some overlap between them, which is coloured purple. If the red circle represented the chance of a die being even and the blue circle represented the chance of a die being greater than 3, then the purple overlap would be the numbers 4 and 6. Now we want to find the probability of the red OR the blue circle occurring. If we just add their areas together we end up with the purple overlap being added twice, since it is common to both circles, but we only want to add it once. So we subtract it once after adding both circles, which leaves just one purple overlap. This is what is happening when you use the equation X + Y - (X AND Y), where the overlap X AND Y is the product XY when the events are independent.

Friday, April 12, 2013

The Monty Hall problem

This is an interesting mathematical problem in probability which has an intuitive answer. The only problem is that the intuitive answer is wrong, which is what I love about it. It's called the Monty Hall problem and it goes like this:

Imagine you're in a game show and presented with three doors.



Behind one of the doors is a prize. You are to guess behind which door it is. At this point we should all agree that the chance of guessing the right door would be 1/3. You decide to pick door number 1.



You think that you have sealed your fate, but in a surprising turn of events the game show host decides to help you: since at least one of the two doors left must be wrong, the host tells you that door number 3 is surely wrong.



So now you know that the prize is either behind the door you chose or the other remaining door, door number 2. The host now asks you if you'd rather stick with your first choice or choose the other door.

Did the host actually help you? Is there some advantage to knowing which one of the three doors is surely wrong? Intuitively you'd say that your chances of winning by sticking with your first choice or choosing door number 2 are both 1/2. It's either behind one or the other right? So your chances of winning are equal right? Right guys? Guys?

Most people will not be convinced that this intuition is false. So here's a program I've written in Python 3 which simulates this situation 100 times, counting the number of times you would have won if you stayed with your first choice and the number of times you would have won if you swapped your choice.

import random

stay_win = 0
swap_win = 0
doors = {"Door 1", "Door 2", "Door 3"}
for _ in range(100): #For 100 times do the following...
    #Select the correct door
    correct = random.choice(list(doors))
    #The contestant makes their choice
    first_choice = random.choice(list(doors))

    #The door which can be eliminated cannot be the correct door or the contestant's choice
    can_eliminate = doors - { correct, first_choice }
    #Select the eliminated door
    eliminated = random.choice(list(can_eliminate))

    #The contestant has only one door left to choose if they swap their door
    choice_left = doors - { first_choice, eliminated }
    #Make this unchosen, non-eliminated door the second choice for swapping the door
    second_choice = choice_left.pop()

    if first_choice == correct:
        stay_win += 1
    if second_choice == correct:
        swap_win += 1

    print("Correct:", correct, " | ", "First choice:", first_choice, " | ", "Eliminated:", eliminated, " | ", "Second choice:", second_choice, " | ", "Won if stay:", first_choice == correct)
print()
print("stay:", stay_win, "swap:", swap_win)

In order to let you scrutinize the result, all the individual situations are printed as well:
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False

stay: 29 swap: 71

What? That can't be right! You win by swapping almost twice as often as when you stay with your first choice. Let's try it again!

stay: 39 swap: 61

Um...

stay: 37 swap: 63

Hmm... Why would you win twice as often when you swap as when you stay? The only way to understand this is to see what needs to happen in order to win by sticking with your first choice and what needs to happen in order to win by swapping.

Here is what the different possibilities are when the correct door is door number 1:


And here is what the different possibilities are when the correct door is door number 2:


And finally here is what the different possibilities are when the correct door is door number 3:


Did you get it? No? Well, look carefully at what needs to happen in order to win using the first choice. You need to guess the correct door on the first try. What are the chances of you doing that? It's 1/3 of course. Could you ever win by choosing a wrong door on the first try and sticking to it? No.

Well then, what do you need to do in order to win by swapping doors? Clearly, you will only lose if you chose the correct door on the first try; if you chose a wrong door on the first try then you'll win, because the host then eliminates the only other wrong door, leaving the correct one for you to swap to. What are the chances of you choosing the wrong door on the first try? 2/3, twice the chance of choosing the correct one.
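
We can also verify this by exhaustively going through every combination of correct door and first choice, assuming the host always eliminates a wrong door that you did not choose:

doors = ["Door 1", "Door 2", "Door 3"]
stay_wins = 0
swap_wins = 0
for correct in doors:
    for first_choice in doors:
        if first_choice == correct:
            stay_wins += 1 #staying only wins when the first guess was right
        else:
            swap_wins += 1 #otherwise the host eliminates the only other wrong door, so swapping wins
print("stay:", stay_wins, "swap:", swap_wins) #stay: 3 swap: 6 out of 9 combinations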

Clearly you're twice as likely to win by swapping as by staying. Obviously this doesn't mean that you'll always win; it just means that you're more likely to win. This illustrates a very important point, which is that intuition isn't perfect. Sometimes it's just not enough to use intuition to predict what will happen; you need to get down and dirty and actually do the boring work in order to make a correct judgement. Never overestimate yourself.