Saturday, January 23, 2016

Gradient Descent

Algebraic method: 1 variable

Say you have an equation like
y = 2x^2 - 3x + 2
that you needed to find the minimum of:


The quadratic equation is minimum (the turning point at the bottom) somewhere between x = 0.5 and x = 1. How can we find where exactly it is minimum? The standard solution is by algebraically finding where its gradient is 0. The gradient is 0 only at a turning point, which is where the minimum is. You can find the gradient at any point by taking the equation's derivative. The derivative with respect to "x" of y = 2x^2 - 3x + 2 is
dy/dx = 4x - 3


The derivative equation is 0 exactly at the same point where the first equation has a minimum turning point. So we algebraically find at which "x" "4x - 3" is 0:
0 = 4x - 3
x = 3/4 = 0.75
Therefore "x" has to be 0.75.

Algebraic method: 2 variables

What if we have 2 variables instead of just "x"? This is what the equation
z = 2x^2 + 2y^2 + 2xy - 6x
looks like:


It clearly has a minimum point somewhere close to (x,y) = (0,0). The way to find the minimum is to take it's partial derivative with respect to "x" and with respect to "y". A partial derivative is when you consider only one of the variables as a variable whilst considering the other variables as constant. This allows you to find how the gradient of the curve changes with respect to one variable. Here is the partial derivative with respect to "x":
∂z/∂x = 4x + 2y - 6
and here is the partial derivative with respect to "y":
∂z/∂y = 4y + 2x

The idea is now to find where the gradient with respect to both variables is 0. The minimum has a gradient of zero from all directions ("x" and "y") so we find where this is so:
0 = 4x + 2y - 6 [A]
0 = 4y + 2x [B]

From [B]
x = -2y [C]

Substituting [C] in [A]
0 = 4(-2y) + 2y - 6
y = -1

Therefore from [C]
x = 2

Therefore both derivatives are equal to 0 when (x,y) = (2,-1) which means that that is where the equation is minimum.

Gradient Descent

Notice how finding the the minimum of a 2 variable equation required the use of solving simultaneous equations. Imagine having a 100 variables and solving 100 simultaneous equations. Plus there is no guarantee that the equations are solvable algebraically. So sometimes we use approximate methods to find the minimum. In this case we can use a method called Gradient Descent. You start by finding the partial derivatives with respect to each variable, then you pick a point on the equation at random and follow the gradient downwards. Let's see an example with the previous equation.

∂z/∂x = 4x + 2y - 6
∂z/∂y = 4y + 2x

Let's pick (x,y) = (0,0) as a point.

∂z/∂x @ (0,0) = 4(0) + 2(0) - 6 = -6
∂z/∂y @ (0,0) = 4(0) + 2(0) = 0

The gradients suggest that at (0,0) the curve is flat at the "y" axis and sliding downwards in the positive direction of the "x" axis. Here is what that means.

If we fix "y" to be 0, "z" would vary with respect to "x" like this:

Notice how at "x" = 0 the slope steeps down towards the positive direction, which means that "x" = 0 is too small and needs to be bigger for us to find the minimum. The gradient at "x" = 0 is -6, which is negative therefore the minimum is at a greater "x" than 0.

If we fix "x" to be 0, "z" would vary with respect to "y" like this:

Notice how at "y" = 0 the slope is flat, which means that "y" is not important to find the minimum at this point and can be left as is. The gradient at "y" = 0 is 0, therefore we can leave "y" as is.

In that case, we stay where we are in the "y" axis direction and increase the "x". By how much should we increase? Is the point (1,0) closer to the minimum than (0,0)? Is it too small an increase? Would (10,0) be a better new point? Or is it too big and should be (0.1,0) instead? How can we get a new point which is closer to the minimum than (0,0) is?

As we approach the minimum, the gradients will start flattening out in both the "x" and "y" directions and start getting closer to 0. This means that we can, as a rule of thumb, increase or decrease the "x" and "y" in the point in proportion to the gradient. The larger the gradient, the steeper the slope, so the more promising moving in that direction would be to find the minimum. On the other hand, the smaller the gradient, the flatter the slope, so moving in that direction would not be very profitable as we're already close to the minimum there and might overshoot. If the gradient is zero, then we are right where we want to be and shouldn't move in that axis at all. So the greater the gradient, the greater the distance we should move.

It's important to understand that the step size should be in proportion to the gradient, not exactly the gradient. We should move at a fraction of the gradient since the gradient will usually be really big compared to the distance required to move and will probably overshoot all the time if used as is. Let's use 0.1 of the gradient.

new x = 0 - 0.1(-6) = 0.6
new y = 0 - 0.1(0) = 0

Notice that we subtracted the gradient so that if the gradient is negative we move towards the positive direction whilst if the gradient is positive we move towards the negative direction since that is how the slope will be oriented.

We check the gradients again to see if we're there yet:

∂z/∂x @ (0.6,0) = 4(0.6) + 2(0) - 6 = -3.6
∂z/∂y @ (0.6,0) = 4(0) + 2(0.6) = 1.2

At our new point we're at a slope sliding downwards towards the positive direction in the "x" and towards the negative direction in the "y" direction. Here are the graphs to confirm.

If we fix "y" to be 0, "z" would vary with respect to "x" like this:

Notice how at "x" = 0.6 the slope steeps down towards the positive direction. The gradient at "x" = 0 is -3.6.

If we fix "x" to be 0.6, "z" would vary with respect to "y" like this:

Notice how at "y" = 0 the slope is steeps down towards the negative directions. The gradient at "y" = 0 is 1.2.

So we need to make "x" bigger and "y" smaller.

new x = 0.6 - 0.1(-3.6) = 0.96
new y = 0 - 0.1(1.2) = -0.12

And we keep on reiterating until the gradient is sufficiently close to 0 in both the "x" and "y" directions or until the change in both "x" and "y" is small enough.

Here's a list of the first 20 points we'll get:
(0, 0)
(0.6, 0.0)
(0.96, -0.12)
(1.2, -0.26)
(1.37, -0.4)
(1.5, -0.51)
(1.6, -0.61)
(1.68, -0.69)
(1.75, -0.75)
(1.8, -0.8)
(1.84, -0.84)
(1.87, -0.87)
(1.9, -0.9)
(1.92, -0.92)
(1.93, -0.93)
(1.95, -0.95)
(1.96, -0.96)
(1.97, -0.97)
(1.97, -0.97)
(1.98, -0.98)

The point gets closer and closer to the actual minimum which is (2,-1). Here are the illustrated points:


See how as the point approaches the minimum it starts moving less and less?

This is called the Gradient Descent algorithm. We can fine-tune its performance by varying the starting point and the step size fraction by which we're multiplying the gradient.

Here's a Python 3 program that performs Gradient Descent given a list of gradient functions (for each variable):
def grad_desc(gradients, initial_values, step_size, threshold):
    old_values = initial_values
    while True:
        new_values = [ value - step_size*gradient(*old_values) for (value, gradient) in zip(old_values, gradients) ]
        if all(abs(new - old) < threshold for (new, old) in zip(new_values, old_values)):
            return new_values
        old_values = new_values
Here's how you would use the previous example:
grad_desc([ lambda x,y:4*x+2*y-6, lambda x,y:4*y+2*x ], [0, 0], 0.1, 0.001)

Sunday, December 13, 2015

(k-)Nearest Neighbour(s) Classification

You want a computer to learn to assign objects into categories, such as the genre of a book. You happen to have a bunch of books with a known category. One of the simplest ways to make the computer assign an unknown book's category is to find the most similar book in the bunch to the unknown book and assume that the two books share the same category. For example, you want to find what genre "Harry Potter" is and find that it is most similar to a book you have called "The Hobbit" which is tagged as fantasy, so you conclude that "Happy Potter" is also fantasy. Of course this only makes sense if you have a big collection of reference books since there might not be any books which are similar otherwise, and the most similar book would be of a genre which is significantly different.

This is called the nearest neighbours classification algorithm, in particular the 1-nearest neighbour, because you only take into consideration the most similar book. Alternatively you can take the top 10 most similar books and use the most frequent genre among the 10 books. This would be called 10-nearest neighbours classification. In general it's called k-nearest neighbours classification.

This is a simple algorithm but its advantage is in its simplicity since it makes no assumptions about the data you give it. Whereas other machine learning algorithms assume that there is some simple pattern to decide which genre a book belongs to, the nearest neighbour classifier can discriminate between very complex patterns and will adapt to any data you train it with, provided that there is enough variety of data. The more complex the relationship between the books and the genre, the more variety of books you need to train it with.

The way it works is by first converting each book in your bunch into a bunch of lists of numbers called a vectors. Each vector would be a point in space (a vector of 2 numbers is a 2D point, of 3 numbers is a 3D point, and the rest are of high dimensional space). For example, in order to convert a book into a point, each number could be the number of times a particular word occurs. Create a vocabulary of words that matter, such as "wizard" and "gun", and then create a point consisting of the number of times each word occurs in the book. So if "Happy Potter" had "wizard" appearing 100 times and "gun" appearing 0 times, then it's 2D point would be (100, 0).

Next, compare the point version of the book in question to the point versions of every book in the bunch. Use some similarity measure to quantify how similar the points are. Similarity measures include Euclidean distance (normal distance between points) and Cosine similarity (difference in the angle of the points from the origin).



Which is the most similar point to the purple one in the above diagram (the purple point is (100, 0) which represents "Harry Potter")? The purple point will be of the same colour as the closest point.

Of course comparing to every point is slow, which is a problem given that nearest neighbour classification requires a lot of points to compare to. There are nearest neighbour search algorithms but they are not very efficient when the points have a lot of dimensions (many numbers in the vector). In some cases it is enough to use approximate search algorithms that do not give exact nearest point, but will find a reasonably close point quickly. The paper "Scaling Distributional Similarity to Large Corpora" gives an overview of such algorithms for finding words that have similar meanings.

If you do not have the genres of the books but still want to categorize similar books together you can use a clustering algorithm such as k-means clustering into order to group books by similarity and then use nearest neighbour classification to associate the new book with the group of the nearest book.

Friday, November 6, 2015

Naive Bayes Classification

The previous post was about Bayes' theorem, so now we'll talk about a use for it in machine learning: Naive Bayes Classification.

Let's say that you're making a program which given the content of a book, will tell you how likely it is that you will like it. In order to do so, it needs to know the contents of books that you like and books that you don't like. Parsing and understanding the content of a book is crazy hard, so you opt for a simpler strategy: You base the decision on the words used in the book. Books that you like use certain words that you like whilst books that you don't like use words that you don't like.

So you come up with a vocabulary of words (perhaps only a small set of words need to be considered) and you count the number of times each word appears a book you like and in a book you don't like. Let's say you end up with a table like this:

% of books that include word
Word\ClassLike bookHate book
magic100%0%
fairy90%10%
car5%95%
gun0%100%

This means that 90% of books that you like contain the word "fairy", which is another way of saying that a book with the word "fairy" has a 90% chance of being a good book.

Now we have a new book and we want to know if we're likely to like it or not. So we check which words it contains and find the following:
WordContained?
magicyes
fairyyes
carno
gunyes

The probability that you'll like the book given that it contains these words is found by calculating
P(Like book | magic=yes, fairy=yes, car=no, gun=yes)

Naive Bayes Classification works by first using Baye's theorem on the above conditional probability:
P(magic=yes, fairy=yes, car=no, gun=yes | Like book) P(Like book) / P(magic=yes, fairy=yes, car=no, gun=yes)

Now that the list of AND conditions (has magic and fairy and...) is at the front of the conditional, we can use the Naive Bayes Assumption and assume that the occurrence of each term is independent from all the other terms. If we assume this, we can simplify the probability by decomposing the ANDed conditions into separate probabilities multiplied together as follows:
P(magic=yes|Like book) P(fairy=yes|Like book) P(car=no|Like book) P(gun=yes|Like book) P(Like book) / (P(magic=yes) P(fairy=yes) P(car=no) P(gun=yes))

Now we can use the table at the top to find P(word|Like book), the probability P(Like book) is the percentage of books that you like (from those used to construct the table), and P(word) is the probability that a book contains the given word (from the books used to construct the table). These percentages are easy to obtain.

The problem is that one of our percentages is a zero, P(gun=yes | Like book). Because of this, when it is multiplied by the other probabilities, the result will be zero. The solution is to disallow zero probabilities by assuming that just because a word does not occur in the books you like, doesn't mean that it will never occur. It might be that there is a very tiny probability that it will occur, but that you don't have enough books to find it. In these situations, we need to smooth our probabilities using Laplace Smoothing by adding 1 to every count.

Naive Bayes Classification can be used to find the most likely class a list of yes/no answers belongs to (such as whether the book contains the given words), but this is just the simplest type of Naive Bayes Classification known as Bernoulli Naive Bayes, so called because it assumes a Bernoulli distribution in the probabilities (a Bernoulli distribution is when there are only 2 possible outcomes from an event with one outcome having a probability of "p" and the other "p-1"). It can also be used on a list of frequencies of the terms by using a Multinomial Naive Bayes or on a list of numbers with a decimal point (such as the weight of the book) using Gaussian Naive Bayes.

Saturday, October 3, 2015

Conditional probabilities and Bayes' theorem

So we all know that when a sports fan asks "What chance does our team have of winning?", the speaker is asking for a probability, but when that same person later asks "What chance does our team have of winning given that John will not be playing?", the speaker is now asking for a conditional probability. In short, a conditional probability is a probability that is changed due to the addition of new information. Let's see an example.

Conditional probabilities

Let's say that we have the following set of numbers, one of which is to be picked at random with equal probability:


The probability of each number being chosen is 1/7. But probabilities are usually based on subsets. So what is the probability of randomly choosing a square number from the above set?


The probability is, of course, 2/7. Now comes the interesting part. Let's say that the number is still chosen at random, but you have the extra information that the number that will be chosen is going to be an even number. In other words, although you don't know which number will be chosen, you do know that it will be an even number. What is the probability that the chosen number will be a square number?


Clearly the added information requires us to change the original probability of choosing a square number. We now have a smaller set of possible choices, only 2 (the red set). From these, there is only 1 square number (the intersection of the red and blue sets). So now the probability of choosing a square number is 1/2.

This is called a conditional probability. Whereas the first non-conditional probability is expressed as follows in mathematical notation:
P(number is square)
the second probability is a conditioned one and is expressed as follows:
P(number is square | number is even)
which is read as "probability that the number is square given that the number is even".

In general,
P(A|B) = P(A,B)/P(B)
where P(A|B) is the probability that event A occurs given that event B has occurred, P(A,B) is the probability that both events occur together (called the joint probability), and P(B) is the probability that event B occurred.


From this, we can derive some pretty interesting equations.

Bayes' theorem

First, it is clear from the above picture that it is straightforward to define P(B|A) by simply dividing by P(A):
P(B|A) = P(A,B)/P(A)

This means that:
P(B|A) P(A) = P(A,B)
and from the other formula, that:
P(A|B) P(B) = P(A,B)
which together mean that:
P(A|B) P(B) = P(B|A) P(A)
and
P(A|B) = P(B|A) P(A)/P(B)

This last equation is known as Bayes' theorem which is something that you'll encounter all the time in probability and artificial intelligence.

In many cases, the probability P(B) is difficult to find, but we can decompose it further by noticing that the probability of selecting from set B depends on whether or not a selection was made from set A. Specifically:
P(B) = P(A) P(B|A) + P(NOT A) P(B|NOT A)
This is saying that the probability of selecting from set B is equal to the probability of one of the following events occurring:
  • A selection is made from set A and it happens to also be an element in set B: P(A) P(B|A)
  • A selection is not made from set A but the selected element is in set B: P(NOT A) P(B|NOT A)

Thus Bayes' theorem can be rewritten as
P(A|B) = P(A) P(B|A) / ( P(A) P(B|A) + P(NOT A) P(B|NOT A) )

This is a more practical version of the formula. Let's see a practical example of it.

Bayes' theorem in action

Let's say that you have a robot that is trying to recognise objects in front of a camera. It needs to be able to recognise you when it sees you in order to greet you and fetch you your slippers. The robot sometimes makes mistakes. It sometimes thinks that it saw you when it did not (a false positive) and it sometimes sees you and doesn't realise it (a false negative). We need to calculate how accurate it is. Let's look at the following probability tree:


This tree is showing the following data:
P(you are there) = 0.1
P(you are not there) = 0.9
P(robot detects you | you are there) = 0.85
P(robot detects you | you are not there) = 0.15
P(robot does not detect you | you are there) = 0.05
P(robot does not detect you | you are not there) = 0.95

What is the probability that the robot detects you when you're there?
P(robot detects you AND you are there) =
P(robot detects you, you are there) =
P(you are there) P(robot detects you | you are there) =
0.1 x 0.85 = 0.085

Notice how we could have used the probability tree to calculate this (multiply the probabilities along a branch to AND them).

If the robot detects you, what is the probability that it is correct?
P(you are there | robot detects you) =
P(you are there) P(robot detects you | you are there) / ( P(you are there) P(robot detects you | you are there) + P(you are not there) P(robot detects you | you are not there) ) =
0.1 x 0.85 / ( 0.1 x 0.85 + 0.9 x 0.15 ) = 0.39

This is a small number, even though it correctly detects you 85% of the time. The reason is because you are in front of it only 10% of the time, which means that the majority of the time that it is trying to detect you you are not there. This will make that 15% of the time falsely detecting you pile up. One way to increase the accuracy is to limit the number of times an attempted detection is made in such a way that the probability that you are actually there is increased.

Bayesian inference

There is more to Bayes' theorem than using it to measure the accuracy of a robot's vision. It has interesting philosophical implications in epistemology. This is because it can be used to model the acquisition of knowledge. When used in this way we say that we are performing Bayesian inference. Let's say that you're a detective collecting clues on who committed a murder. You have a suspect in mind that you believe is the murderer with a certain probability. You find a clue which you believe is evidence that incriminates the suspect. This evidence should now increase your probability that the suspect is the murderer. But how do you find the new probability? Enter Bayes' theorem.

The probability you assigned to the suspect before the new evidence is P(H), the probability of the hypothesis, also known as the prior probability.
The new probability that you should assign to the suspect after discovering the evidence is P(H|E), also known as the posterior probability.
Now we use Bayesian inference to calculate the posterior probability as follows:

P(H|E) = P(H)P(E | H) / ( P(H)P(E | H) + P(NOT H)P(E | NOT H) )

The interpretation of this makes sense. The new probability given the evidence depends on two things:
  • The likelihood that the suspect was the murderer. The smaller this is, the stronger the evidence needs to be to make the hypothesis likely. This is described exactly by the quote "Extraordinary claims require extraordinary evidence".
  • The probability that the evidence would exist given that the suspect was not the murderer. It could be that the evidence actually supports the null-hypothesis, that is, that the suspect is actually not the murderer. This is determined by comparing the probability of the hypothesis with the probability of the null-hypothesis.

Finally notice also that if you have multiple hypothesis and want to see which is the most likely given a new evidence, we are essentially trying to find the maximum posterior probability of each hypothesis given the same evidence. Given the multiple competing hypothesis H_1, H_2, H_3, etc., the most likely H_i is found by:
argmax_i ( P(H_i)P(E | H_i) / ( P(H_i)P(E | H_i) + P(NOT H_i)P(E | NOT H_i) ) )
But we can simplify this by remembering that the denominator is P(E):
argmax_i ( P(H_i)P(E | H_i) / P(E) )
And of course since P(E) is a constant for each hypothesis, it will not affect which hypothesis will give the maximum posterior probability, so we can leave it out, giving:

argmax_i P(H_i)P(E | H_i)

Monday, September 14, 2015

How to make a multiple choice test using Excel

Here is a post for the teachers out there who are technologically savvy enough to use Excel but not quite enough to write a program or web application. It is easy to make your own multiple choice test in Excel which corrects itself, provided that it is feasible to make give a copy of the Excel file to each student and collect them all after the test. This also assumes that the risk of students not saving or accidentally deleting the file is negligible. But I know teachers who actually do this sort of thing so here is how to do it well.

STEP 1: Activate "Developer" tab
Go on File - Options - Customize Ribbon - Select the "Developer" check box - OK:


STEP 2: Add a group box
Go on Developer - Insert - Group box in Form Control:


Then draw the group box and delete the text on it or write your question there:


STEP 3: Add option buttons
Go on Developer - Insert - Option button in Form Control:


Then draw the option button COMPLETELY INSIDE the group box. This is very important, as if you don't draw it completely inside the group box, it will not be associated with other candidate answers of the same question and will not work properly. It's OK to move it outside of the group box afterwards but not before you draw it inside. Delete the text on the option button or write the candidate answer there:


STEP 4: Complete the test
Repeat steps 3 and 2 as needed. If you need to reposition the form elements you first right click on them and then they are movable. Don't worry about accidentally selecting an option button; just make sure that checking an option button in a control group will not affect other control groups. If this happens then it is because the option buttons are not associated with the control group, because they were not drawn inside it, and you will have to draw a new one instead of it.



STEP 5: Automatically check the answers
Right click on the option buttons - Format Control - Set Cell link to a particular cell:


You only need to set one of the option buttons for each question. Make sure that option buttons of different questions all use different cell links. While you're in the Format Control window you can unset any accidentally set option buttons.

You now have the linked cell of each question contain a number which indicates the selected answer:


The number depends on the order in which the option buttons were added. Next write the correct answer next to each linked cell and next to that add a formula which checks if the right answer was selected. The formula is "=G3=H3" where "G3" is the linked cell and "H3" is the cell with the right answer.


The 3 columns on the side show the chosen answer (automatically set by the option buttons), the correct answer (entered by you), and whether the right answer was chosen or not (automatically set by a formula which compares the previous two).

STEP 6: Automatically compute the mark
Finally, add the following formula under the TRUE/FALSE cells: "=COUNTIF(I3:I9,TRUE)" where "I3:I9" is the range of cells which are TRUE/FALSE.


STEP 7: Barricade the Excel sheet
At the moment the answers are in plain sight and everything is editable which makes it unsuitable for a test. So here's how to fix that.

Unlock the changing cells
Start with setting the cells on the side to unlocked. This will allow the sheet to work when you lock it. Highlight the side cells (including the test mark), then right click - Format Cells - Protection - Uncheck both checkboxes:


If you have any cells which you want the students to edit, such as a space to type their name, unlock these cells as well in the exact same way.

Hide the sensitive information
Next we'll hide the side cells. Highlight the columns with the secret information, then right click on the columns and click hide:




Password protect the sheet
Next we'll make it all password protected so that nothing can be changed except the option buttons. Go on Review - Protect Sheet - Set a password - OK:


Also go on Review - Protect Workbook - Set a password - OK:


Now you have a multiple choice test sheet which cannot be tampered with.

STEP 8: Gathering the marks
The test has been taken and everyone saved their Excel sheet. Now you have to collect all the files and find everyone's mark. This would involve opening each file, unprotecting the sheet with your password, unhiding the hidden columns, and reading the mark at the bottom. Pretty daunting, but avoidable.

You can use Excel to read the data in other Excel files. Just save a blank Excel file with all the answer files and add the following formula: "'[john smith.xlsx]Sheet1'!I13" where "john smith.xlsx" is the file name of the answer file, "Sheet1" is the Excel sheet name in the answer file, and "I13" is the cell containing the mark.



Just do this for all files and you've got a nice result sheet. If you want to give a correction you can even check the TRUE/FALSE column of each answer and say which questions were answered wrong. Use "IF('[john smith.xlsx]Sheet1'!I3, "Correct", "You said " & '[john smith.xlsx]Sheet1'!G3 & " instead of " & '[john smith.xlsx]Sheet1'!H3)" where I3 is the cell with the TRUE/FALSE result of the first question, G3 is the cell with the given answer, and H3 is the cell with the correct answer. You can even add another column in the answer sheet with a comment for whoever gets the question wrong.

Keep in mind that students might use a technique like this to read the hidden stuff in your answer file, but that shouldn't be easy to do without getting caught, especially if you hide a lot of columns (more than needed) and put the data in random columns.

The grid format
You can do your multiple choice in the below format where the sheet contains minimal information and the questions and candidate answers are on a printed sheet of paper.


This allows the students to scribble on the paper and to keep the Excel sheet short which saves scrolling.

Friday, August 7, 2015

Predicting the number of nodes in a trie with uniformly distributed strings

A trie is a type of tree that stores strings. Each character of the strings is a node and strings that share a common prefix also share the nodes, which means that a common prefix is only stored once, reducing some redundancy. But how much space is saved by using a trie? In order to answer this question, first we have to calculate the expected number of nodes a trie will have for "n" strings of "m" characters each with "c" possible characters (character set).

Consider the following diagram of a trie that contains the words "me", "if", "in", and "it". In it we have added a new word "my".


The word "my" only required the creation of one new node, since its first letter already existed in the word "me" so that node was shared and not recreated. In general, if a string is inserted in a trie, the number of new nodes created depends on the length of the longest existing prefix in the trie. This length will be the number of nodes that will be shared/reused. The remainder of the string will require new nodes for each character. If the whole string already exists then there will be 0 new nodes whilst if the string is completely new with no existing prefix then there will be a new node for each character. Specifically, for a string of length "m" whose longest existing prefix is of length "p", the number of new nodes created will be "m - p".

The equation we need to figure out looks like the following:
expected number of nodes = (m)(expected number of strings with prefix of length 1 not found)
                           + (m-1)(expected number of strings with prefix of length 2 not found)
                           + (m-2)(expected number of strings with prefix of length 3 not found)
                           + ...
                           + (1)(expected number of strings with prefix of length m not found)

Assuming that the strings are generated using a uniform distribution (any character can appear anywhere in the string), we need to find the expected number of strings out of "n" inserted strings made from "c" possible characters that will have a non-existing prefix of length "p".

This is basically the expected number of strings being selected for the first time when "n" selections are made from among all possible "p" length strings made from "c" possible characters (there are "c^p" possible such prefixes). This is equivalent to saying that it is the expected number of non-collisions when randomly placing "n" objects in "c^p" slots.

In my previous post, I showed that the expected number of collisions when randomly placing "n" objects in "s" slots is
n - s(1 - (1 - 1/s)^n)
which means that the number of non-collisions is
n - (n - s(1 - (1 - 1/s)^n))
which simplifies to
s(1 - (1 - 1/s)^n)
which when we plug in our values becomes
(c^p)(1 - (1 - 1/(c^p))^n)

But there's a problem. The above equation tells you the expected number of non-collisions when considering "p" length prefixes. But consider the previous diagram again. If the word "he" was added, it is true that the length 2 prefix of the word ("he") does not result in a collision, but this does not mean that just 1 new node will be added. In reality, 2 new nodes will be added because it is also true that its length 1 prefix ("h") will also not result in a collision. What this means is that the equation will not give the number of strings which will not result in a collision due to their length "p" prefix only, but also due to their length "p-1" prefix, which is not what we want. To fix this, we subtract from the equation the number of non-collisions due to the shorter prefix:
expected number of strings with prefix of length p not found = (c^p)(1 - (1 - 1/(c^p))^n) - (c^(p-1))(1 - (1 - 1/(c^(p-1)))^n)
Of course this does not apply for the length 1 prefix, so we need to be careful to only apply the subtraction for prefix lengths greater than one.
(You might think that we need to subtract for each shorter prefix length, but when this was tried the result became a negative number. Perhaps some form of inclusion-exclusion principle needs to be applied. Using this equation, the result matches empirical data for many different parameters.)

So, continuing from our earlier equation,
expected number of nodes = (m)((c^1)(1 - (1 - 1/(c^1))^n))
                           + (m-1)((c^2)(1 - (1 - 1/(c^2))^n) - (c^(2-1))(1 - (1 - 1/(c^(2-1)))^n))
                           + (m-2)((c^3)(1 - (1 - 1/(c^3))^n) - (c^(3-1))(1 - (1 - 1/(c^(3-1)))^n))
                           + ...
                           + (1)((c^m)(1 - (1 - 1/(c^m))^n) - (c^(m-1))(1 - (1 - 1/(c^(m-1)))^n))
= sum( c^i - c^i*((c^i-1)/c^i)^n for i in 1..m )

In Python code this becomes:
from fractions import Fraction
def exp_num_trie_nodes(n,m,c):
    return float(sum( c**i - c**i*Fraction(c**i-1,c**i)**n for i in range(1,m+1) ))

Rate of change

Here is a comparison of how the number of nodes increases depending of which variable (n,m,c) is changed:



As "n" increases, the number of nodes added starts slowing down, which makes sense since the more strings there are, the more existing prefixes can be reused. As "m" increases, the number of nodes added starts speeding up and then becomes linear, which makes sense too since longer strings are sparser and thus it would be harder to find a matching prefix which is long from among 100 strings. As "c" increases, the number of nodes added shoots up until a point where it then slows down, almost like it is logarithmic. This is because after a point it will not matter how rare the strings are since there are only 100 strings to choose from among the "c^m" possible strings. Since the length is not increasing, the same number of nodes will be used.

Size of trie

So does using a trie compress a set of strings? Keep in mind that a node takes more space than a character since it needs to point to other nodes whereas strings are arrays without pointers. We'll assume that all strings are the same length in order to reduce the number of variables. This will reduce the amount of information needed for both the set of strings and the trie (no need to include terminator flags for the strings) and the number of strings of maximum length is greater than the total number of shorter strings so it will not be a significant error in representation.

Call the number of nodes in the trie "N(n,m,c)".

The size of the normal set of strings is as follows:
n(m log(c))
where "log(c)" is the size of each character (the number of bits needed to represent each character). Of course this assumes that each string is unique. Tries only store unique strings and the way we compute the number of nodes does not assume that the strings will be unique. So we need to subtract the expected number of repeated strings from among those "n" strings. The number of repeated strings is equal to the number of collisions when placing "n" objects in "c^m" slots.
Array of strings: n(m log(c)) - (n - (c^m)(1 - (1 - 1/(c^m))^n))

The size of the trie is as follows:
N(n,m,c)(k(log(c) + log(N(n,m,c))))
where "log(c)" is the size of each character (the number of bits needed to represent each character), "log(N(n,m,c))" is the size of a pointer (which at minimum would be the logarithm of the number of nodes), and "k" is the number of pointers used on average per node. Given that the majority of the nodes in a trie will be leaf nodes, the majority of nodes will not have children. In fact the average will be less than one child per node. If arrays are used, "k" must be equal to "c", but if a linked list is used then "k" is the average but we have to also include the linked list pointer size with each character. The pointer size of the linked lists can be assumed to be "log(N(n,m,c))" since the total number of child nodes is equal to the number of nodes (minus the root node).
Array based: N(n,m,c)(c(log(c) + log(N(n,m,c))))
Linked list based: N(n,m,c)(k(log(c) + log(N(n,m,c)) + log(N(n,m,c))))

Here is a graph showing how the set of strings, array based trie, and linked list based trie increase in size with "n" when "c" is 5, "m" is 5, and "k" is 0.9:



It is clear that an array based trie cannot be used to compress a collection of strings as nodes take too much space. But what if we changed the value of "k" in the linked list based trie?



This shows that unless you have an average number of children per node of 0.2 or less, the array of strings will always take less space. Notice that this says nothing about tries which attempt to minimize the number of nodes such as radix trees where a single node represents a substring rather than a character. Also notice that this is about uniformly distributed strings, not linguistic strings which have a lot of redundancy. In a future post I shall make empirical comparisons on linguistic data.

Wednesday, July 8, 2015

Expected number of uniformly distributed collisions (birthday problem)

Here's an interesting mathematical problem. If you have "n" objects to be inserted into "m" available slots using a uniformly distributed random placement, how many collisions with already occupied slots should we expect to happen? This is useful for hashtables and other data structures where duplicates are not allowed.

Here is a Python 3 program that simulates inserting objects into random positions in an array and counting the average number of collisions.

def collisions(n, m):
 trials = 10000
 total_collisions = 0
 for _ in range(trials):
  slot_is_occupied = [ False for _ in range(m) ]
  for _ in range(n):
   slot = random.randint(0, m-1)
   if slot_is_occupied[slot]:
    total_collisions += 1
   else:
    slot_is_occupied[slot] = True
 return total_collisions/trials

Here is a sample of the average number of collisions given by the above function for different values of "n" and "m":
n\m12345678910
10.00.00.00.00.00.00.00.00.00.0
21.00.49470.33340.25150.20540.1630.14370.12970.11180.1053
32.01.24470.88610.6850.5570.46330.40230.35370.3250.2819
43.02.12911.58121.25881.04430.90540.77540.69240.61760.5459
54.03.06172.41.9441.64571.42041.23641.11230.99840.9019
65.04.0343.2652.70782.3042.02181.76041.60041.43491.318
76.05.01684.17423.54063.04982.67162.39732.14991.94171.7744
87.06.00755.12064.4033.83633.39053.03042.73782.51512.3219
98.07.00356.07655.30524.67884.16523.7383.41683.12052.8816
109.08.00167.05266.22335.54014.96324.4934.09133.77213.5016

Basically the answer is the number of objects "n" minus the number of occupied slots. This will give us the number of objects excluding the ones which were inserted without collision, that is, in an empty slot. For example, if I insert 5 objects into an array but at the end there are only 3 occupied slots, then that must mean that 2 of those objects were inserted in the same slot as some other objects (they collided with them).

The question is how to predict the expected number of occupied slots.

Expected number of occupied slots

What is the average number of slots ending up being occupied by at least one object? This previous blog post explains that you basically just need to multiply the probability of a given slot being occupied at the end by the number of slots. So what is the probability of a slot being occupied?

Probability of a slot being occupied

What is the probability that an object is inserted into a particular slot out of "m" slots?
1/m

Therefore the probability that the slot remains empty is
1 - 1/m

What is the probability that the slot is still empty after another placement? It's the probability that the first object did not land on the slot AND that the second object did not land on the slot too. These two probabilities are independent of each other, so
(1 - 1/m)(1 - 1/m)

In general, after "n" objects have been placed, the probability that the slot is still empty is
(1 - 1/m)^n

Notice that this makes sense for n = 0 because if no objects were placed, then the probability that the slot is empty is 1.

Which means that after "n" objects have been placed, the probability that the slot is occupied is
1 - (1 - 1/m)^n

Therefore...

Therefore, the expect number of occupied slots among "m" slots after "n" objects have been inserted with uniform probability is
m(1 - (1 - 1/m)^n)

Which means that the expected number of collisions is
n - m(1 - (1 - 1/m)^n)

Here is the same table as the one at the top showing the corresponding predicted number of collisions:

n\m12345678910
10.00.00.00.00.00.00.00.0-0.00.0
21.00.50.33330.250.20.16670.14290.1250.11110.1
32.01.250.88890.68750.560.47220.40820.35940.3210.29
43.02.1251.59261.26561.0480.89350.77840.68950.61870.561
54.03.06252.39511.94921.63841.41131.23871.10330.99440.9049
65.04.03133.26342.71192.31072.00941.7761.59041.43941.3144
76.05.01564.17563.53393.04862.67452.37942.14161.94621.783
87.06.00785.11714.40053.83893.39543.03952.74892.50772.3047
98.07.00396.0785.30034.67114.16283.74813.40533.1182.8742
109.08.0027.0526.22535.53694.9694.49844.10463.77153.4868

The maximum absolute error between the two tables is 0.0179.