## Saturday, October 21, 2017

### Hyperparameter tuning using Hyperopt

One of the most tedious but important things there is in machine learning is tuning the hyperparameters of your machine learning algorithm, such as the learning rate and initial parameters in gradient descent. For example you might want to check how your gradient descent algorithm performs when the learning rate is 1.0, 0.1, or 0.01 and the initial parameters being randomly chosen between -10.0 and 10.0 or between -1.0 and 1.0.

The problem with doing this is that unless you have several supercomputers at your disposal, testing the learning algorithm is a slow process and there are so many different ways to configure your hyperparameters. In the above example you'd have to try six different configurations: (1.0,(-10.0,10.0)), (1.0,(-1.0,1.0)), (0.1,(-10.0,10.0)), etc.. An exhaustive search (grid search) would take too long if each test takes very long and in practice you'd have a search space which is much bigger than six. We can test a random sample of the configurations but ideally the sample would be chosen intelligently rather than randomly. We could start out by trying some randomly chosen configurations and then start homing in on some of the more promising hyperparameter choices.

This is where the Python library Hyperopt comes in. It's a simple library that searches a space of hyperparameters in a rational way so that you'll end up with a better result after trying 100 configurations than if you just randomly sampled 100 different configurations and picked the best performing one. It does this by using a tree of Parzen Estimators.

Let's start with a simple gradient descent example that finds the minimum of "y = x^2". We'll write a function that takes in a learning rate, and a range within which to initialize "x" (we'll only ask for the maximum of this range and assume that the negative of this number is the minimum). We'll then apply gradient descent on the initial "x" value for 10 epochs. The function will finally return the value of "x" that was found near the minimum.

import random

def loss(x):
return x**2

return 2*x

x = random.uniform(-max_init_x, max_init_x)
for _ in range(10):
return x


Great, now we want to find the best hyperparameters to use. In practice we'd have a validation set for our machine learning model to learn to perform well on. Once the hyperparameters that result in the best performance on the validation set are found, we'd apply them to learn on a separate test set and it is this performance that is used to judge the quality of the learning algorithm. However, for this blog post we'll instead focus on the simpler mathematical optimization problem of finding the minimum of "y = x^2".

This is how you use Hyperopt to find the best hyperparameter combination. First you create a function called an objective function that takes in a tuple with the chosen hyperparameters and returns a dictionary that is read by hyperopt to assess the fitness of the chosen hyperparameters. Then you take this function and the possible hyperparameter choices you want to allow and pass them to the hyperopt function called "fmin" which finds the hyperparameters that give the smallest loss.

import hyperopt

def hyperopt_objective(hyperparams):
(learning_rate, max_init_x) = hyperparams
return {
'loss':   l,
'status': hyperopt.STATUS_OK,
}

best = hyperopt.fmin(
hyperopt_objective,
space     = [
hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
],
algo      = hyperopt.tpe.suggest,
max_evals = 10
)
print(best)


The output of this program is this:
{'learning_rate': 1, 'max_init_x': 1}

This is saying that the best loss is obtained when the learning rate is the item at index 1 in the given list (0.1) and the maximum initial value is the item at index 1 in the given list (1.0). The "space" parameter in fmin is there to say how to construct a combination of hyperparameters. We specified that we want a list of two things: a learning rate that can be either 1.0, or 0.1, or 0.01, and a maximum initial value that can be either 10.0 or 1.0. We use "hp.choice" to let fmin choose among a list. We could instead use "hp.uniform" in order to allow any number within a range. I prefer to use a list of human friendly numbers instead of allowing any number so I use the choice function instead. We have also said that we want to allow exactly 10 evaluations of the objective function in order to find the best hyperparameter combination.

Although this is how we are expected to use this library, it is not very user friendly in general. For example there is no feedback given throughout the search process so if each evaluation of a hyperparameter combination takes hours to complete then we could end up waiting for several days without seeing anything, just waiting for the function to return a value. The return value also requires further processing as it's just indexes. We can make this better by adding some extra stuff to the objective function:

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
global eval_num
global best_loss
global best_hyperparams

eval_num += 1
(learning_rate, max_init_x) = hyperparams
print(eval_num, l, hyperparams)

if best_loss is None or l < best_loss:
best_loss = l
best_hyperparams = hyperparams

return {
'loss':   l,
'status': hyperopt.STATUS_OK,
}


We can now see each hyperparameter combination being evaluated. This is the output we'll see:
1 1.6868761146697238 (1.0, 10.0)
2 0.34976768426779775 (0.01, 1.0)
3 0.006508209785146999 (0.1, 1.0)
4 1.5999357079405185 (0.01, 10.0)
5 0.2646974732349057 (0.01, 1.0)
6 0.5182259594937579 (1.0, 10.0)
7 53.61565213613977 (1.0, 10.0)
8 1.8239879002601682 (1.0, 10.0)
9 0.15820396975495435 (0.01, 1.0)
10 0.784445725853568 (1.0, 1.0)


Also, the variable best_hyperparams will contain the tuple with the best hyperparameters found. Printing best_hyperparams will show "(0.1, 1.0)". We can even save the best hyperparameters found till now in a file so that we can stop the search early if we run out of patience.

Here is the full code in one place:

import random
import hyperopt

def loss(x):
return x**2

return 2*x

x = random.uniform(-max_init_x, max_init_x)
for _ in range(10):
return x

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
global eval_num
global best_loss
global best_hyperparams

eval_num += 1
(learning_rate, max_init_x) = hyperparams
print(eval_num, l, hyperparams)

if best_loss is None or l < best_loss:
best_loss = l
best_hyperparams = hyperparams

return {
'loss':   l,
'status': hyperopt.STATUS_OK,
}

hyperopt.fmin(
hyperopt_objective,
space     = [
hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
],
algo      = hyperopt.tpe.suggest,
max_evals = 10
)

print(best_hyperparams)


To find out more about Hyperopt see this documentation page and the Github repository.

## Monday, September 18, 2017

### Find an equation that passes through a set of points (Lagrange polynomial interpolation)

When I was in school, I remember learning about how to find the equation of a line that passes through two points. When I asked how to find the equation of a curve instead I was told that this was not possible, probably because there is an infinite number of curves that pass through a finite set of points (as we'll see below). However I would have been happy to learn about this simple method to produce a polynomial curve that passes through any given set of points. In general this task is called interpolation and the particular interpolating method we shall see here is called Lagrange polynomials.

A polynomial is an equation of the form

$$a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$$

For example $2x^3 + 4x^2 - 3x + 1$ is a polynomial where 2 is $a_3$, 4 is $a_2$, 4 is $a_1$ and 1 is $a_0$. A polynomial plus another polynomial is also a polynomial, for example $(x^2 + 2x) + (3x + 2) = x^2 + 5x + 2$. A polynomial times another polynomial is also a polynomial, for example $(x^2 + 2x)(3x + 2) = 3x^3 + 8x^2 + 4x$.

Now on to polynomial interpolation. Let's start with the simplest case, when you have only one point. Let's say you want a polynomial that passes through (3,2), that is, when $x$ is 3, the polynomial should equal 2. In this case our equation can simply be

$$y = 2$$

that is, the polynomial is just 2. The equation is always 2 regardless of where $x$ is so we have found our polynomial.

Now on to a more interesting case. Let's say we want to pass through these points:
• (3,2)
• (4,3)
• (6,4)

The trick is to add up a set of polynomials each of which goes through one point. We need a polynomial that equals 2 when $x$ is 3, another polynomial that equals 3 when $x$ is 4, and another polynomial that equals 4 when $x$ is 6. But we have to be careful as we don't want these three polynomials to interfere with each other after being added up together. For example if we can't just do $y = (2) + (3) + (4)$ as the two terms will interfere with each other and the result will not pass through any of the points. Instead we need each polynomial to have a shape such that each one passes through its corresponding point but then passes through 0 in place of the remaining points. The three polynomials we need to add up together are:
• Polynomial corresponding to (3,2): must equal 2 when $x$ is 3 and equal 0 when $x$ is 4 or 6.
• Polynomial corresponding to (4,3): must equal 3 when $x$ is 4 and equal 0 when $x$ is 3 or 6.
• Polynomial corresponding to (6,4): must equal 4 when $x$ is 6 and equal 0 when $x$ is 3 or 4.
Since the polynomials equal 0 where the other points are then they will not interfere with each other where the polynomials pass through their corresponding point.

Let's focus on the zeros first. There is an easy trick to ensure that a polynomial goes through zero at certain values of $x$. If you want your polynomial to be zero when $x$ is 4 or 6, then use $(x-4)(x-6)$. In this polynomial, when $x$ is 4 then the first bracket will equal 0, which makes the whole thing 0, and when $x$ is 6 the second bracket will equal 0 which will also make the whole thing 0. This is what $y = (x-4)(x-6)$, $y = (x-3)(x-6)$ and $y = (x-3)(x-4)$ looks like:

See how each curve passes through zero at every point except one? That exception point will be the corresponding point of each polynomial. Adding these polynomials up will not make them interfere with each other at the x-values of each point. But in order to get the resultant polynomial to pass through the points, we need each separate polynomial to pass through its corresponding point whilst still being zero at every other point. For this we apply another easy trick which is to multiply the polynomials by a number. Multiplying an equation by a number will not change where it is zero. It will change the shape of the curve but the places at which it is zero will not be moved. This is what $(x-4)(x-6)$ looks like after being multiplied by 2, 0.5, and -1:

So we just need to find the number that when multiplied by each separate polynomial will make it pass through its corresponding point. This number is the y-value of the corresponding point divided by the current value of the polynomial there. What we're doing is first making the polynomial equal 1 at its corresponding point and then multiplying it by whatever number we want it to be. For example if you want to multiply the number 2 by a number that will make the result 3, that number should be $\frac{3}{2}$, that is, $2 \times \frac{3}{2} = 3$.

So what is the current value of $(x-4)(x-6)$ at its corresponding point, (3,2)? It's $(3-4)(3-6) = 3$. So by multiplying $(x-4)(x-6)$ by $\frac{2}{(3-4)(3-6)}$ we can make the polynomial pass through 2, without changing where it is zero. This is what $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$, $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$ and $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$ looks like:

Now we can add them up into
$$y = (x-4)(x-6)\frac{2}{(3-4)(3-6)} + (x-3)(x-6)\frac{3}{(4-3)(4-6)} + (x-3)(x-4)\frac{4}{(6-3)(6-4)}$$
which is simplified into $-\frac{1}{6} x^2 + \frac{13}{6} x - 3$

If we added another point at (10,7) then the polynomial passing through all the points would be
$$y = \frac{2(x-4)(x-6)(x-10)}{(3-4)(3-6)(3-10)} + \frac{3(x-3)(x-6)(x-10)}{(4-3)(4-6)(4-10)} + \frac{4(x-3)(x-4)(x-10)}{(6-3)(6-4)(6-10)} + \frac{7(x-3)(x-4)(x-6)}{(10-3)(10-4)(10-6)}$$

See how we have found another curve that also passes through the previous 3 points? This equation building method proves that you can find an infinite number of polynomials that pass through a finite number of points, since you can always make a polynomial that passes through the given set of points plus any other point anywhere where each position of the new point requires a differently shaped curve. Since there is an infinite amount of positions where to place the extra point, there is an infinite number of polynomials (and thus curves) that can pass through the given set of points.

In general, given points $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$, your polynomial will be
$$\sum^{n}_{i=1} y_i \frac{\prod^{n}_{j=1,j\neq i} (x-x_j)}{\prod^{n}_{j=1,j\neq i} (x_i-x_j)}$$

Notice that you cannot have two points with the same x-value or you'll get a division by zero. Notice also how you are not guaranteed to get an extrapolation from your points. The polynomial will completely derail into an upward or downward curve beyond the range of the given points, regardless of any pattern suggested by the points. Finally your equation will get more and more complicated with every new point, even after simplification. If you have $n$ points then you will have up to $n$ terms in your polynomial. It might be simplifiable into less terms, but that is rare.

## Monday, August 28, 2017

### Proof that the sum of digits of any multiple of 3 (or 9) is itself a multiple of 3 (or 9)

Take any multiple of 3, such as 327. You can verify that it really is a multiple of 3 by adding together all its digits, that is 3+2+7=12, and checking if the sum is itself a multiple of 3. 12 is a multiple of 3, so 327 must also be a multiple of 3. Since the sum of digits will be a much smaller number than the original number, it will be easier to determine if it is a multiple of 3 or not. Of course you can then take the sum of digits and find its sum of digits again repeatedly until it's a single digit number: 3, 6, 9. The same thing is true of multiples of 9 but instead the sum of digits will be a multiple of 9.

This is the proof that this works for every multiple of 3:

Start with a number with $n$ digits, such as the 3 digit number 327. We'll call each of these digits $a_i$, starting from $a_{n-1}$ for the first digit and ending with $a_0$ for the last digit. So 327 has $a_2 = 3$, $a_1 = 2$, and $a_0 = 7$.

Our number is equal to the following:
$$a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$$
This is just the hundreds, tens, units formulation of numbers. For example:
$$327 = 3 \times 10^2 + 2 \times 10^1 + 7 \times 10^0 = 3 \times 100 + 2 \times 10 + 7 \times 1 = 327$$

Now the trick is to replace each $10^x$ with $999...+1$. For example $100 = 99+1$.

$$\begin{eqnarray} a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0 &=& a_{n-1} $999... + 1$ + a_{n-2} $99... + 1$ + \ldots + a_1 $9 + 1$ + a_0 $0 + 1$ \\ &=& $a_{n-1} 999... + a_{n-1}$ + $a_{n-2} 99... + a_{n-2}$ + \ldots + $a_1 9 + a_1$ + $a_0 0 + a_0$ \\ &=& $a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9 + a_0 0$ + $a_{n-1} + a_{n-2} + \ldots + a_1 + a_0$ \\ \end{eqnarray}$$

Now we'll make some terms switch places in the equation:
$$\begin{eqnarray} a_{n-1} + a_{n-2} + \ldots + a_1 + a_0 &=& $a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$ - $a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9 + a_0 0$ \\ &=& $a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$ - $a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9$ \end{eqnarray}$$

Notice that in the last line above we have eliminate the term $a_0 0$ because it's a multiplication by 0.

Now, if our original number is a multiple of 3, then it must be that the right hand side at the end of the above equation is a multiple of 3. Any string of 9s is always a multiple of 3 since $999\ldots = 3 \times 3 \times 111\ldots$. It is also a multiple of 9, but we'll get to that later. This means that the two bracketed terms in the last line of the above equation are both multiples of 3 because:
• $a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$: is a multiple of 3 by definition (we said that the number would be one).
• $a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9$: is a sum of numbers multiplied by strings of 9s, which means that we can factor out the common factor 3.

The difference of two multiples of 3 is itself a multiple of 3. Therefore the left hand side must also be a multiple of 3 (since the two sides are equal), and the left hand side just happens to be:

$$a_{n-1} + a_{n-2} + \ldots + a_1 + a_0$$

which is the sum of all digits. Hence for any number that is a multiple of 3, the sum of its digits must also be a multiple of 3.

Until now we have only shown that any multiple of 3 will have the sum of its digits also be a multiple of 3. But can a number which is not a multiple of 3 coincidentally have the sum of its digits be a multiple of 3? No, because otherwise it would imply that a non-multiple of 3 minus a multiple of 3 is a multiple of 3. The second term in the subtraction in the right hand side of the above derivation is always going to be a multiple of 3, but in order for the whole of the right hand side to be a multiple of 3 you will need both terms being a multiple of 3. So you can rest your mind that checking if a large number is a multiple of 3 can be as simple as checking if the sum of its digits is a multiple of 3. And if that sum is still too big you can check if the sum of its digits are a multiple of 3 and so on, because you're always just reusing the same test each time.

What about multiples of 9? As already mentioned above, a string of 9s is also always a multiple of 9. So the second term in the subtraction above is also always a multiple of 9. Hence if both terms of the subtraction are multiples of 9, then the sum of digits is going to be a multiple of 9.

## Friday, July 14, 2017

### Making a point and click / room escape game using PowerPoint

Before we begin, it is important that you understand that there is no simple way to make a complex point and click game with PowerPoint where actions in one room affect what happens in other rooms. You can only make games where everything happens in the same room. You can have different rooms but all the puzzles of a room must be solved within the room. This isn't so bad. There are existing games where you have to go through several levels with each level having a small puzzle to solve, such as this. You also cannot have an inventory or pass codes without going into a lot of complexity that would make more sense using actual game programming. That said, it can still be a creative use of PowerPoint for children so here is how you do it.

The heart of the point and click game is the discovery of items that can be used to get over obstacles. You find a key and then you open a door. In PowerPoint such a thing can be done using transparent boxes that are in front of clickable objects. When you click on a key, a trigger event will set an exit animation on the transparent box, making whatever was behind it available for clicking. Let's begin.

Start by drawing a key and a door. I'm using a star to represent a key and I put a star shape on the door to represent a key hole. Also add a second slide with the next room after going through the door. I just put a smiley on the next slide to show that the game has ended.

Next make the door take you to the next slide when clicked. This is done using a hyperlink. Just right click on the door, go on hyperlink, place in this document, next slide, OK. Now when you start the presentation, clicking on the door will take you to the next slide.

Of course clicking anywhere in a slide will take you to the next slide. We don't want that. You should only be able to go to the next slide by clicking on the door. Go on transitions and unset advance slide on mouse click.

Now put a box over the door covering it completely. Make this box transparent (not "no fill"). The door is now not clickable any more because there is a box over it. Just right click the box, format shape, fill, transparency 100%. I left the border there so that you can see where the box is but you can remove the outline so that it looks better.

Now we want the key to disappear when clicked on. Click on the key, go on animations, add animation, and choose an exit animation such as Fade. Open the animation pane. You'll see the animation you just created and the automatically assigned name of the key. In my case the name is "4-Point Star 5"

At the moment, the key will disappear as soon as you click anywhere, not when you click on the key. To fix this we can click on the key animation (the orange box with "1" in it next to the key), go on trigger, on click of, and choose the name of the key shape you saw in the animation pane. Now the key's exit animation will only start when you click on the key itself.

When the key is picked up, the door should be usable. This mean that we want the transparent box over the door to disappear. We can do the same exit animation with trigger that we did with the key to the transparent box and make it exit when the key is clicked on. I made the door disappear rather than fade so that there is no delay and the transparent box is removed immediately.

As things stand, we have two exit animations: one for the key and one for the transparent box. However these two animations will not happen together but will each wait for their own click on the key. To make both animations happen together just click on the transparent box's animation and choose start with previous instead of start on click.

That's it. Your one-key-one-door point and click game is ready. Start the animation by pressing F5 and see what happens. Notice that you cannot click on anything other than the key. After clicking on the key it will disappear and then you can click on the door. Clicking on the door will take you to the next room.

This basic mechanic that I just described can be used to open chests that contain other keys, kill dragons that guard lairs, and turn off booby traps. You can put keys behind objects that have a motion animation when clicked on so that you find a key behind them. You can even put multiple keys, each of which is accessible only after getting the previous one and the final key is the one that opens the door. You can also have multiple rooms where the next slide is another is another room with a locked door. Can you think of other things to make an interesting point and click game in PowerPoint?

## Sunday, June 25, 2017

### Display all outputs in Python immediately when in cmd/terminal

When running a Python program in the terminal, it sometimes happens that there is a long delay before I see some print statement's output. Sometimes I have to wait until the program finishes.

If it's just one print statement you're concerned with then you only need to import the "sys" module and call "sys.stdout.flush()" in order to force the program to show anything that has been printed but is still hidden in the buffer.

import sys

print('bla bla bla')
sys.stdout.flush()


But most of the time you want this to always happen and not after some particular point in the program. To force Python to always show whatever is printed just add the command line option "-u" when running python.

> python -u myscript.py


https://docs.python.org/3.6/using/cmdline.html#cmdoption-u

## Tuesday, May 30, 2017

### Neural language models and how to make them in Tensorflow 1.0

In this blog post I will explain the basics you need to know in order to create a neural language model in Tensorflow 1.0. This tutorial will cover how to create a training set from raw text, how to use LSTMs, how to work with variable length sentences, what teacher forcing is, and how to train the neural network with early stopping.

Introduction

Components and vocabulary

A neural language model is a neural network that, given a prefix of a sentence, will output the probability that a word will be the next word in the sentence. For example, given the prefix "the dog", the neural network will tell you that "barked" has a high probability of being the next word. This is done by using a recurrent neural network (RNN), such as a long short term memory (LSTM), to turn the prefix into a single vector (that represents the prefix) which is then passed to a softmax layer that gives a probability distribution over every word in the known vocabulary.

In order to be able to still have prefix before the first word (to be able to predict a first word in the sentence) we will make use of an artificial "<BEG>" word that represents the beginning of a sentence and is placed before every sentence. Likewise we will have an artificial "<END>" word that represents the end of sentence in order to be able to predict when the sentence ends.

It is important to note that the words are presented to the neural network as vectors and not as actual strings. The vectors are called word "embeddings" and are derived from a matrix where each row stands for a different word in the vocabulary. This matrix is trained just like the neural network's weights in order to provide the best vector representation for each word.

Training

The probabilities are not given explicitly during training. Instead we just expose the neural network to complete sentences taken from a corpus (we will use the Brown corpus from NLTK) and let it learn the probabilities indirectly. Before training, each word will have an approximately equal probability given any prefix. The training procedure works by inputting a prefix of a sentence at one end of the neural network and then forcing it to increase the probability of the given next word in the sentence by a little bit. Since the probabilities of all the words in the output must add up to 1, increasing the probability of the given next word will also decrease the probability of all the other words by a little bit. By repeatedly increasing the probability of the correct word, the distribution will accumulate at the correct word.

Of course given a prefix there will be several words that can follow and not just one. If each one is increased just a bit in turn repeatedly, all the known correct next words will increase together and all the other words will decrease, forcing the probability distribution to share the peak among all the correct words. If the correct words occur at different frequencies given the prefix, the most frequent word will get the most probability (due to increasing its probability more often) whilst the rest of the correct words will get a fraction of that probability in proportion to their relative frequencies.

The most common way to train neural language models is not actually to use a training set of prefixes and next words, but to use full sentences. Upon being inputted with a sequence of vectors, an RNN will output another sequence of vectors. The nth output vector represents the prefix of the input sequence up to the nth input.

This means that we can present the neural network with a sentence, which gets turned into a sequence of vectors by the RNN, each of which gets turned into a probability distribution by the softmax. The training objective is to make the neural network predict which is the next word in the sentence after every prefix (including the end of sentence token at the end). This is done with all the sentences in the corpus so that the neural network is forced to extract some kind of pattern in the language of the sentences and be able to predict the next word in any sentence prefix.

Teacher forcing

If we always provide a correct prefix during training, then we are using a training method called "teacher forcing", where the neural network is only trained to deal with correct prefixes. This is the simplest method (and the method we will be using in this blog post) but it also introduces a bias to the neural network as it might not always be exposed to correct prefixes. Let's say that the neural network is going to be used to generate sentences by, for example, picking the most probable word given the prefix and then adding the chosen word to the end of the prefix. We can repeatedly do this until we choose the end of sentence token, at which point we should have a complete sentence. The problem with teacher forcing is that if the neural network makes one mistake during the generation process and picks a non-sense word as a most probable next word, then the rest of the sentence will probably also be garbage as it was never trained on sentences with mistakes.

One way to deal with this is to include not only prefixes in the training sentences by also prefixes with some of the words replaced by words chosen by the still-in-training neural network and force it to still give a higher probability to the correct next word. This is called scheduled sampling. Another way to deal with this is to take the training prefixes and some generated prefixes (from the still-in-training neural net) and take their vector representation from the RNN. Generative adversarial training will then be used to make the RNN represent both groups of vectors similarly. This forces the RNN to be fault tolerant to prefixes with errors and to be able to represent them in a way that can lead to correct next words. This is called professor forcing.

Full code listing

This is the full code that you can execute to get a Tensorflow neural language model:

from __future__ import absolute_import, division, print_function, unicode_literals
from builtins import ascii, bytes, chr, dict, filter, hex, input, int, map, next, oct, open, pow, range, round, str, super, zip

import tensorflow as tf
import numpy as np
import random
import timeit
import collections
import nltk

TRAINING_SESSION = True

rand = random.Random(0)
embed_size      = 256
state_size      = 256
max_epochs      = 100
minibatch_size  = 20
min_token_freq  = 3

run_start = timeit.default_timer()

all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]
rand.shuffle(all_seqs)
all_seqs = all_seqs[:20000]

trainingset_size = round(0.9*len(all_seqs))
validationset_size = len(all_seqs) - trainingset_size
train_seqs = all_seqs[:trainingset_size]
val_seqs = all_seqs[-validationset_size:]

print('Training set size:', trainingset_size)
print('Validation set size:', validationset_size)

all_tokens = (token for seq in train_seqs for token in seq)
token_freqs = collections.Counter(all_tokens)
vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))
while token_freqs[vocab[-1]] < min_token_freq:
vocab.pop()
vocab_size = len(vocab) + 2 # + edge and unknown tokens

token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }
index_to_token = { i+2: token for (i, token) in enumerate(vocab) }
edge_index = 0
unknown_index = 1

print('Vocabulary size:', vocab_size)

def parse(seqs):
indexes = list()
lens = list()
for seq in seqs:
indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]
indexes.append(indexes_)
lens.append(len(indexes_)+1) #add 1 due to edge token

maxlen = max(lens)

in_mat  = np.zeros((len(indexes), maxlen))
out_mat = np.zeros((len(indexes), maxlen))
for (row, indexes_) in enumerate(indexes):
in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_
out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]
return (in_mat, out_mat, np.array(lens))

(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)
(val_seqs_in,   val_seqs_out,   val_seqs_len)   = parse(val_seqs)

print('Training set max length:', train_seqs_in.shape[1]-1)
print('Validation set max length:', val_seqs_in.shape[1]-1)

################################################################
print()
print('Training...')

#Full correct sequence of token indexes with start token but without end token.
seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]

#Length of sequences in seq_in.
seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]
tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])

#Full correct sequence of token indexes without start token but with end token.
seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]
tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))

batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.
num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.

#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.

with tf.variable_scope('prefix_encoder'):
#Encode each sequence prefix into a vector.

#Embedding matrix for token vocabulary.
embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]

#3D tensor of tokens in sequences replaced with their corresponding embedding vector.
embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]

#Use an LSTM to encode the generated prefix.
init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))
cell = tf.contrib.rnn.BasicLSTMCell(state_size)
prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]

with tf.variable_scope('softmax'):
#Output a probability distribution over the token vocabulary (including the end token).

W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())
b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())
logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])
predictions = tf.nn.softmax(logits) #[seq, prefix, token]

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask
total_loss = tf.reduce_sum(losses)
saver = tf.train.Saver()

sess = tf.Session()

if TRAINING_SESSION:
sess.run(tf.global_variables_initializer())

print('epoch', 'val loss', 'duration', sep='\t')

epoch_start = timeit.default_timer()

validation_loss = 0
for i in range(len(val_seqs)//minibatch_size):
minibatch_validation_loss = sess.run(total_loss, feed_dict={
seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
})
validation_loss += minibatch_validation_loss

print(0, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')
last_validation_loss = validation_loss

saver.save(sess, './model')

trainingset_indexes = list(range(len(train_seqs)))
for epoch in range(1, max_epochs+1):
epoch_start = timeit.default_timer()

rand.shuffle(trainingset_indexes)
for i in range(len(trainingset_indexes)//minibatch_size):
minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
sess.run(train_step, feed_dict={
seq_in:     train_seqs_in [minibatch_indexes],
seq_len:    train_seqs_len[minibatch_indexes],
seq_target: train_seqs_out[minibatch_indexes],
})

validation_loss = 0
for i in range(len(val_seqs)//minibatch_size):
minibatch_validation_loss = sess.run(total_loss, feed_dict={
seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
})
validation_loss += minibatch_validation_loss

if validation_loss > last_validation_loss:
break
last_validation_loss = validation_loss

saver.save(sess, './model')

print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')

################################################################
print()
print('Evaluating...')

saver.restore(sess, tf.train.latest_checkpoint('.'))

def seq_prob(seq):
seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]
outputs = sess.run(predictions, feed_dict={
seq_in:  [ [ edge_index ] + seq_indexes ],
seq_len: [ 1+len(seq) ],
})[0]
probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]
return np.prod(probs)

print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))
print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))
print()

def next_tokens(prefix):
prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]
probs = sess.run(predictions, feed_dict={
seq_in:  [ [ edge_index ] + prefix_indexes ],
seq_len: [ 1+len(prefix) ],
})[0][-1]
token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))

print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])
print()

def greedy_gen():
prefix = []
for _ in range(100):
probs = sorted(next_tokens(prefix), reverse=True)
(_, next_token) = probs[0]
if next_token == '<unk>':
(_, next_token) = probs[1]
elif next_token == '<end>':
break
else:
prefix.append(next_token)
return prefix

print('Greedy generation:', ' '.join(greedy_gen()))
print()

def random_gen():
prefix = []
for _ in range(100):
probs = next_tokens(prefix)
(unk_prob, _) = probs[unknown_index]

r = rand.random() * (1.0 - unk_prob)
total = 0.0
for (prob, token) in probs:
if token != '<unk>':
total += prob
if total >= r:
break
if token == '<end>':
break
else:
prefix.append(token)
return prefix

print('Random generation:', ' '.join(random_gen()))
print()


Code explanation

This section explains snippets of the code.

Preprocessing

The first thing we need to do is create the training and validation sets. We will use the Brown corpus from NLTK as data. Since the purpose of this tutorial is to quickly train a neural language model on a normal computer, we will work with a subset of the corpus so that training will be manageable. We will only take sentences that are between 5 and 20 tokens long and only use a random sample of 20000 sentences from this pool. From this we will take a random 10% to be used for the validation set and the rest will be used for the training set.

all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]
rand.shuffle(all_seqs)
all_seqs = all_seqs[:20000]

trainingset_size = round(0.9*len(all_seqs))
validationset_size = len(all_seqs) - trainingset_size
train_seqs = all_seqs[:trainingset_size]
val_seqs = all_seqs[-validationset_size:]


Next we need to get the vocabulary from the training set. This will consist of all the words that occur frequently in the training set sentences, with the rare words being replaced by an "unknown" token. This will allow the neural network to be able to work with out-of-vocabulary words as they will be represented as "<unk>" and the network will ave seen this token in the training sentences. Each vocabulary word will be represented by an index, with "0" representing the beginning and end token, "1" representing the unknown token, "2" representing the most frequent vocabulary word, "3" the second most frequent word, and so on.

all_tokens = (token for seq in train_seqs for token in seq)
token_freqs = collections.Counter(all_tokens)
vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))
while token_freqs[vocab[-1]] < min_token_freq:
vocab.pop()
vocab_size = len(vocab) + 2 # + edge and unknown tokens

token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }
index_to_token = { i+2: token for (i, token) in enumerate(vocab) }
edge_index = 0
unknown_index = 1


Finally we need to turn all the sentences in the training and validation set into a matrix of indexes where each row represents a sentence. Since different sentences have different lengths, we will make the matrix as wide as the longest sentence and use 0 to pad each sentence into uniform length (pads are added to the end of the sentences). To know where a sentence ends we will also keep track of the sentence lengths. For training we will need to use an input matrix and a target matrix. The input matrix contains sentences that start with the start-of-sentence token in every row whilst the target matrix contains sentences that end with the end-of-sentence token in every row. The former is used to pass as input to the neural net during training whilst the latter is used to tell which is the correct output of each sentence prefix.

def parse(seqs):
indexes = list()
lens = list()
for seq in seqs:
indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]
indexes.append(indexes_)
lens.append(len(indexes_)+1) #add 1 due to edge token

maxlen = max(lens)

in_mat  = np.zeros((len(indexes), maxlen))
out_mat = np.zeros((len(indexes), maxlen))
for (row, indexes_) in enumerate(indexes):
in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_
out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]
return (in_mat, out_mat, np.array(lens))

(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)
(val_seqs_in,   val_seqs_out,   val_seqs_len)   = parse(val_seqs)


Model definition

The Tensorflow neural network model we shall implement will accept a batch of sentences at once. This means that the input will be a matrix of integers where each row is a sentence and each integer is a word index. Since both the number of sentences and the sentence length are variable we will use "None" as a dimension size. We will then get the size using the "tf.shape" function. The sentences on their own are not enough as an input as we also need to provide the sentence lengths as a vector. The length of this vector needs to be as long as the number of rows in the sequences (one for each sentence). These lengths will be used to generate a mask matrix of 0s and 1s where a "1" indicates the presence of a token in the sequence matrix whilst a "0" indicates the presence of a pad token. This is generated by the "tf.sequence_mask" function.

#Full correct sequence of token indexes with start token but without end token.
seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]

#Length of sequences in seq_in.
seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]
tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])

#Full correct sequence of token indexes without start token but with end token.
seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]
tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))

batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.
num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.

#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.


Now that the inputs are handled we need to process the prefixes of the sequences into prefix vectors using an LSTM. We first convert the token indexes into token vectors. This is done using an embedding matrix where the first row of the matrix is the word vector of the first word in the vocabulary, the second for the second word, and so on. This is then passed to an LSTM that is initialized with zero-vectors for both the cell state and the hidden state. We pass in the sequence lengths to keep the RNN from interpreting pad values.

with tf.variable_scope('prefix_encoder'):
#Encode each sequence prefix into a vector.

#Embedding matrix for token vocabulary.
embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]

#3D tensor of tokens in sequences replaced with their corresponding embedding vector.
embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]

#Use an LSTM to encode the generated prefix.
init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))
cell = tf.contrib.rnn.BasicLSTMCell(state_size)
prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]


Next we need to take the prefix vectors and derive probabilities for the next word in every prefix. Since the prefix vectors are a 3D tensor and the weights matrix is a 2D tensor we have to first reshape the prefix vectors into a 2D tensor before multiplying them together. Following this we will then reshape the resulting 2D tensor back into a 3D tensor and apply softmax to it.

with tf.variable_scope('softmax'):
#Output a probability distribution over the token vocabulary (including the end token).

W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())
b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())
logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])
predictions = tf.nn.softmax(logits) #[seq, prefix, token]


Training

Now comes the part that has to do with training the model. We need to add the loss function which is masked to ignore padding tokens. We will take the sum crossentropy and apply the Adam optimizer to it. Crossentropy measures how close to 1.0 the probability of the correct next word in the softmax is. This allows us to maximize the correct probability which is done by using Adam, an optimization technique that works using gradient descent but with the learning rate being adapted for every weight. We would also like to be able to save our model weights during training in order to reuse them later. Finally we create a Tensorflow session variable and use it to initialize all the model weights.

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask
total_loss = tf.reduce_sum(losses)
saver = tf.train.Saver()

sess = tf.Session()

sess.run(tf.global_variables_initializer())


The second thing we should do is measure how well the randomly initialised neural net performs in terms of the sum crossentropy. In order to avoid running out of memory, instead of putting all the validation set data as input to the neural net we will break it up into minibatches of a fixed size and find the total loss of all the minibatches together. This final loss value will be placed in a variable called "last_validation_loss" in order to be able to track how the loss is progressing as training goes on. The last line is to save the weights of the neural network in the same folder as the code and give the files (there will be several files with different extensions) the file name "model".

validation_loss = 0
for i in range(len(val_seqs)//minibatch_size):
minibatch_validation_loss = sess.run(total_loss, feed_dict={
seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
})
validation_loss += minibatch_validation_loss

last_validation_loss = validation_loss

saver.save(sess, './model')


Next we'll do the same thing but on the training set and we'll run the "train_step" operation instead of the "total_loss" operation in order to actually optimize the weights into a smaller "total_loss" value. It is more beneficial to take randomly sampled minibatches instead of just breaking the training set into deterministically chosen groups, so we use an array of indexes called "trainingset_indexes" to determine which training pairs will make it to the next minibatch. We randomly shuffle these indexes and then break it into fixed size groups. The indexes in the next group are used to choose the training pairs are used for the next minibatch. Following this we will again calculate the new loss value on the validation set to see how we're progressing. If the new loss is worse than the previous loss then we stop the training. Otherwise we save the model weights and continue training. This is called early stopping. There is a hard limit set to the number of epochs to run in order to avoid training for too long.

trainingset_indexes = list(range(len(train_seqs)))
for epoch in range(1, max_epochs+1):
rand.shuffle(trainingset_indexes)
for i in range(len(trainingset_indexes)//minibatch_size):
minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]
sess.run(train_step, feed_dict={
seq_in:     train_seqs_in [minibatch_indexes],
seq_len:    train_seqs_len[minibatch_indexes],
seq_target: train_seqs_out[minibatch_indexes],
})

validation_loss = 0
for i in range(len(val_seqs)//minibatch_size):
minibatch_validation_loss = sess.run(total_loss, feed_dict={
seq_in:     val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],
seq_len:    val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],
seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],
})
validation_loss += minibatch_validation_loss

if validation_loss > last_validation_loss:
break
last_validation_loss = validation_loss

saver.save(sess, './model')


Application

We can now use the trained neural network. We first restore the last saved model weights which are the ones that gave the best validation loss and then we will define two functions: one for getting the probability of a whole sequence and the other for getting the next token after a prefix. "seq_prob" works by getting every prefix's softmax output, getting the probability of each token in the sequence after each prefix and then multiplying them together. "next_tokens" works by passing a prefix to the neural network and only getting the softmax output of the last (longest) prefix. The probabilities and corresponding tokens are returned.

saver.restore(sess, tf.train.latest_checkpoint('.'))

def seq_prob(seq):
seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]
outputs = sess.run(predictions, feed_dict={
seq_in:  [ [ edge_index ] + seq_indexes ],
seq_len: [ 1+len(seq) ],
})[0]
probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]
return np.prod(probs)

print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))
print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))
print()

def next_tokens(prefix):
prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]
probs = sess.run(predictions, feed_dict={
seq_in:  [ [ edge_index ] + prefix_indexes ],
seq_len: [ 1+len(prefix) ],
})[0][-1]
token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))

print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])
print()


These are the outputs I got:

P(the dog barked.) = 1.71368e-08
P(the cat barked.) = 6.16375e-10

the dog ... [(0.097657956, 'was'), (0.089791521, '<unk>'), (0.058101207, 'is'), (0.055007596, 'had'), (0.02786131, 'could')]


We can extend "next_tokens" to generate whole sentences using one of two ways: generating the most probable sentence or generating a randomly sampled sentence. For the first we are going to use greedy search which chooses the most probable word given a prefix and adds it to the prefix. This will not give the most probable sentence but it should be close (use beam search for a better selection method). For the second function we want to choose words at random but based on their probability such that rare words are rarely chosen. This is called sampling sentences (the probability of sampling a particular sentence is equal to the probability of the sentence). We did this using roulette selection. For both functions we left out the unknown token during generation and gave a hard maximum word limit of 100 words.

def greedy_gen():
prefix = []
for _ in range(100):
probs = sorted(next_tokens(prefix), reverse=True)
(_, next_token) = probs[0]
if next_token == '<unk>':
(_, next_token) = probs[1]
elif next_token == '<end>':
break
else:
prefix.append(next_token)
return prefix

print('Greedy generation:', ' '.join(greedy_gen()))
print()

def random_gen():
prefix = []
for _ in range(100):
probs = next_tokens(prefix)
(unk_prob, _) = probs[unknown_index]

r = rand.random() * (1.0 - unk_prob)
total = 0.0
for (prob, token) in probs:
if token != '<unk>':
total += prob
if total >= r:
break
if token == '<end>':
break
else:
prefix.append(token)
return prefix

print('Random generation:', ' '.join(random_gen()))
print()


These are the outputs I got:

Greedy generation:  i don't know '' .

Random generation: miss the road place , title comes to the seeds of others to many and details under the dominant home


## Monday, April 3, 2017

### Getting the top n most probable sentences using beam search

This is a continuation from a previous blog post on single sentence beam search.

Sometimes it is not enough to just generate the most probable sentence using a language model. Sometimes you want to generate the top 3 most probable sentences instead. In that case we need to modify our beam search a bit. We will make the function a generator that returns a sequence of sentences in order of probability instead of just returning a single most probable sentence. Here are the changes we need to make:

In the single sentence version, we were getting the most probable prefix in the current beam and checking if it is complete. If it is, then we return it and stop there. Instead, we will now not stop until the current beam is empty (or until the caller stops requesting for more sentences). After returning the most probable prefix we will check the second most probable prefix and keep on returning complete prefixes until we either find one which is not complete or we return all the beam. In the case that we return the whole beam then the algorithm stops there as there is nothing left with which to generate new prefixes. This means that the beam width gives a limit on the number of sentences that can be returned. If we do not return all the beam then we continue generating prefixes with the remainder.

In the case that some complete sentences were returned, they need to also be removed from the beam before we continue generating. Since the beam is implemented as a min-first heap queue (min-first because we want to pop the least probable prefix quickly when the beam becomes bigger than the beam width) then we cannot remove the highest probable complete sentence quickly as well. In order to do this, we first turn the heap into a list which is sorted by probability and then start popping out the items at the end if they are complete sentences. Following this, we will then heapify the list back into a min-first heap queue and continue as normal. This sorting and reheapifying should not impact on the performance too much if the beam width is relatively small.

If the clip length is reached then the whole beam is immediately returned in order of probability. This is because as soon as one prefix is equal to the allowed maximum then that means that the entire beam consists of
1. incomplete sentences that are also as long as the allowed maximum (since all the prefixes grow together)
2. complete sentences that were found before but which do not have a maximum probability
After returning all the incomplete sentences (they have to be returned since they cannot be extended further) then the complete sentences will also be returned as they will become the most probable sentences of what is left in the beam. The assumption is that if the complete sentence was a nonsensical one, then it wouldn't have remained in the beam so might as well return it as well rather than lose it.

Here is the modified Python 3 code:

import heapq

class Beam(object):

def __init__(self, beam_width, init_beam=None):
if init_beam is None:
self.heap = list()
else:
self.heap = init_beam
heapq.heapify(self.heap) #make the list a heap
self.beam_width = beam_width

heapq.heappush(self.heap, (prob, complete, prefix))
if len(self.heap) > self.beam_width:
heapq.heappop(self.heap)

def __iter__(self):
return iter(self.heap)

def beamsearch(probabilities_function, beam_width=10, clip_len=-1):
prev_beam = Beam(beam_width)

while True:
curr_beam = Beam(beam_width)

#Add complete sentences that do not yet have the best
probability to the current beam, the rest prepare to add more words to
them.
for (prefix_prob, complete, prefix) in prev_beam:
if complete == True:
else:
#Get probability of each possible next word for the
incomplete prefix.
for (next_prob, next_word) in probabilities_function(prefix):
if next_word == '<end>': #if next word is the end
token then mark prefix as complete and leave out the end token
else: #if next word is a non-end token then mark
prefix as incomplete
prefix+[next_word])

sorted_beam = sorted(curr_beam) #get all prefixes in current beam sorted by probability
any_removals = False
while True:
(best_prob, best_complete, best_prefix) = sorted_beam[-1] #get highest probability prefix
if best_complete == True or len(best_prefix)-1 ==
clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then yield it
yield (best_prefix[1:], best_prob) #yield best sentence without the start token and together with its probability
sorted_beam.pop() #remove the yielded sentence and check the next highest probability prefix
any_removals = True
if len(sorted_beam) == 0: #if there are no more sentences in the beam then stop checking
break
else:
break

if any_removals == True: #if there were any changes in the current beam then...
if len(sorted_beam) == 0: #if there are no more prefixes in the current beam (due to clip length being reached) then end the beam search
break
else: #otherwise set the previous beam to the modified current beam
prev_beam = Beam(beam_width, sorted_beam)
else: #if the current beam was left unmodified then assign it to the previous beam as is
prev_beam = curr_beam