Thursday, March 29, 2018

Word Mover's Distance (a text semantic similarity measure)

Word Mover's Distance is a way to measure the similarity between the meaning of two sentences. It's basically a bag of words method (ignores the order of the words in sentences) but it's been found to work really well, including for evaluating automatic caption generation. You can find code to perform k-nearest neighbours classification of sentiment written by the creator of WMD here, but reading the code is a challenge so here's how WMD works.

This measure makes use of an existing distance measure called Earth Mover's Distance. Given a set of heaps of dirt and a set of holes, and both the heaps and the holes have a point as a location (a vector) and a certain mass or capacity (a weight), what is the least amount of work needed to move all the dirt into the holes or to fill all the holes completely (depending on whether there is more total mass or total capacity)? By work we mean the total distance travelled multiplied by the mass carried. This is what EMD calculates. It measures the distance between a set of vectors paired with weights and another set of vectors paired with weights, without requiring that both sets have the same number of vectors. It is a linear programming problem and there is a Python library for it called "pyemd" that can be installed via pip.

Now back to Word Mover's Distance. WMD uses EMD between two sentences. It assumes some kind of pre-computed word embedding vectors such as word2vec, which are vectors that encode the meaning of words. The words in the sentences are the heaps and holes and these word vectors are their locations. For weights, we use the normalized frequency of the words in the sentence, that is, the number of times the word was found in the sentence divided by the total number of words in the sentence. We also ignore any words that are stop words (non-informative common words like "the" and "of") and any words that are not found in the list of word vectors. Once we have collected the set of vectors and weights of the two sentences we then just put them in the EMD solver to get the distance.

The Python library "gensim" implements WMD given a word2vec dataset and two lists of words (the sentences) without stop words. You can find a tutorial on how to use it here.

Monday, February 26, 2018

Finding the Greatest Common Divisor (GCD) / Highest Common Factor (HCF) of two numbers

The greatest common divisor or highest common factor of two numbers can be found using a very simple algorithm described since the time of Euclid, known as the Euclidean algorithm. The nice thing about it is that it can be computed visually by drawing squares in a rectangle as follows.

Let's say that you have a line segment of length 6 units and a smaller line segment of length 2 units.

We can check if the smaller line's length is a divisor of the larger line's length if we can place several of the smaller lines next to each other and reach the exact length of the larger line.

Likewise we can check if a number is a common divisor of two other numbers by checking if a square can fill exactly a rectangle.

In the above diagram, the big blue rectangle has sides of 6 units by 4 units and the small red squares have sides 2 units by 2 units. Since the red square can exactly fill the blue rectangle by placing copies of it next to each other, 2 is a common divisor of 4 and 6.

The greatest common divisor is the largest square that can exactly fill a rectangle. The largest square that can do this has its side equal to the smallest side of the rectangle (otherwise not even one square will fit in the rectangle) whilst the smallest square has its side equal to 1 unit (assuming that sides are always whole number lengths). Let's take the above rectangle as an example and use the Euclidean algorithm to find the largest square that fits a 6 by 4 rectangle.

The Euclidean algorithm takes advantage of the fact that the greatest common divisor of two numbers is also a divisor of the difference of the two numbers. If two numbers "a" and "b" have "c" as a common divisor, then they can be rewritten as "c×p" and "c×q". The difference "a-b" is then equal to "c×p - c×q" which is equal to "c×(p - q)" which is therefore a divisor of "c" since one of its factors is "c". Notice that "c" can be any divisor of "a" and "b", which means that "a-b" should also have the greatest common divisor of "a" and "b" as a divisor.

This means that the largest square to exactly fill the original 4 by 6 square will also be the largest square to exactly fill a 2 by 4 rectangle, since 2 is the difference between 4 and 6.

This piece of information allows us to substitute the difficult problem of finding the largest square that exactly fills a rectangle with an easier problem of finding the same square but on a smaller rectangle. The smaller the rectangle, the easier it is to find the largest square to exactly fill it and the Euclidean algorithm allows us to progressively chop off parts of the rectangle until all that remains is a small square which would be the square that exactly fills the original rectangle.

So the algorithm goes as follows:

  1. If the rectangle has both sides equal then it itself is the square that is the largest square that exactly fills the rectangle and we are done.
  2. Otherwise put a square inside the rectangle of length equal to the shortest side of the rectangle.
  3. Chop off the part of the rectangle that is covered by the square and repeat.

The 6 by 4 rectangle is trivial so let's try this algorithm on a slightly harder problem of 217 by 124. This time we'll be using exact pixels to represent the rectangle lengths.

This is a 217 by 124 rectangle.

This is the overlap it has with a square of length equal to its shortest side.

We now chop off that part of the rectangle and find the largest square which fits the remaining 93 by 124 rectangle.

This is the new square's overlap.

This is the remaining 93 by 31 rectangle.

This is the new square's overlap.

This is the remaining 62 by 31 rectangle.

This is the new square's overlap.

This is the remaining 31 by 31 rectangle.

We have finally reached a square of side 31, which means that 31 is the greatest common divisor of 217 and 124.

Of course we don't really need to visualise anything. We can do this all numerically by just repeatedly subtracting the smallest number from the biggest and replacing the biggest with the difference. Here's a Python function that does that:

def gcd(a, b):
    while a != b:
        if a < b:
            b = b - a
            a = a - b
    return a

We can make this shorter by using pattern matching:

def gcd(a, b):
    while a != b:
        (a, b) = (min(a,b), max(a,b)-min(a,b))
    return a

If one of the numbers is much bigger than the other, this program will waste a lot of time repeatedly subtracting the same number until the bigger number becomes smaller than the other. Fortunately this process can be done in a single operation by dividing and getting the remainder. The remainder is what's left after you can't subtract the second number from the first any more. This speeds the algorithm up considerably. The operation is called the modulo operation. With modulo the subtractions will not stop when both numbers are equal as modulo assumes that you can do one more subtraction and make the first number zero. Instead we will now keep on looping until the remainder is zero, in which case the last number you used to get the remainder was the greatest common divisor:

def gcd(a, b):
    while b > 0:
        (a, b) = (min(a,b), max(a,b)%min(a,b))
    return a

Tuesday, January 30, 2018

A quick guide to running deep learning applications on Amazon AWS virtual server supercomputer

Deep learning is serious business and if you want to work on sizeable problems you're going to need more hardware than you probably have at home. Here is how to use Amazon Web Services in order to be able to upload your code and datasets to an online server which will run it for you and then let you download the results. It's basically a virtual computer that you use over the Internet by sending commands.

First of all, when I got started I was initially following this video on how to use AWS:

AWS Educate

If you're a student then you can take advantage of AWS Educate where if you are accepted you will receive 100 free hours of runtime to get started. 100 hours in deep learning will not last long but it will allow you to experiment and get your bearings with AWS without worrying about the bill too much. It will also be a few days before you get a reply. Here's the link:

Get the applications you need

If you're on Windows, then before creating your server you should install WinSCP and PuTTY in order to be able to upload your data and run your programs. If you're not on Windows then you can use the terminal with ssh to do this.

Create an account

Start by visiting this link and making an account:

You need to provide your credit card details before starting. You will be charged automatically every month so make sure to keep an eye on your usage as you pay for the amount of time you let the server run per hour.

Enter the AWS console

Next go into the AWS console which is where you get to manage everything that has to do with virtual servers:

Note that there is the name of a city at the top such as "Ohio". This is to say where you want your servers to be in. Amazon has servers all around the world and you might prefer one region over another, mostly to reduce latency. Different regions also have different prices, so that might take priority in your decision. Ohio and W. Virginia seem to be the cheapest. See here for more information:

Keep in mind that each region has its own interface so that if you reserve a server in one region, you can only configure your server when that region is selected. You will not see a list of all your servers in any region. So make sure you remember which regions you use.

After having chosen a region, next go to Services and click on EC2:

Create a virtual server

Click on the big blue button called "Launch Instance" in order to create your virtual server. You can now see a list of AMIs which are preconfigured virtual servers that are specialized for some kind of task, such as deep learning. You're going to copy an instance of one of these and then upload your files to it. Look for the AMI called "Deep Learning AMI (Ubuntu)" which contains a bunch of deep learning libraries together with CUDA drivers. Click on "Select".

This is where you choose the computing power you want. The better it is, the more expensive. If you just want to see how it works then choose a free one which says "Free tier eligible". If you want to get down to business then choose one of the "GPU Compute" instances such as "p2.xlarge" which has 12GB of GPU memory. The "p2.xlarge" costs about 90 cents per hour (which starts from the moment you create the instance, not when you start running your programs so it also includes the time it takes to upload your data).

If this is your first time creating a powerful instance then you will first need to ask Amazon to let you use it (this is something they do to avoid people hogging all the resources). Under "Regarding" choose "Service Limit Increase", under "Limit Type" choose "EC2 Instances", and under "Region" choose the region you selected previously. You also have to say something about how you'll use it. After being granted access you can then continue from where we left off.

After ticking the check box next to your selected computer power, click on "Next: Configure Instance Details".

Leave this next step with default settings. Click on "Next: Add Storage".

This is where you choose your storage space. You will need at least 50GB of space in order to store the deep learning AMI but you will need additional space for your datasets and results. Keep in mind that you have to pay per GB per month for storage. If you need frequent access to the hard drive (such as loading minibatches from disk) then you'll need to use an SSD drive which costs about 10 cents per GB per month. Otherwise you can use a magnetic drive which costs about 5 cents per GB per month.

Next click on "Next: Add Tags". This is where you make up some informative tags for your virtual server in order to differentiate it from other virtual servers. You can do something like "Name=Deep learning". If you only have one server then this is not important. Click on "Next: Configure Security Group".

This is where you create some firewall rules to make sure that only your computer has access to the virtual server, even if someone else knows the password. It might be the case that this doesn't work for you and you won't be able to connect at all even from your IP address in which case choose "Anywhere" as a source which will not make any restricts based on IP. Click "Review and Launch".

As soon as you click on "Launch" at the bottom the clock starts ticking and you will start being billed. If you've got the AWS Educate package then you will have 100 free hours but they will start being used up. You can stop an instance any time you want but you will be billed for one hour as soon as it starts, even if you stop it immediately.

If this is your first time using a server in the selected region (the place you want your server to be in) then you will be asked to create a cryptographic private key which is a file that will be used instead of a password to connect to the virtual server. If you're on Windows then you need to use a program that comes with PuTTY called PuTTYgen which converts the .pem file that Amazon sends you to a .ppk file that PuTTY can use. Follow the section called "To convert your private key" in the following link to do this:

Connecting to your virtual server

Now that we have our server ready we need to connect to it, upload our stuff, and get it to run. We're also on the clock so we need to hurry. Start by visiting your list of server instances, which can be found in the side bar when you're in the EC2 dashboard:

Your server should be running. You can stop it by right clicking it and under "Instance state" choosing "Stop". This will stop it from charging you every hour but you will be charged for at least one hour as soon as you start.

Clicking on a server and then clicking on "Connect" will give you information to access the server using PuTTY or ssh.

If you're using Windows, you connect to your server using PuTTY, which you can find out how using this guide:
After connecting using PuTTY you can transfer the configuration to WinSCP by opening WinSCP, going on Tools, and choosing "Import sites". Now you can connect using WinSCP in order to upload and download files as well.

If you're not using Windows then you should use the terminal and follow this guide instead:
You can upload stuff from Linux by using FileZilla or by using the normal directory explorer and clicking CTRL+L to enter an FTP location.

Upload all the files you need including datasets and scripts. You can zip the files before uploading them and then unzip them on the server using the "unzip" Linux command.

DO NOT INSTALL LIBRARIES WITH PIP YET! See the next section first.

Using your virtual server

Finally we are close to start running our scripts. But we still need to do two more things first.

First of all, look at the top of the PuTTY or terminal which tells you how to activate different Anaconda environments. These are Python environments with different libraries available which are connected to CUDA so that you will be able to run on the GPU. In PuTTY or terminal enter "source activate <environment name>". Remember this line to use later.

Now you can install anything you want using pip.

Secondly as soon as you close PuTTY or terminal all running processes will stop (but the instance will still be running so you'll still be paying). What you need to do is to use the application called "screen" which will keep everything running on its own. See this link:

Now you'll need to activate the environment again because screen creates a new session which is disconnected from the previous one.


You're done! You can now start using your server. When you're ready, make sure to stop it and you can even terminate it but that will completely delete the server with all data so only do it when you're really ready, otherwise you'll be wasting money reuploading and rerunning everything.

You can see your current bill by clicking on Services and going on Bill. This will show you your current month's bill as well as previous bills as well. You can ever get a daily break down and forecasts.

Friday, December 29, 2017

An introduction to the Long Short Term Memory (LSTM) RNN and Gated Recurrent Unit (GRU)

In my previous post I talked about the recurrent neural network and said that RNNs are used to encode a sequence of vectors into a single vector. The way it works is that the RNN has a memory, which we call a "state", and this memory gets updated each time a new input is introduced to the RNN. Even if the same input is shown twice, the memory can still be changed significantly because the new state is determined by both the new input and the previous state. After all the inputs have been shown to the RNN, the final state will be a vector that represents everything that the RNN has seen. Of course the RNN cannot remember everything in perfect detail since the state vector has a fixed size whilst the sequence of inputs can be arbitrarily long. This means that the final state will actually be a summary of the whole sequence which only "remembers" important or salient features in the sequence. This process has been likened to the short term memory in animals.

The simple RNN described in the previous post, when using sigmoid or hyperbolic tangent as an activation function, tends to suffer from "vanishing gradients" when trained to work with long sequences. What does it mean to have vanishing gradients?

In order to train your RNN you need to find the gradient of the error with respect to each weight in order to know whether to increase or decrease the weight. Now let's assume that we're using $tanh$ as an activation function for computing the next state. The function that turns an input and a state into a new state is something like the following:

$s^1 = tanh(i^1 w_i + s^0 w_s)$

We're leaving out the bias term in order to keep things short. If you want to find the gradient of $s^1$ with respect to $w_i$ (for a sequence length one), that gradient will be

$\frac{ds^1}{dw_i} = (1 - tanh^2(i^1 w_i + s^0 w_s))(i^1)$

Notice how the gradient involves a multiplication with $tanh^2$ of the input. $tanh$ gives a number between 1 and -1 which is a fraction and squaring a fraction makes it smaller. So the gradient might be a pretty small number.

What happens when we have a sequence of two inputs?

$s^2 = tanh(i^2 w_i + tanh(i^1 w_i + s^0 w_s) w_s)$

The gradient will be:

$\frac{ds^2}{dw_i} = (1 - tanh^2(\dots))(i^2 + (1 - tanh^2(\dots))(i^1))$

Notice how now we've multiplied the already small number produced by $tanh^2$ by another number produced by $tanh^2$ before getting to our first input $i^1$. Multiplying a fraction by another fraction results in a fraction that's smaller than both fractions. This means that $i^1$ will contribute significantly less to the gradient than $i^2$. If we have a sequence of 3 inputs then the first input would require 3 fractional multiplications which makes it even smaller. This will eventually make the gradient negligible (it vanishes) and hence result in the RNN only learning to reduce the error with respect to recent inputs only, completely forgetting inputs towards the beginning of the sequence.

One solution for this problem is to use activation functions which do not result in fractional multiplications for each time step, such as the rectified linear unit function (ReLU). $ReLU$ is a function that returns numbers unchanged if they're positive whilst replacing negative numbers with zero:

$ReLU(x) = max(x, 0)$

The gradient of this function is either 1 (if $x$ is positive) or 0 (if $x$ is negative). In mathematical terms this is equivalent to the indicator function $1_{x>0}$ which gives 1 when the subscript condition is met and 0 otherwise. So if we replace our $tanh$ with $ReLU$ for our previous equation we'll get the following:

$s^2 = ReLU(i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s)$
$\frac{ds^2}{dw_i} = 1_{i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s > 0}(i^2 + 1_{i^1 w_i + s^0 w_s > 0}(i^1))$

You might be worried about how a single failed condition in the indicator functions will result in the making all previous time steps being multiplied by 0 which will vanish them completely. But keep in mind that $i$ and $s$ are not single numbers but vectors of numbers, and that the $ReLU$ function will work on each number in the vectors independently. So whilst some numbers in the vectors will be negative, others will not. With large enough vectors it is unlikely that an entire vector will consist of just negative numbers so you're likely to have a number of cases with only multiplications by 1 (never 0) which will preserve at least parts of the vectors of early inputs.

Although simple RNNs with ReLUs have been reported to give good results in some papers such as this, a more complex form of RNN called a long short term memory (LSTM) RNN, which was designed to reduce the problem of vanishing gradients, is much more popular. According to one of the authors of the LSTM (Jurgen Schmidhuber), its name means that it is an RNN with a short term memory that is long.

The idea is to make the state pass through to the next state without going through an activation function. You pass the input through an activation function, but the previous state is just added to it linearly. So instead of having this state update:

$s^1 = tanh(i^1 w_i + s^0 w_s)$

the LSTM has this state update:

$s^1 = tanh(i^1 w_i) + s^0$

This changes how the gradient changes as the sequence gets longer because now we don't have nested activation functions any more. Instead this is what you end up with when you add another time step:

$s^2 = tanh(i^2 w_i) + tanh(i^1 w_i) + s^0$

This equation perfectly preserves the gradients as it gets longer since all you're doing is adding another term but it also loses a lot of its expressive power since the order in which you present the items in a sequence doesn't matter any more. You'll get the same state vector regardless of how you shuffle the sequence.

What we need is a mixture of the expressiveness of the simple RNN mentioned in the beginning and gradient preservation of this new function. The solution is to have two state vectors: one for expressiveness called the "hidden state" $h$ and one for gradient preservation called the "cell state" $c$. Here is the new equation:

$c^1 = tanh(i^1 w_i + h^0 w_h) + c^0$
$h^1 = tanh(c^1)$

Notice the following things:
  • The order of the inputs matters now because each input is combined with a different hidden state vector.
  • The hidden state vector is derived from the cell state vector by just passing the cell state through a $tanh$.
  • Even though the hidden state is derived from the cell state, $h^0$ and $c^0$ are separate constants and $h^0$ is not equal to $tanh(c^0)$.

Let's derive the equation for the final state of a length 2 sequence step by step:

$c^2 = tanh(i^2 w_i + h^1 w_h) + c^1$
$h^2 = tanh(c^2)$

$c^2 = tanh(i^2 w_i + tanh(c^1) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$

$c^2 = tanh(i^2 w_i + tanh(tanh(i^1 w_i + h^0 w_h)) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$

You can see that what is happening is that you end up with a separate term for each input where the input of that term is just inside a $tanh$ and no deeper. On the other hand each term also consists of all the previous inputs but at much lower levels of influence since each previous input is within an additional two nested $tanh$ functions the further back in the sequence it is found. This allows for a distinction in the order of the inputs whilst keeping each term focused on one particular input.

The creators of the LSTM didn't stop there however. They also introduced gating functions in the LSTM. A gating function is basically a single layer sub neural net that outputs a fraction between 0 and 1 using a sigmoid. This number is then multiplied by another number in order to either leave the second number unchanged or turn it to zero. In other words, if the gate is 1 then it allows the second number to pass whilst if it is 0 then it blocks the second number (and gates in between will make the number smaller). Originally there were two gates: one for the input terms $tanh(i^1 w_i + h^0 w_h)$ and one for the new hidden state $tanh(c^1)$. The gates control whether to accept the new input (the input gate) and whether to allow previous information to be represented by the hidden state (the output gate). Later, another paper introduced a forget gate that regulates the cell state as well. All of these gates are controlled by sub neural nets that take a decision based on the current input and the previous hidden state.

Here is what the final LSTM equation looks like:

$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$
$g_i = sig(i^{t} w_{g_{i}i} + h^{t-1} w_{g_{i}h} + b_{g_i})$
$g_o = sig(i^{t} w_{g_{o}i} + h^{t-1} w_{g_{o}h} + b_{g_o})$
$c^{t} = g_i \odot tanh(i^{t} w_{ci} + h^{t-1} w_{ch} + b_{c}) + g_f \odot c^{t-1}$
$h^{t} = g_o \odot tanh(c^{t})$

where $g_f$, $g_i$, and $g_o$ are the forget gate, input gate, and output gate respectively. $w_{g_{f}i}$ means weight for the input being used in the calculation of $g_f$. $b$ are the bias terms of the sub neural nets (we didn't use biases in the previous equations to keep things short). $\odot$ is the element-wise product of two vectors and $sig$ stands for sigmoid.

Here is a diagram illustrating the different components of the LSTM with gating function application (element-wise multiplication) being represented by triangles:

A little while before, you might have been a little sceptical about using two different state vectors in order to combine the expressivity of the simple RNN with the gradient preservation of the addition based RNN described above. Can't we just use the cell state in the $tanh$ as well? In fact there was later a paper describing another kind of RNN called a gated recurrent unit (GRU) which does just that. Instead of using one state for differentiating between time steps and another for flowing previous states into later calculations, only one state is used. Gating functions are still used, except that the input gate is replaced with 1 minus the forget gate. Basically it finds a compromise between either forgetting previous states and focussing everything on the new input or ignoring the new input and preserving the previous states. Finally, instead of having an output gate we now have a reset gate which is used to control how much the state should contribute to the input term. Here is what the GRU equation looks like:

$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$
$g_i = 1 - g_f$
$g_r = sig(i^{t} w_{g_{r}i} + h^{t-1} w_{g_{r}h} + b_{g_r})$
$h^{t} = g_i \odot tanh(i^{t} w_{hi} + w_{hh}(g_r \odot h^{t-1}) + b_{h}) + g_f \odot h^{t-1}$

And here is the diagram:

Wednesday, November 29, 2017

An introduction to recurrent neural networks: Simple RNNs

Traditional feed forward neural networks work by taking a single fixed-length vector of inputs and producing a single fixed-length vector of outputs. After being trained on a training set, the neural network should be able to not only map inputs in the training set to their correct outputs, but also do so with new unseen inputs. The network is able to generalise to new inputs, but the new inputs must be of the same size. The network is not to generalise across complexity. For example if you train the network to perform addition on 4 digit numbers, it will not be able to perform addition on 5 digit numbers or even 3 digit numbers. Likewise if it learns to understand 5 word sentences then it will not be able to do anything with 6 word sentences. With images we usually solve this by resizing differently sized images into a standard size. This is not as straight forward to do with things like sentences. This is where recurrent neural networks come in.

Recurrent neural networks (RNNs) are used to give a neural network a short term memory which is used to be able to read a sequence of inputs and remember just a summary of what is important in the sequence. This summary, which would be a fixed-length vector called a state, can then be used by the rest of the neural network as usual. It can be used to predict the next word in a partial sentence or to determine the sentiment of a sentence. A simple RNN is a feed forward neural network where neurons in a layer have extra connections that loop around to the same layer as shown below:

The figure shows a neural network consisting of 2 inputs and a state of neurons. The red connections allow each state neuron to produce an output based on a combination of the input neurons and the state neurons themselves. In other words the next state vector is produced based on a combination of the current input and previous state. This is the basis of short-term memory and the result is that after being exposed to a number of inputs, the state will be a vector that is influenced by each of the input vectors. The point is to train the neural network to remember what is important according to the task at hand. This means that we also need to use the final state (after processing all inputs) to generate an output (the state itself is not usually a useful output) which will make the RNN learn a useful representation of the input sequence.

How are the inputs and the recurrent connections combined together? By concatenating the input and state vectors together and then passing them through a weight matrix as usual, generating the next state vector.

But what happens for the first input? What's the first input going to concatenated with if there is no previous state? We have to define a default initial state for the first input. This is usually the all zeros vector but you can instead learn a constant vector that gets optimized during training.

Great, so that's all the basics sorted. Now for the formal notation. It is common to use the terminology of time series when talking about RNNs such that each input in a sequence belongs to a different time step in a series. In our formal notation, let's use superscripts to refer to time steps such that the first time step is 1 and the number of time steps is $T$.

$$s^0 = \mathbf{0}$$
$$s^t = f_s([s^{t-1} i^t] W_s + b_s)$$
$$o = f_o(s^T W_o + b_o)$$

where $s^t$ is the state vector at time $t$, $i^t$ is the input vector at time $t$, $o$ is the output vector, $\mathbf{0}$ is the all zeros vector, $f_s$ and $f_o$ are the activation functions of the state vector and output vector respectively, $W_s$ $b_s$ $W_o$ and $b_o$ are the weights and biases of the state vector and output vector respectively.

The question is how to learn the parameters of a recurrent function, which is not as single as with a feed forward neural network. The first thing we need to do is to unroll the recurrent network into a linear network that reuses the same parameters throughout. Although an RNN can be unrolled to infinity it is also the case that you only need to unroll it as far as your input sequences require. So if your training set contains an input sequence that is 3 time steps long, then you can use an RNN that is unrolled 3 times like this:

Now it makes more sense and we can view it as a feed forward neural network. In fact an RNN is a feed forward network with the constraint that corresponding weights across time steps have to be identical. Notice how $W_{s00}$ and $W_{i00}$ are repeated with every time step. So whereas the weights in different layers in a feed forward neural net can be different, in an RNN they have to be the same. You might be thinking about how to handle the other input sequences of different lengths in the training set. We'll get to that later. For now let's assume that all sequences in the training set are grouped by length and that each minibatch consists of same length sequences.

Since training will involve finding the gradient of the loss with respect to each weight, let's start by finding the gradient of the output with respect to a sample of weights. If you're not familiar with the back propagation algorithm and how gradients are used to train neural networks in general you should check out this previous blog post before continuing on.

Let's start with the gradient with respect to a non-recurrent weight in the output:

$$\frac{do_0}{dW_{o00}} = \frac{d}{dW_{o00}}f_o(W_{o00}s_0^3 + W_{o10}s_1^3) = f_o'(\ldots)s_0^3$$

That was straight forward. What about for recurrent weights?

$$\frac{do_0}{dW_{s00}} = \frac{d}{dW_{s00}}f_o(W_{o00}s_0^3 + W_{o10}s_1^3)$$
$$= f_o'(\ldots)(\frac{d}{dW_{s00}}f_s(W_{s00}s_0^2 + W_{s10}s_1^2 + W_{i00}i_0^3 + W_{i10}i_1^3) + \frac{d}{dW_{s00}}f_s(W_{s01}s_0^2 + W_{s11}s_1^2 + W_{i01}i_0^3 + W_{i11}i_1^3))$$
$$= f_o'(\ldots)(f_s'(\ldots)(\frac{d}{dW_{s00}}W_{s00}s_0^2))$$

And it is at this point that we will realise that things are more complicated than usual with feed forward neural nets. This is because $s_0^2$ can be decomposed to reveal more terms that contain $W_{s00}$ which means that we need to use the product rule.

$$= f_o'(\ldots)(f_s'(\ldots)(s_0^2\frac{d}{dW_{s00}}W_{s00} + W_{s00}\frac{d}{dW_{s00}}s_0^2))$$
$$= f_o'(\ldots)(f_s'(\ldots)(s_0^2 + W_{s00}\frac{d}{dW_{s00}}f_s(W_{s00}s_0^1 + W_{s10}s_1^1 + W_{i00}i_0^2 + W_{i10}i_1^2)))$$

...and so on, which would require as many decompositions as the length of the sequence. This is not compatible with the back propagation algorithm as it's not easy to extract a pattern that works for any sequence length. Keep in mind that we need to do this for the input weights as well.

Fortunately there is a simple solution: Treat all weights as being different and then add together the corresponding derivatives. What this means is that you put a superscript on each weight which indicates the time step it belongs to, hence making each weight different. So instead of having $W_{s00}$ we'd have $W_{s00}^3$, $W_{s00}^2$ and $W_{s00}^1$. Then we find the derivatives of each separate weight and finally add them all up:

$$\frac{do_0}{dW_{s00}} = \frac{do_0}{dW_{s00}^3} + \frac{do_0}{dW_{s00}^2} + \frac{do_0}{dW_{s00}^1}$$

This allows us to use normal back propagation to find each individual weight as if we were working on a feed forward neural net and then finally just add together corresponding derivatives in order to keep the weights identical. Notice that this is not a hack to force the weights to remain identical. The sum of the subderivatives really does equal $\frac{do_0}{dW_{s00}}$. You can try proving it yourself by trying to find the derivative using product rules as I was doing before. This trick is called back propagation through time.

We now get back to the question of handling variable length sequences in a training set. The solution to this is to make all sequences of equal length by padding them with pad vectors (the all zeros vector for example) and then make the RNN simply return the current state unmodified if the input is a pad vector. That way the state will remain the same beyond the length of the sequence, as if there were no pad vectors. This is the new RNN equation:

$$s^0 = \mathbf{0}$$
$$s^t =
f_s([s^{t-1} i^t] W_s + b_s) & \quad \text{if } i^t \text{ is not a pad}\\
s^{t-1} & \quad \text{otherwise}
$$o = f_o(s^T W_o + b_o)$$

You can now see how to implement a language model which predicts the next word in a partial sentence using Tensorflow by checking this previous blog post. You might also want to learn about how to represent words as vectors using word embeddings in this other blog post.

Saturday, October 21, 2017

Hyperparameter tuning using Hyperopt

One of the most tedious but important things there is in machine learning is tuning the hyperparameters of your machine learning algorithm, such as the learning rate and initial parameters in gradient descent. For example you might want to check how your gradient descent algorithm performs when the learning rate is 1.0, 0.1, or 0.01 and the initial parameters being randomly chosen between -10.0 and 10.0 or between -1.0 and 1.0.

The problem with doing this is that unless you have several supercomputers at your disposal, testing the learning algorithm is a slow process and there are so many different ways to configure your hyperparameters. In the above example you'd have to try six different configurations: (1.0,(-10.0,10.0)), (1.0,(-1.0,1.0)), (0.1,(-10.0,10.0)), etc.. An exhaustive search (grid search) would take too long if each test takes very long and in practice you'd have a search space which is much bigger than six. We can test a random sample of the configurations but ideally the sample would be chosen intelligently rather than randomly. We could start out by trying some randomly chosen configurations and then start homing in on some of the more promising hyperparameter choices.

This is where the Python library Hyperopt comes in. It's a simple library that searches a space of hyperparameters in a rational way so that you'll end up with a better result after trying 100 configurations than if you just randomly sampled 100 different configurations and picked the best performing one. It does this by using a tree of Parzen Estimators.

Let's start with a simple gradient descent example that finds the minimum of "y = x^2". We'll write a function that takes in a learning rate, and a range within which to initialize "x" (we'll only ask for the maximum of this range and assume that the negative of this number is the minimum). We'll then apply gradient descent on the initial "x" value for 10 epochs. The function will finally return the value of "x" that was found near the minimum.

import random

def loss(x):
    return x**2

def loss_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x):
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x

Great, now we want to find the best hyperparameters to use. In practice we'd have a validation set for our machine learning model to learn to perform well on. Once the hyperparameters that result in the best performance on the validation set are found, we'd apply them to learn on a separate test set and it is this performance that is used to judge the quality of the learning algorithm. However, for this blog post we'll instead focus on the simpler mathematical optimization problem of finding the minimum of "y = x^2".

This is how you use Hyperopt to find the best hyperparameter combination. First you create a function called an objective function that takes in a tuple with the chosen hyperparameters and returns a dictionary that is read by hyperopt to assess the fitness of the chosen hyperparameters. Then you take this function and the possible hyperparameter choices you want to allow and pass them to the hyperopt function called "fmin" which finds the hyperparameters that give the smallest loss.

import hyperopt

def hyperopt_objective(hyperparams):
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,

best = hyperopt.fmin(
        space     = [
                hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
                hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
        algo      = hyperopt.tpe.suggest,
        max_evals = 10

The output of this program is this:
{'learning_rate': 1, 'max_init_x': 1}
This is saying that the best loss is obtained when the learning rate is the item at index 1 in the given list (0.1) and the maximum initial value is the item at index 1 in the given list (1.0). The "space" parameter in fmin is there to say how to construct a combination of hyperparameters. We specified that we want a list of two things: a learning rate that can be either 1.0, or 0.1, or 0.01, and a maximum initial value that can be either 10.0 or 1.0. We use "hp.choice" to let fmin choose among a list. We could instead use "hp.uniform" in order to allow any number within a range. I prefer to use a list of human friendly numbers instead of allowing any number so I use the choice function instead. We have also said that we want to allow exactly 10 evaluations of the objective function in order to find the best hyperparameter combination.

Although this is how we are expected to use this library, it is not very user friendly in general. For example there is no feedback given throughout the search process so if each evaluation of a hyperparameter combination takes hours to complete then we could end up waiting for several days without seeing anything, just waiting for the function to return a value. The return value also requires further processing as it's just indexes. We can make this better by adding some extra stuff to the objective function:

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
    global eval_num
    global best_loss
    global best_hyperparams

    eval_num += 1
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    print(eval_num, l, hyperparams)

    if best_loss is None or l < best_loss:
        best_loss = l
        best_hyperparams = hyperparams

    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,

We can now see each hyperparameter combination being evaluated. This is the output we'll see:
1 1.6868761146697238 (1.0, 10.0)
2 0.34976768426779775 (0.01, 1.0)
3 0.006508209785146999 (0.1, 1.0)
4 1.5999357079405185 (0.01, 10.0)
5 0.2646974732349057 (0.01, 1.0)
6 0.5182259594937579 (1.0, 10.0)
7 53.61565213613977 (1.0, 10.0)
8 1.8239879002601682 (1.0, 10.0)
9 0.15820396975495435 (0.01, 1.0)
10 0.784445725853568 (1.0, 1.0)

Also, the variable best_hyperparams will contain the tuple with the best hyperparameters found. Printing best_hyperparams will show "(0.1, 1.0)". We can even save the best hyperparameters found till now in a file so that we can stop the search early if we run out of patience.

Here is the full code in one place:

import random
import hyperopt

def loss(x):
    return x**2

def loss_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x):
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
    global eval_num
    global best_loss
    global best_hyperparams

    eval_num += 1
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    print(eval_num, l, hyperparams)

    if best_loss is None or l < best_loss:
        best_loss = l
        best_hyperparams = hyperparams

    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,

        space     = [
                hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
                hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
        algo      = hyperopt.tpe.suggest,
        max_evals = 10


To find out more about Hyperopt see this documentation page and the Github repository.

Monday, September 18, 2017

Find an equation that passes through a set of points (Lagrange polynomial interpolation)

When I was in school, I remember learning about how to find the equation of a line that passes through two points. When I asked how to find the equation of a curve instead I was told that this was not possible, probably because there is an infinite number of curves that pass through a finite set of points (as we'll see below). However I would have been happy to learn about this simple method to produce a polynomial curve that passes through any given set of points. In general this task is called interpolation and the particular interpolating method we shall see here is called Lagrange polynomials.

A polynomial is an equation of the form

$$a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$$

For example $2x^3 + 4x^2 - 3x + 1$ is a polynomial where 2 is $a_3$, 4 is $a_2$, 4 is $a_1$ and 1 is $a_0$. A polynomial plus another polynomial is also a polynomial, for example $(x^2 + 2x) + (3x + 2) = x^2 + 5x + 2$. A polynomial times another polynomial is also a polynomial, for example $(x^2 + 2x)(3x + 2) = 3x^3 + 8x^2 + 4x$.

Now on to polynomial interpolation. Let's start with the simplest case, when you have only one point. Let's say you want a polynomial that passes through (3,2), that is, when $x$ is 3, the polynomial should equal 2. In this case our equation can simply be

$$y = 2$$

that is, the polynomial is just 2. The equation is always 2 regardless of where $x$ is so we have found our polynomial.

Graph of equation $y = 2$.

Now on to a more interesting case. Let's say we want to pass through these points:
  • (3,2)
  • (4,3)
  • (6,4)

The trick is to add up a set of polynomials each of which goes through one point. We need a polynomial that equals 2 when $x$ is 3, another polynomial that equals 3 when $x$ is 4, and another polynomial that equals 4 when $x$ is 6. But we have to be careful as we don't want these three polynomials to interfere with each other after being added up together. For example if we can't just do $y = (2) + (3) + (4)$ as the two terms will interfere with each other and the result will not pass through any of the points. Instead we need each polynomial to have a shape such that each one passes through its corresponding point but then passes through 0 in place of the remaining points. The three polynomials we need to add up together are:
  • Polynomial corresponding to (3,2): must equal 2 when $x$ is 3 and equal 0 when $x$ is 4 or 6.
  • Polynomial corresponding to (4,3): must equal 3 when $x$ is 4 and equal 0 when $x$ is 3 or 6.
  • Polynomial corresponding to (6,4): must equal 4 when $x$ is 6 and equal 0 when $x$ is 3 or 4.
Since the polynomials equal 0 where the other points are then they will not interfere with each other where the polynomials pass through their corresponding point.

Let's focus on the zeros first. There is an easy trick to ensure that a polynomial goes through zero at certain values of $x$. If you want your polynomial to be zero when $x$ is 4 or 6, then use $(x-4)(x-6)$. In this polynomial, when $x$ is 4 then the first bracket will equal 0, which makes the whole thing 0, and when $x$ is 6 the second bracket will equal 0 which will also make the whole thing 0. This is what $y = (x-4)(x-6)$, $y = (x-3)(x-6)$ and $y = (x-3)(x-4)$ looks like:

Graph of equation $y = (x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.

Graph of equation $y = (x-3)(x-6)$. Passes through zero when $x$ is 3 or 6.

Graph of equation $y = (x-3)(x-4)$. Passes through zero when $x$ is 3 or 4.

See how each curve passes through zero at every point except one? That exception point will be the corresponding point of each polynomial. Adding these polynomials up will not make them interfere with each other at the x-values of each point. But in order to get the resultant polynomial to pass through the points, we need each separate polynomial to pass through its corresponding point whilst still being zero at every other point. For this we apply another easy trick which is to multiply the polynomials by a number. Multiplying an equation by a number will not change where it is zero. It will change the shape of the curve but the places at which it is zero will not be moved. This is what $(x-4)(x-6)$ looks like after being multiplied by 2, 0.5, and -1:

Graph of equation $y = 2(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.

Graph of equation $y = 0.5(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.

Graph of equation $y = -(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.

So we just need to find the number that when multiplied by each separate polynomial will make it pass through its corresponding point. This number is the y-value of the corresponding point divided by the current value of the polynomial there. What we're doing is first making the polynomial equal 1 at its corresponding point and then multiplying it by whatever number we want it to be. For example if you want to multiply the number 2 by a number that will make the result 3, that number should be $\frac{3}{2}$, that is, $2 \times \frac{3}{2} = 3$.

So what is the current value of $(x-4)(x-6)$ at its corresponding point, (3,2)? It's $(3-4)(3-6) = 3$. So by multiplying $(x-4)(x-6)$ by $\frac{2}{(3-4)(3-6)}$ we can make the polynomial pass through 2, without changing where it is zero. This is what $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$, $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$ and $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$ looks like:

Graph of equation $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$. Passes through corresponding point (3,2) but through zero when $x$ is 4 or 6.

Graph of equation $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$. Passes through corresponding point (4,3) but through zero when $x$ is 3 or 6.

Graph of equation $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$. Passes through corresponding point (6,4) but through zero when $x$ is 3 or 4.

Now we can add them up into
$$y = (x-4)(x-6)\frac{2}{(3-4)(3-6)} + (x-3)(x-6)\frac{3}{(4-3)(4-6)} + (x-3)(x-4)\frac{4}{(6-3)(6-4)}$$
which is simplified into $-\frac{1}{6} x^2 + \frac{13}{6} x - 3$

Final graph that passes through the 3 points.

If we added another point at (10,7) then the polynomial passing through all the points would be
$$y = \frac{2(x-4)(x-6)(x-10)}{(3-4)(3-6)(3-10)} + \frac{3(x-3)(x-6)(x-10)}{(4-3)(4-6)(4-10)} + \frac{4(x-3)(x-4)(x-10)}{(6-3)(6-4)(6-10)} + \frac{7(x-3)(x-4)(x-6)}{(10-3)(10-4)(10-6)}$$

Graph that passes through the 3 points plus (10,7).

See how we have found another curve that also passes through the previous 3 points? This equation building method proves that you can find an infinite number of polynomials that pass through a finite number of points, since you can always make a polynomial that passes through the given set of points plus any other point anywhere where each position of the new point requires a differently shaped curve. Since there is an infinite amount of positions where to place the extra point, there is an infinite number of polynomials (and thus curves) that can pass through the given set of points.

In general, given points $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$, your polynomial will be
$$\sum^{n}_{i=1} y_i \frac{\prod^{n}_{j=1,j\neq i} (x-x_j)}{\prod^{n}_{j=1,j\neq i} (x_i-x_j)}$$

Notice that you cannot have two points with the same x-value or you'll get a division by zero. Notice also how you are not guaranteed to get an extrapolation from your points. The polynomial will completely derail into an upward or downward curve beyond the range of the given points, regardless of any pattern suggested by the points. Finally your equation will get more and more complicated with every new point, even after simplification. If you have $n$ points then you will have up to $n$ terms in your polynomial. It might be simplifiable into less terms, but that is rare.