## Wednesday, September 28, 2016

### Text generation using a language model

A language model is something that tells you the probability of a sequence of words. For example it tells you that the sequence "a dog is barking" is more probable than "a dog is was". This probability can then be used to generate text by selecting a sequence of words that is probable.

One way to do this is to take a huge amount of text and count how often each 4 word segment occurs, then divide by the total amount of 4 word segments. For example, you find that in your huge text, "a dog is barking" to occurs 42 times and divide it by the amount of 4 word segments in the text. This will give you an approximate probability of "a dog is barking".

Usually language models give you the probability of a particular word following a prefix of a sentence. For example given the prefix "a dog is", what is the probability that the next word in the prefix is "barking"? This is expressed mathematically as P("barking" | "a dog is"). This is what neural network language models do. You give them a prefix of words and they output a number for each known word, where the number is the probability of that word following the prefix.

Once you have a probability distribution for all the words in your vocabulary, you can then pick the word with the highest probability to continue the incomplete sentence. If you do this for every word, you will generate a whole sentence but you will always generate the same sentence. In order to generate a random sentence, we need to choose a random next word, but in such a way that highly probable words are more likely to be chosen in order to avoid generating gibberish.

The softmax function
Given a set of numbers, such as frequencies, we can turn these numbers into probabilities by using the maximum likelihood estimation as described above. This is when you divide a frequency by the total. However this isn't the best way to do things. First of all this assumes that words that don't occur in the text have a probability of zero, which is unlikely to be the case as it's more likely that the word just has a very small probability than a probability of zero. Second of all, this requires the number to be frequencies, that is, to be all greater than or equal to zero, which might be too limiting. A neural network's outputs can be used to quantify how strongly it believes a particular word should follow a prefix, but this quantity might be negative in order to avoid being limited by a lower bound. Thirdly, we might want to skew the probabilities so that the most probable word has a higher probability and the least probable word has a lower probability or vice versa. This is done so that when you're selecting words based on their probability you either bias the selection towards higher probability words or you go the other way and make the probabilities more similar in order to generate more random looking text.

To address all these concerns we can use a softmax function to convert the quantities into probabilities which uses a Boltzmann distribution. Given a list of "n" quantities called "q", the probability of the kth quantity is:
e^(q_k/T) / sum(e^(q_i/T) for i in 1..n)
where "T" is called the temperature and is used to control how similar the probabilities should be (by default it is 1).

For example if your quantities are [0, 1, -1], then for temperature 1 the probabilities are:

0:  e^(0/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.245
1:  e^(1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.665
-1: e^(-1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.09

Notice how the probabilities and all valid (non-negative and sum to 1), how by raising "e" to the power of the quantity makes it always greater than zero and see what happens when the temperature is set to a larger number:

0:  e^(0/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.307
1:  e^(1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.506
-1: e^(-1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.186

See how more similar the probabilities are now with a higher temperature?

The nice thing about the temperature is that even if you have already calculated the probabilities using temperature 1 and have lost the original quantities, you can still change the temperature of the probabilities as follows:

Given probabilities "p", new probability "p'_k" based on temperature "T" are
p'_k = p_k^(1/T) / sum(p_i^(1/T) for i in 1..n)

From the above examples,
0:  0.245^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.307
1:  0.665^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.506
-1: 0.09^(1/2)  / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.186

The roulette wheel selection
Now how do we select a word using it's probability? There is an algorithm called Roulette wheel selection which does this. The idea is to put all the probabilities next to each other on a number line and then select a point on the number line at random. The word whose probability contains the point is chosen. Here's a diagram showing the probabilities 0.6, 0.3 and 0.1 next to each other on a number line and a random red point is placed on the number line in the "word 1" region meaning that word 1 was selected: To do this, we generate a random number between 0 and 1 (the random point), and then start adding probabilities (in any order) until the total becomes larger than the random number. The word belonging to the last probability added is the one that is chosen. Here's a Python 3 implementation:

def select(probs):
r = random.random() #random number between 0 and 1
total = 0.0
for i in range(len(probs)):
total += probs[i]
if total > r: #important to use > and not >= in order to avoid selecting words with a probability of zero
return i
#a words must have been selected at this point or else the probabilities are invalid (either negative or do not sum to 1
raise ValueError('Invalid probabilities')