The problem with this approach shows up when you measure the probability or perplexity of a sentence based on the probabilities of its individual words. If you're comparing language models to see which one makes the best predictions, you usually run them all on the same corpus and check how well they predict the words in it. The higher the probabilities they assign to the corpus sentences, the better the language model, and this is usually measured using perplexity. But watch what happens when you vary the vocabulary size: you will find that smaller vocabularies seem to lead to better language models, even though that makes no sense.
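Just to keep the terms straight, perplexity here is the standard definition: the inverse of the geometric mean of the word probabilities,
$$
PP(w_1 \dots w_N) = P(w_1 \dots w_N)^{-\frac{1}{N}} = \left( \prod_{i=1}^{N} p(w_i|w_1 \dots w_{i-1}) \right)^{-\frac{1}{N}}
$$
so a higher sentence probability means a lower (better) perplexity.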
It turns out that if you just multiply together all the probabilities of the individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Say you have a vocabulary size of 0, that is, you consider every word to be out-of-vocabulary, so all of them get replaced by the unknown token. During training, the language model will then learn that after every unknown token there is (probably) another unknown token, unless it's the end of the sentence. This makes the language model assign very high probabilities to the unknown token, and high word probabilities mean high sentence probabilities, which mean good perplexities.
Now if we add another word to the vocabulary, we introduce some uncertainty into the language model, as it now has to decide between using the unknown token and the known word. Even in a perfect language model, the same prefix of words can be followed by either of the two, so there is no way to correctly assign 100% of the probability to one or the other. This means that the probability gets split between the two words, leading to an overall decrease in word probabilities and hence a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and predict the right word.
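Here's a toy illustration of the effect (a minimal sketch with made-up numbers, not output from a real model): an "all-UNK" model that can confidently predict the unknown token looks nearly perfect, while a model that genuinely splits probability between two plausible words looks worse.

```python
import math

# Eight tokens, each predicted as UNK with high confidence (made-up numbers).
probs_all_unk = [0.99] * 8
# The same eight tokens with probability split between two candidate words.
probs_split = [0.5] * 8

def perplexity(probs):
    """Inverse geometric mean of the per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

print(perplexity(probs_all_unk))  # ~1.01: looks like an excellent model
print(perplexity(probs_split))    # 2.0: looks much worse, despite doing more work
```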
We can't do away with the unknown token, but we can strip away its power. Assuming that all the language models are being evaluated on the same corpus, different vocabularies will have different words being turned into the unknown token. Say your language model has 1000 different words in its vocabulary, but the corpus you're evaluating it on contains 500 different words that are out-of-vocabulary. In reality your language model is predicting one of 1500 different words; it's just that 500 of those words are lumped together as a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words, not just one. If we avoid merging all those probabilities into one, then all the language models can be compared fairly, as they will all have the same effective vocabulary and the same amount of uncertainty about which word comes next. The question is how to distribute that single unknown token probability among the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and give each word an equal slice of the whole. So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.
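In code, the re-weighting is just this (a minimal sketch; `vocab` is the model's vocabulary and `num_oov_types` is the number of different out-of-vocabulary words in the evaluation corpus, 500 in the example above; the names are mine, not from any particular library):

```python
def adjusted_word_prob(word, prob, vocab, num_oov_types):
    """Spread the unknown token's probability uniformly over the
    out-of-vocabulary word types.

    `prob` is the probability the language model assigned to `word`
    (or to the unknown token, if `word` is not in `vocab`).
    """
    if word in vocab:
        return prob
    return prob / num_oov_types
```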
Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.
Let's say that we want to find the probability of the following sentence:
the loud dog barked at the loud man
and let's say that the language model we're using to do that has the following vocabulary:
the at dog man
this means that the sentence is now changed to:
the UNK dog UNK at the UNK man
Now the naive way to get the probability of the sentence is as follows:
$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})
$$
But now, with the new way, we divide each of the unknown token's probabilities by 2, the number of different out-of-vocabulary words ('loud' and 'barked'):
$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})
$$
Of course we can also leave the re-weighting until the end, dividing the naive probability by the number of different out-of-vocabulary words raised to the power of the number of unknown tokens in the sentence, like this:
$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})/ 2^3
$$
Now the sentence probability goes up as the vocabulary size increases!
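Putting it all together, here is a minimal sketch of the corrected sentence probability for the example above. It assumes a hypothetical `lm_prob(token, prefix)` function that returns the language model's probability of `token` given `prefix`, and an explicit end-of-sentence token `'</s>'` standing in for the empty strings in the equations; neither is part of any particular library.

```python
import math

vocab = {'the', 'at', 'dog', 'man'}
sentence = 'the loud dog barked at the loud man'.split()

# Number of *different* out-of-vocabulary words in the evaluation corpus
# (here just 'loud' and 'barked', so 2).
num_oov_types = len({w for w in sentence if w not in vocab})

def corrected_log_prob(sentence, lm_prob):
    """Log probability of the sentence with every unknown token's probability
    divided by the number of distinct out-of-vocabulary words (equivalent to
    dividing the naive probability by num_oov_types ** number_of_UNK_tokens)."""
    log_p = 0.0
    prefix = []
    for word in sentence + ['</s>']:  # include the end-of-sentence prediction
        token = word if (word in vocab or word == '</s>') else 'UNK'
        p = lm_prob(token, prefix)
        if token == 'UNK':
            p /= num_oov_types
        log_p += math.log(p)
        prefix.append(token)
    return log_p
```

For this sentence there are three unknown tokens, so the correction amounts to dividing the naive probability by $2^3$, exactly as in the last equation.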