Monday, March 27, 2017

Word embeddings: How word2vec and GloVe work

Word embeddings are vectors that represent words. For example the word "dog" might be represented as [0.1, -2.1, 1.2] whilst "cat" might be represented as [0.2, 2.4, 1.1]. These vectors are important in neural networks because neural networks can only work with continuous numbers whereas words are discrete symbols.

One way to represent words as vectors is by taking a finite and fixed vocabulary of words that you want to work with, put the words in the vocabulary in a certain order such as alphabetical, and then use one hot vectors to represent the words. A one hot vector is a vector consisting of zeros everywhere and single one somewhere, such as [0, 0, 1, 0]. The position of the one indicates the position of the word in the vocabulary. The one hot vector [1, 0, 0] represents the first word in a three word vocabulary. You can then feed this vector as an input to a neural network and continue as usual.

One hot vectors are a very wasteful representation in both space and time. They are wasteful in space because you need a vector as large as your vocabulary to represent each word. They are wasteful in time because when computing activation values of the first layer in the neural network, the weight matrix multiplication is going to involve multiplying by a lot of zeros. This is what the matrix multiplication looks like when then first layer outputs two activation values:

(1 0 0) (1 2) = (1*1 + 0*3 + 0*5    1*2 + 0*4 + 0*6) = (1 2)
(3 4)
(5 6)


In fact, if you think about it, all you need is to pick the row in the matrix corresponding to where the one is in the one hot vector. What we usually do in fact is to have an "embedding layer" which is just a matrix with as many rows as the vocabulary size and as many columns as you want your embedded vectors to be big. We then simply pick the row according to the word we want to pass to the neural network. This embedding matrix is then optimized together with the rest of the neural network and the selected vectors passed as input to the next layer as usual.

This has an interesting result. The vectors tend to become similar for similar words, that is, the more similar two words are, the larger the cosine similarity of their corresponding vectors. There are also some other interesting properties such as analogies (asking "man is to king as woman is to...?") and odd word out exercises (which word from these five does not fit with the others?). It seems that the vectors are not just made different for different words but are made different to different degrees according to the differences between the words. If two words are used similarly in certain contexts but not others (such as being similar in gender but not in other cases), then that similarity will be encoded in the vectors. You can see more about these properties in https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.

Of course there are systems for creating generic word embeddings that are useful for many natural language processing applications. The two most popular generic embeddings are word2vec and GloVe.

word2vec is based on one of two flavours: The continuous bag of words model (CBOW) and the skip-gram model. CBOW is a neural network that is trained to predict which word fits in a gap in a sentence. For example, given the partial sentence "the cat ___ on the", the neural network predicts that "sat" has a high probability of filling the gap. The order of the words is not important: given four words, the neural network predicts the word in the middle. The embedding layer of the neural network is used to represent words in general. On the other hand, the skip-gram model works in the other way round. Given a middle word it predicts which words surround it. Of course by "predicts" we mean that it outputs a probability for each word in the vocabulary. These tasks are meant to force the neural network to create embeddings that reflect the relationship between words, hence bestowing them with meaning.

GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.

Which is better? It probably depends on your data. The nice thing about both of these methods is that you can train your own word vectors based on your own corpus so that the meaning that gets encoded into the vectors becomes specific to your own domain.