Geeky is Awesome: March 2018

Thursday, March 29, 2018

Word Mover's Distance (a text semantic similarity measure)

Word Mover's Distance is a way to measure the similarity between the meaning of two sentences. It's basically a bag of words method (ignores the order of the words in sentences) but it's been found to work really well, including for evaluating automatic caption generation. You can find code to perform k-nearest neighbours classification of sentiment written by the creator of WMD here, but reading the code is a challenge so here's how WMD works.

This measure makes use of an existing distance measure called Earth Mover's Distance. Given a set of heaps of dirt and a set of holes, and both the heaps and the holes have a point as a location (a vector) and a certain mass or capacity (a weight), what is the least amount of work needed to move all the dirt into the holes or to fill all the holes completely (depending on whether there is more total mass or total capacity)? By work we mean the total distance travelled multiplied by the mass carried. This is what EMD calculates. It measures the distance between a set of vectors paired with weights and another set of vectors paired with weights, without requiring that both sets have the same number of vectors. It is a linear programming problem and there is a Python library for it called "pyemd" that can be installed via pip.

Now back to Word Mover's Distance. WMD uses EMD between two sentences. It assumes some kind of pre-computed word embedding vectors such as word2vec, which are vectors that encode the meaning of words. The words in the sentences are the heaps and holes and these word vectors are their locations. For weights, we use the normalized frequency of the words in the sentence, that is, the number of times the word was found in the sentence divided by the total number of words in the sentence. We also ignore any words that are stop words (non-informative common words like "the" and "of") and any words that are not found in the list of word vectors. Once we have collected the set of vectors and weights of the two sentences we then just put them in the EMD solver to get the distance.

The Python library "gensim" implements WMD given a word2vec dataset and two lists of words (the sentences) without stop words. You can find a tutorial on how to use it here.