Friday, September 6, 2013

Summary of research paper "Automatic Word Sense Discrimination" by Hinrich Sch├╝tze

This is a summary of the research paper This paper describes a way to automatically identify different senses (meanings) for words with multiple senses and to disambiguate the sense of a word in context according to these identified senses.

We start by creating word vectors for each word by counting how often certain other words occur close to it. These other words are called the dimensions of the vector. This vector space is called the word space.

For example, if the word "judge" is to be projected into a word space with dimensions being "legal" and "clothes", then we go through a corpus and count the number of times the words "legal" and "clothes" occur close to "judge" and record these frequencies in a vector. This vector would represent the meaning of "judge" and we'd expect the count for "legal" to be higher than the count for "clothes". On the other hand the word vector for "robe" using the same dimensions would have a higher count for "clothes" than for "legal". The problem is that, in a corpus, an ambiguous word like "suit" would have both the legal sense and the clothes sense so its vector would not give a good indication of its meaning.

In order to disambiguate the sense of an ambiguous word like "suit", the sentence in which it occurs is used as a context to see what that particular instance of the word means. Words are chosen from the sentence, called features, in order to be used as cues to identify in which sense the ambiguous word is being used. For example, if in the sentence with the word "suit" the words "law", "judge" and "statute" are found, then we can be confident that the sense of "suit" in that context is the legal sense, since all of these words have a high count for the "legal" dimension in their word vector. In practice, centroid of the word vectors of these features is calculated and this centroid is then used to see in which dimension the word vectors of the features are generally pointing. This centroid is called the context vector of the ambiguous word.

Of course in practice there will be a lot of dimensions in the word space so senses will not be limited to a single dimension but to combinations of dimensions. This means that in order to establish what the senses of an ambiguous word are, the context vectors of each instance of the word must be clustered such that similar context vectors are clustered as sharing the same sense and each cluster will be a sense of the ambiguous word. The centroid of each cluster is called a sense vector and when an ambiguous word is to be disambiguated into a sense, the sense vector which is most similar to the word's context vector would be considered its sense.

Nitty gritty
Since there must be of a fixed amount of dimensions of the word space and the feature words which can be used for context vectors, a method to select them must be used. Two methods were tested: local and global selection. In local selection the feature and dimension words are exactly the same and different dimensions/features are used for each ambiguous word; so the system will only disambiguate ambiguous words for which dimensions and features were already determined. In global selection the dimension and feature words are different from each other and are selected once such that all ambiguous words are disambiguated using these same dimensions and features.

Local selection works by simply checking all words which are not stop words and which occur within 25 words of the ambiguous word and accepting as dimensions/features only the 1000 most important words. Importance is measured using one of two criteria: the number of times the dimension/feature occurs near the word in question and the chi-squared test of how much the ambiguous word and the dimension/feature word correlate together.

Global selection on the other hand uses the 20,000 most frequent non-stop words in the corpus as features and the 2,000 most frequent words as dimensions.

Once the dimensions and features are collected, they are arranged into a matrix such that the rows represent the dimensions and the columns the features. In this way the columns of the matrix are the word vectors of the different features. In order to disambiguate an ambiguous word, either these word vectors are used as is or they are reduced using Singular Value Decomposition (SVD) in order to reduce the size of the word vectors (number of rows) to 100 (which means that the word vectors will not correspond to frequencies of neighbouring words anymore but to abstract dimensions). Global selection was not tested with SVD.

Context vector elements are normalized and weighted by inverse document frequency.

Similarity between vectors is calculated using cosine similarity and clustering is done using the Buckshot algorithm.

In order to evaluate the system, pseudowords were created by taking two different words and replacing them with the same invented word. If we have two sentences "The boy peeled the banana." and "The boy opened the door.", and the words "banana" and "door" are replaced with the pseudoword "banana/door" such that the sentences are now "The boy peeled the banana/door." and "The boy opened the banana/door.", then the pseudoword is now an ambiguous word with two senses: the banana sense and the door sense. The system is now to attempt to identify these two senses and identify when it is used in the banana sense and when it is used in the door sense. This is easy to evaluated since we already know how each instance should be disambiguated and there is no need for prior manual identification of the senses.

The following is the average percent correct disambiguation for 20 pseudo words for different experimental conditions:
Local-Frequency-no SVD:77.8
Local-Chi squared-no SVD:72.1
Local-Chi squared-SVD:84.1