Overview
The system described in this paper was used for the EVALITA 2009 where a manually constructed evaluation set in Italian was provided.In order to determine how well a word can replace another in a context, two types of vectors are used: a contextualized vector and an out-of-context vector. The contextualized vector is a composition of out-of-context vectors.
First, out-of-context vectors are collected from the corpus for each word using the usual method of counting how often each word co-occurs with the word being represented by the vector. The counts are then weighted to give less importance to words that co-occur with many words.
In order to find a substitutable word for a target word in a sentence, the out-of-context vectors of each content word in the sentence are collected and their weights are summed together. This is the contextualized vector. The word with the out-of-context vector that is most similar to this contextualized vector is chosen as the best word to substitute.
Nitty gritty
A lemmatized and part of speech tagged 2 billion token Italian corpus was constructed from which to extract the out-of-context vectors. The 20,000 most frequent nouns, verbs, adjectives and adverbs were considered as content words. Out-of-context vectors for these content words were extracted by counting how often they appear with other content words in the same sentence. The counts were weighted using log likelihood ratios and the co-occurrence matrix of counts was compressed to half its size using random indexing.To create a contextualized context vector, the context sentence is lemmatized and part of speech tagged. The sentence is then split into a left and right side with respect to the target word and each side is filtered of all non-content words. The words in the sentence sub-parts that are within a set number of content words from the target word are used to construct the contextualized sentence. Their out-of-context vectors are normalized, multiplied by "1/(4*d)" where "d" is the distance of the word in the sentence sub-part from the target word (the closest content word has a distance of 1), and summed together.
Finally, cosine measure is used to find the most substitutable words by measuring the distance of the contextualized context vector to the out-of-context vectors of all the words that have the same part of speech tag as the target word.
Evaluation
The system was evaluated by using an evaluation set which consisted of sentences with a target word and suggested substitutable words by human annotators. Each suggestion included the number of annotators that suggested it. When comparing the best suggested substitutable word by the system with the most popular suggested word by the annotators, the system matched 10.84% of the evaluation set (that included a most popular suggested word).
No comments:
Post a Comment