Wednesday, October 22, 2014

Thoughts on lexical substitution: a contextual thesaurus

Lexical substitution is the task of selecting a word that can replace a target word in a context (a sentence, paragraph, document, etc.) without changing the meaning of the context. If you think about it, this is the main application of a thesaurus: using synonyms to replace words.

The problem is, however, that a thesaurus is not necessary for substituting words. Consider the words "questions" and "answers". Would you find them as synonyms in a thesaurus? In the sentence "I got 10 answers right", the word "answers" can be substituted with "questions". Thesauri are not sufficient either because two words which are commonly considered synonyms might not be substitutable in certain contexts even if they are of the same sense, such as "error" and "mistake" in the sentence "An error message popped up."

Many systems that automatically perform lexical substitution or automatically find synonyms (see the SemEval 2007 English lexical substitution task, for example) follow a simple recipe. To substitute the target word "sat" in the sentence "The cat sat on the mat.":

  • Extract the words "the", "cat", "on", and "mat" from the sentence, that is, the words which are near the target word in the sentence. These are called feature words.
  • Look for another word which occurs in sentences containing the same feature words.
  • That word can replace "sat" in "The cat sat on the mat".

There might be restrictions on which feature words are used, such as not using common words like "the" and "on"; there might be word order restrictions such as the feature words having to be in order in the sentence; there might even be some higher order relationship such as using feature words of feature words; however, essentially all of these methods assume that if another sentence is found which is "The cat ____ on the mat.", then whatever word fills that blank will be a substitutable word for "sat".
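The recipe above can be sketched in a few lines of Python. This is only a toy illustration and not the method of any particular SemEval system: the corpus, the stop-word list, and the function names are all made up for the example, and real systems would use much larger corpora and softer matching than exact equality of feature sets.

```python
# Toy corpus and stop-word list, invented purely for illustration.
STOP_WORDS = {"the", "on", "a"}

CORPUS = [
    "the cat sat on the mat",
    "the cat slept on the mat",
    "the cat lay on the mat",
    "the dog barked at the cat",
]


def feature_words(tokens, target_index, stop_words=STOP_WORDS):
    """Return the words surrounding the target word, minus stop words."""
    return frozenset(
        tok
        for i, tok in enumerate(tokens)
        if i != target_index and tok not in stop_words
    )


def candidate_substitutes(sentence, target, corpus=CORPUS):
    """Find corpus words whose feature words equal those of the target.

    `target` is expected in lowercase and must occur in `sentence`.
    """
    tokens = sentence.lower().rstrip(".").split()
    features = feature_words(tokens, tokens.index(target))
    candidates = set()
    for line in corpus:
        other = line.split()
        for i, word in enumerate(other):
            if word == target or word in STOP_WORDS:
                continue
            # Same feature words around the blank: assume substitutable.
            if feature_words(other, i) == features:
                candidates.add(word)
    return sorted(candidates)


print(candidate_substitutes("The cat sat on the mat.", "sat"))
# ['lay', 'slept']
```

Note that the sketch happily proposes "slept" for "sat": the word fits the blank, but the sentence no longer means the same thing.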

Even if thesauri are used to filter out words which have completely different meanings, as is usually done, finding X and Y in the same thesaurus entry is not enough to guarantee that two sentences "a b c X p q r." and "a b c Y p q r." have identical meaning. The problem is that lexical substitution is much more complex than that. In reality, proper lexical substitution requires sentential semantics rather than lexical semantics. When you substitute a word in a sentence, you read out the whole sentence to see if it still sounds like it means the same thing. This is not simply checking if the sentence makes sense, which is what the previous method does, but is a higher cognitive task which involves understanding the meaning of the sentence.

Consider the sentences "I have a big sister." and "I have a large sister.", or "I saw Mary." and "I watched Mary.". In these sentences, the words "big" and "large" and the words "saw" and "watched" have different meanings; however, in other sentences such as "I have a big house." and "I have a large house.", or "I saw a movie." and "I watched a movie.", the words are synonymous. No thesaurus tells you in which contexts these words mean the same thing, and all of these sentences make sense and are ones that people actually use.

Consider also that thesauri are not perfect for this task. For example, WordNet groups together "mother" and "father", whilst Roget's Thesaurus groups "man" with "brother".

It's also important to recognize whether or not the target word is part of an expression or idiom. For example, is the sentence "John kicked the bucket." referring to John actually kicking a bucket or to John dying? If it's an expression then none of the words can be changed.

It seems like understanding the meaning of the sentence is necessary in order to perform lexical substitution. After all, the best performing systems in the previously mentioned SemEval 2007 did no better than 20% correct substitutions. Sentential semantics is a very complex task to automate, since even the exact same sentence can have different correct meanings, such as the previous "John kicked the bucket." However, if idioms, collocations, and other pragmatic constructs (such as saying "expired" instead of "died" due to taboo) could be precisely detected then the rest of the sentences could be solved by using a fine grained lexical ontology.

An ontology maps the relationships between words, such as which verbs and adjectives are appropriate for which nouns. This means hard coding all possible word relations in such detail that the system would know that watching a person is different from watching a movie and that a large sister is different from a big sister. This would also require that words be categorized into semantic categories such as "person". In this way, together with a precise thesaurus, it may be possible to use dependency parsing to find which words are connected to the target word in the sentence and check whether a synonym found in the thesaurus can be semantically used in the same way as the target word in the sentence. If the substitution changes the meaning of the sentence, then the relations between the words would also change.

If you think about it, this system would be a contextual thesaurus, which is a thesaurus on steroids. Imagine a thesaurus, book form or online, which does not only list words in alphabetical order, but also includes the different words that can be connected to them (using word categories). So the thesaurus would not just have entries that belong to single words like this:

big: large
see: watch

but would instead state which words can be connected to it like this:

big [sibling]: older
big [object]: large
see [movie]: watch
seeing [person]: dating
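As a rough sketch, such entries could be stored as a lookup keyed on both the headword and the category of the word connected to it. Everything here is hypothetical: the category assignments, the dictionary names, and the `substitutes` function are invented for illustration, and in practice the connected word would come from a dependency parser rather than being passed in by hand.

```python
# Hypothetical word-to-category assignments; a real resource would need far more.
WORD_CATEGORIES = {
    "sister": "sibling",
    "brother": "sibling",
    "house": "object",
    "car": "object",
    "movie": "movie",
    "film": "movie",
    "mary": "person",
}

# Entries mirror the examples above:
# (headword, category of the connected word) -> substitutes.
CONTEXTUAL_THESAURUS = {
    ("big", "sibling"): ["older"],
    ("big", "object"): ["large"],
    ("see", "movie"): ["watch"],
    ("seeing", "person"): ["dating"],
}


def substitutes(headword, connected_word):
    """Look up substitutes for `headword` given the word it is connected to."""
    category = WORD_CATEGORIES.get(connected_word.lower())
    # Unknown words or unlisted combinations yield no substitutes.
    return list(CONTEXTUAL_THESAURUS.get((headword.lower(), category), []))


print(substitutes("big", "sister"))  # ['older']
print(substitutes("big", "house"))   # ['large']
print(substitutes("see", "Mary"))    # []
```

Keying on the pair rather than on the headword alone is what keeps "large" from being offered for "big sister".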

This would also simplify word sense disambiguation. For example, the word "suit" can be both an article of clothing and a legal action. But in the sentence "I'm wearing a suit.", only one of those two senses can be combined with the verb "wearing". This would help in finding which word category a word belongs to. If a word can be in more than one category, simply check which category can be matched with the word's connected words.
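The category check described above amounts to intersecting two sets. The following is a minimal sketch under made-up assumptions: the sense inventory, the table of which categories each verb accepts as its object, and the `disambiguate` function are all hypothetical.

```python
# Hypothetical sense inventory: the categories each word can belong to.
SENSES = {
    "suit": {"clothing", "legal action"},
    "hat": {"clothing"},
}

# Hypothetical restrictions: the categories each verb accepts as its object.
VERB_OBJECT_CATEGORIES = {
    "wearing": {"clothing"},
    "filing": {"legal action", "document"},
}


def disambiguate(word, verb):
    """Return the categories of `word` that can be combined with `verb`."""
    return SENSES.get(word, set()) & VERB_OBJECT_CATEGORIES.get(verb, set())


print(disambiguate("suit", "wearing"))  # {'clothing'}
print(disambiguate("suit", "filing"))   # {'legal action'}
```

If the result contains exactly one category the word is disambiguated; an empty result suggests either a gap in the resource or an idiom.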

Perhaps this thesaurus can be compiled manually by some lexicographers. The challenge is to automatically compile this thesaurus on steroids from online text in the same way that people learn word relations. Until we have such a resource, lexical substitution cannot be moved forward to a usable state.