What is the maximum likelihood estimate in statistics? Let's use an example from natural language processing.
Let's say that we want to estimate the probability of every word in the English vocabulary, that is, the probability that a randomly chosen word happens to be a particular word. There are many uses for such an estimate, called a unigram probability, such as predicting the next word a user will type (in order to type it for the user). Morse code gives the most frequent letters (the letters with the highest probability of being used) the shortest codes in order to make sending messages faster. Something similar can be done for words in order to speed up the searching of words and to compress text files.
How can we estimate such a probability? What is usually done is that a large amount of text is collected, called a corpus, and the words in the text are counted. By dividing the number of times each word occurs by the total number of words in the corpus we will have an estimate of the probability for each word. This assumes there is a large number of words and that the words are used naturally (not a single word repeated over and over for example).
So P(w) = F(w)/F(*),
where P(w) is the probability of the word "w", F(w) is the frequency, or number of times, that the word "w" is found in the corpus, and F(*) is the total number of words in the corpus.
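To make this concrete, here is a minimal sketch in Python 3; the six-word corpus is just a stand-in for a real corpus of millions of words:

from collections import Counter

# F(w) for every word in the corpus, and F(*), the total number of words.
corpus = "the boy played with the girl".split()
frequencies = Counter(corpus)
total = len(corpus)

def p_mle(word):
    # P(w) = F(w)/F(*); a word not in the corpus gets probability 0.
    return frequencies[word] / total

print(p_mle("the"))   # 2/6 ≈ 0.333
print(p_mle("dog"))   # 0.0, which is the problem discussed below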
What is the problem with this estimation? The problem is that any word which is not found in the corpus is assumed to have a probability of zero, that is, to not exist. This is not realistic, as even a very large corpus would not contain every possible word, and new words are invented all the time. What we need to do is give a little less probability to the words we have in the corpus in order to share it with the words which are not in the corpus. This is called smoothing.
As it is, the way we are estimating the probability gives all the probability to the words in the corpus, which is the maximum probability you can give them. For this reason, this method is called the maximum likelihood estimate.
Now how can we smooth our probabilities to give the unknown words a little of the probability too? A simple way is to assume that each unknown word has an equal probability which is less than the probabilities of the words in the corpus. We can do this easily by assuming that each unknown word has a frequency of 1 and that each known word has a frequency of one more than its actual frequency. By increasing the number of times each word occurs in the corpus by 1, including the words which aren't there, our probabilities change such that the unknown words all have an equal probability which is less than any known word's probability. In effect, no word has a probability of zero, since every word now looks like it occurs at least once.
We might be tempted to say that the new probability is "(F(w) + 1)/F(*)", but this would be wrong because when you add together the probabilities of each word the total should come to 1, which it will not using this equation. As it was before, if you add together all the probabilities in the form of "F(w)/F(*)", the total would come to 1. For example if our corpus consisted of just one sentence "the boy played with the girl", then the probabilities of each word would be 2/6 for "the", 1/6 for "boy", 1/6 for "played", 1/6 for "with" and 1/6 for "girl". 2/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1. If we use the wrong formula on the other hand, the probabilities would be 3/6 for "the", 2/6 for "boy", 2/6 for "played", 2/6 for "with" and 2/6 for "girl". 3/6 + 2/6 + 2/6 + 2/6 + 2/6 = 11/6 ≈ 1.8.
What we need to do is to add 1 to the denominator of the probability fraction for every different word we can find the probability of. What this means is that we need to know how many unknown words together with the known words there can be. This is called the vocabulary size and assuming that you know it you just need to add it to the denominator of the probability fraction like this:
P(w) = (F(w) + 1)/(F(*) + V),
where V is the vocabulary size or the total number of words including unknown ones.
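Here is the same sketch with add-one smoothing; the vocabulary size of 10 is an arbitrary assumption for illustration:

from collections import Counter

corpus = "the boy played with the girl".split()
frequencies = Counter(corpus)
total = len(corpus)   # F(*)
V = 10                # assumed vocabulary size: known plus unknown words

def p_laplace(word):
    # P(w) = (F(w) + 1)/(F(*) + V); unknown words get a small non-zero probability.
    return (frequencies[word] + 1) / (total + V)

print(p_laplace("the"))   # (2 + 1)/(6 + 10) = 0.1875
print(p_laplace("dog"))   # (0 + 1)/(6 + 10) = 0.0625, no longer zero

Summed over all V words in the vocabulary, these probabilities again come to 1.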
This method is called Laplace's law, the rule of succession, or add-one smoothing. It's not a very realistic estimate either, however, since the unknown words will not really have equal probabilities, and will not necessarily be less probable than the rarest words in the corpus. Also, giving the unknown words a frequency of 1 might look small, but considering that it is just one less than the smallest frequency in the corpus, it might be too large an estimate, especially since we might be including words in our vocabulary size which just don't exist (maybe no one uses them anymore).
There are other smoothing methods which can be used, such as using synonyms of unknown words. By using the probability of synonyms which are found in the corpus you can find an estimation of the unknown word's probability.
Let's see another example of how the maximum likelihood estimate and Laplace's law can be used. Let's say that we want to do something more interesting than finding the probabilities of unigrams (single words) and instead want to find the probabilities of bigrams, that is, pairs of words. The nice thing about bigrams is that you can use a word to predict what the next word in a sentence might be. This would be useful for making text messaging faster by having your cell phone predict and write the next word for you. Using more than just the previous word to predict the next word would make for more accurate guesses, but let's just consider bigrams rather than full n-grams.
The probability we want to find is that of the user writing a given word provided that the previous word he/she wrote is known. So, for example, we want to know the probability of the user writing "cream" after having written the word "ice". We can then do this for every word and sort the words by their probability in order to present the user with a list of suggested next words. Using the maximum likelihood estimate, the probability is found by taking a corpus, counting the number of times the phrase "ice cream" occurs, and dividing that by the number of times the word "ice" occurs. That estimates the probability of the word "ice" being followed by the word "cream".
P(w2 | w1) = F(w1 w2)/F(w1),
where w1 is the first word, w2 is the word that follows it, F(w1 w2) is the number of times the pair occurs in the corpus, and P(w2 | w1) is the probability of w2 being written right after w1.
Using Laplace's law we again add 1 to the numerator, and to the denominator we add the number of different words that can follow the first word. If we assume that every possible word can follow every possible other word (including the same word again), which isn't realistic at all but might do, that number is simply the vocabulary size V. (The total number of possible word pairs, on the other hand, would be the vocabulary size squared, that is, the vocabulary size multiplied by itself; this is what we would add to the denominator if we were estimating the probability of the pair itself out of all the pairs in the corpus.)
P(w2 | w1) = (F(w1 w2) + 1)/(F(w1) + V)
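A sketch of this in Python 3, reusing the toy corpus and the assumed vocabulary size from before:

from collections import Counter

corpus = "the boy played with the girl".split()
V = 10   # assumed vocabulary size

bigram_freq = Counter(zip(corpus, corpus[1:]))   # F(w1 w2) for every adjacent pair
unigram_freq = Counter(corpus)                   # F(w1)

def p_next(w1, w2):
    # P(w2 | w1) = (F(w1 w2) + 1)/(F(w1) + V)
    return (bigram_freq[(w1, w2)] + 1) / (unigram_freq[w1] + V)

print(p_next("the", "boy"))   # (1 + 1)/(2 + 10) ≈ 0.167
print(p_next("the", "dog"))   # (0 + 1)/(2 + 10) ≈ 0.083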
Saturday, January 12, 2013
Summary of research paper "Discovering Word Senses from Text" by Patrick Pantel and Dekang Lin
This is a summary of the research paper http://webdocs.cs.ualberta.ca/~lindek/papers/kdd02.pdf. This paper describes a way to group together similar words according to the different meanings of each word, using only the contexts of the words. You should also check Pantel's PhD thesis on the algorithm used in this paper at http://www.patrickpantel.com/download/papers/2003/cbc.pdf. You might also be interested in this Java implementation of the algorithm at https://github.com/fozziethebeat/S-Space/blob/master/src/main/java/edu/ucla/sspace/clustering/ClusteringByCommittee.java.
In order to judge the similarity between two words, the grammatical relations between those words and other words in a corpus are used. For example, US state names can be recognized as similar because they all share the following contexts:
___ appellate court
campaign in ___
___ capital
governor of ___
___ driver's license
illegal in ___
primary in ___
___'s sales tax
senator for ___
So, for example, by counting the number of times the verb "campaign" is grammatically related to each word in a corpus, that count can be used to find words which are similar because they have similar counts. The grammatically related word is called a feature of the word. By finding all the different features of each word in a corpus, we can give each word a list of its features together with the number of times each feature was encountered as a context of the word. For example, we can find the number of times the word "campaign" appears as a feature of "California", as well as the words "capital", "governor", etc. All these features grouped together are called the feature vector of the word.
However, in this paper, it is not the raw counts, or frequencies, that are associated with the features. Using raw counts to judge the importance of a feature is misleading, as the feature might just be a common feature in general and not useful for finding similar words. So instead the pointwise mutual information is used, which compares the probability of the two words co-occurring with the probability that they do so by coincidence (independently). In this paper it is calculated as:
mi(w,c) = log(p(w,c) / (p(w) × p(c))),
where "w" is the word, "c" is a context of "w", "p(w,c)" is the probability of finding "c" as a context of "w", "p(w)" is the probability of finding the word "w" in some context and "p(c)" is the probability of finding "c" as a context of some word. These probabilities are calculated as follows:
p(w,c) = f(w,c) / N
p(w) = sum(f(w,c) for c in contexts) / N
p(c) = sum(f(w,c) for w in words) / N
N = sum(sum(f(w,c) for w in words) for c in contexts)
where "f(w,c)" is the number of times "c" is a context of "w" and "N" is the total number of word-context combinations.
However the pointwise mutual information gives bigger values to rare words and contexts so it is made fair by being multiplied by a discounting factor:
df(w,c) = f(w,c) / (f(w,c)+1) × min(f(w), f(c)) / (min(f(w), f(c)) + 1),
where
f(w) = sum(f(w,c) for c in contexts)
f(c) = sum(f(w,c) for w in words)
In order to measure the similarity between two words, the cosine coefficient is used:
sim(a,b) = sum(mi(a,c) × mi(b,c) for c in contexts) / sqrt(sum(mi(a,c)^2 for c in contexts) × sum(mi(b,c)^2 for c in contexts))
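Here is a rough sketch of these three formulas in Python 3; the words, contexts and counts in the table f below are invented for illustration, and the feature values used by sim are the discounted pointwise mutual information values:

import math

# Made-up counts f(w,c): how many times context c occurs with word w.
f = {
    ("wine", ("object of", "drink")): 20,
    ("wine", ("object of", "pour")): 10,
    ("beer", ("object of", "drink")): 15,
    ("beer", ("object of", "pour")): 5,
}
N = sum(f.values())
contexts = {c for (w, c) in f}

def f_word(w):
    return sum(v for (w2, c), v in f.items() if w2 == w)

def f_context(c):
    return sum(v for (w, c2), v in f.items() if c2 == c)

def mi(w, c):
    # Pointwise mutual information between a word and a context.
    p_wc = f.get((w, c), 0) / N
    if p_wc == 0:
        return 0.0
    return math.log(p_wc / ((f_word(w) / N) * (f_context(c) / N)))

def df(w, c):
    # Discounting factor which reduces the mi of rare words and contexts.
    fwc = f.get((w, c), 0)
    m = min(f_word(w), f_context(c))
    return (fwc / (fwc + 1)) * (m / (m + 1))

def sim(a, b):
    # Cosine coefficient between the discounted mi vectors of the two words.
    va = [mi(a, c) * df(a, c) for c in contexts]
    vb = [mi(b, c) * df(b, c) for c in contexts]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0

print(sim("wine", "beer"))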
Words with different meanings, or senses, are called polysemous words. In order to cluster the words together according to their senses, using a different cluster for each sense, three phases are used.
In the first phase, the top 10 similar words for each word were found. In order to speed up the search, the feature vectors were indexed by their features, such that you can find all the feature vectors that have a given feature. The index is then used to find feature vectors that share features in common with the word we are finding similar words to. In this way you skip all the words which share nothing in common with the word of interest. In order to further speed things up, only some of the word's features are considered. The features which have a small value (pointwise mutual information) are ignored when looking for similar words. So only words which share these high valued features are retrieved from the index. So if we are looking for similar words to a word with the following feature vector:
{
("object of", "carry") : 5.118,
("object of", "bring") : 3.364,
("object of", "look at") : 2.753,
("object of", "have") : 1.353
}
and we only consider features with a value higher than 3.0, then we will use the index to find all the words which have a feature vector with the features
{
("object of", "carry") : 5.118,
("object of", "bring") : 3.364
}
This narrows down the number of words to check the similarity of (using the sim function) and so speeds up the search greatly. The top 10 most similar words are then retrieved. The fact that words which don't share high valued features are ignored did not seem to affect the results.
In the second phase, a clustering algorithm called "clustering by committee" is applied to the words using the similarity lists of the first phase. The clustering algorithm tries to assign the words into small clusters called "committees". Words which don't fit into any committee are called "residues". The point of these small clusters is to obtain centroids from which to cluster all the words, just like the randomly chosen centroids in k-means clustering, except that here the centroids are chosen intelligently. Each centroid originates from a committee of words all of which share a particular word sense, so the committee, in effect, decides whether to allow a word into the cluster or not. Each committee is chosen such that it is composed of words which do not have multiple meanings as used in the corpus. This makes the centroid of its feature vectors tend to only have features which belong to a particular word sense rather than to multiple senses, which leads to purer word sense clusters. For example, when clustering US state names, state names which are also names of cities (such as New York) are not included in the committee; only state names which mean nothing else are included, which avoids clustering city names in with the state name cluster when the full clustering is performed. This is the clustering algorithm:
def clustering_by_committee(words, simlists, committeeSimilarityThreshold, residueSimilarityThreshold):
    # Phase 2 of CBC. The helpers averagelink_cluster, average_pairwise_similarity,
    # compute_centroid and similarity are assumed to be defined elsewhere.
    highest_scoring_clusters = []
    # STEP 1: for every word, cluster its similarity list and keep the best cluster.
    for word in words:
        # Average-link hierarchical agglomerative clustering is used to cluster the
        # word's most similar words; clusters can only be merged if their similarity
        # is above a minimum similarity threshold.
        clusters = averagelink_cluster(simlists[word])
        # Score each cluster by its size times its average pairwise similarity and
        # keep the highest scoring cluster.
        scored = [(len(cluster) * average_pairwise_similarity(cluster), cluster)
                  for cluster in clusters]
        highest_scoring_clusters.append(max(scored, key=lambda sc: sc[0]))
    # STEP 2: sort the clusters in descending order of score.
    highest_scoring_clusters.sort(key=lambda sc: sc[0], reverse=True)
    # STEP 3: greedily accept a cluster as a committee only if its centroid is not
    # too similar to any committee accepted so far.
    committees = []
    for score, cluster in highest_scoring_clusters:
        # The centroid is computed by averaging the frequencies (not the pointwise
        # mutual information) of the features of each word in the cluster; the
        # averaged frequencies are then used to compute the pointwise mutual
        # information of the centroid vector (just as the feature vectors were built).
        centroid = compute_centroid(cluster)
        if all(similarity(centroid, committee) < committeeSimilarityThreshold
               for committee in committees):
            committees.append(cluster)
    # STEP 4: if no new committee was found, stop.
    if not committees:
        return committees
    # STEP 5: a word which is not sufficiently close to any committee is a residue
    # and needs more committees; otherwise the word is "covered" by a committee.
    residues = [word for word in words
                if all(similarity(word, committee) < residueSimilarityThreshold
                       for committee in committees)]
    # STEP 6: recurse on the residues until every word is covered.
    if not residues:
        return committees
    return committees + clustering_by_committee(residues, simlists,
                                                committeeSimilarityThreshold,
                                                residueSimilarityThreshold)
In the third phase, the committees discovered in the second phase are used to actually cluster the words by their different senses. The words are not added to the committees themselves, but rather to clusters, each of which is represented by its own committee. The similarity of a word to a cluster is the similarity between the word's feature vector and the centroid of the cluster's committee (which remains the same regardless of which words are added to the cluster). If hard clustering is used, then words are only added to the most similar cluster; but since soft clustering is used here, the same word can be added to more than one cluster, one for each sense of the word. Here is the algorithm which assigns words to clusters:
def assign_word_to_clusters(word, committeeClusters, minWordSimilarity, maxClusterSimilarity):
    # Phase 3 of CBC (soft clustering). top_200_most_similar, get_most_similar_cluster,
    # similarity and remove_common_features are assumed to be defined elsewhere.
    assigned_clusters = []
    # Find the top 200 clusters whose committee's centroid is most similar to the word.
    similar_clusters = top_200_most_similar(committeeClusters, word)
    while similar_clusters:
        best_cluster = get_most_similar_cluster(similar_clusters, word)
        if similarity(word, best_cluster) < minWordSimilarity:
            break  # there is no cluster left that is similar enough to assign the word to
        # A word cannot be assigned to a cluster whose committee's centroid is too
        # similar to that of a cluster the word was already assigned to.
        if all(similarity(best_cluster, c) < maxClusterSimilarity for c in assigned_clusters):
            best_cluster.add(word)
            assigned_clusters.append(best_cluster)
            # Remove the features from the word's feature vector which are also found in
            # best_cluster's committee's centroid, so the remaining features can match
            # clusters of the word's other senses.
            remove_common_features(word, best_cluster)
        similar_clusters.remove(best_cluster)
    return assigned_clusters
The fact that the features which are also found in the cluster's committee's centroid are removed from the word before it is assigned to another cluster is important, as it allows the word to be assigned to the clusters of its less common senses. For example, the word "plant" would first be assigned to the "factory" sense cluster in a newspaper corpus, because that is the meaning of the word most of the time in such a corpus. After it is assigned to the factory sense cluster, the features which have to do with this sense are removed (or their values are set to 0) from the feature vector of "plant". Without these features, the remaining features belong to a different sense, such as the living plant sense, and so the word can now easily be assigned to the living plant cluster, as these features are no longer overshadowed.
So this system will output a list of clusters to which each sense of every word will belong. In order to evaluate it, the sense clusters are compared to WordNet synonym groups called synsets. This is how the comparison is made:
First, the similarity measure proposed by Lin which measures the similarity between two WordNet synsets is used. This is described in another post here.
sim_synset(s1, s2) = 2 log P(most_specific_common_synset(s1, s2)) / (log P(s1) + log P(s2))
The similarity between a word and a synset is the maximum similarity between the synset and one of the synsets in which the word belongs:
sim_synset_word(s, w) = max(sim_synset(s, t) for t in synsets_of(w))
Let "top_words(c, k)" be the top "k" most similar words in a sense cluster "c" to the committee which the cluster was based on. The similarity between a synset and a sense cluster is the average similarity between the synset and "top_words(c, k)".
sim_synset_cluster(s, c, k) = sum(sim_synset_word(s, w) for w in top_words(c, k)) / k
We can check if a word belongs to a correct sense cluster by checking the similarity of the sense cluster to all the synsets of the word. If the maximum similarity found is greater than a threshold, then it is considered to be a correct sense cluster.
is_correct_sense_cluster(w, c, k, threshold) = max(sim_synset_cluster(s, c, k) for s in synsets_of(w)) >= threshold
"k" was set to 4 and the threshold was varied in the experiments. By checking which synset is most similar to the cluster, we can find the WordNet sense that corresponds to the cluster. If more than one cluster corresponds to the same WordNet sense, then only one of them is considered correct. The precision of the system's assignment of a particular word is the percentage of clusters the word is assigned to that are a correct sense. The precision of the system in general is the average precision over all words. The recall of the system's assignment of a particular word is the percentage of correct clusters the word is assigned to out of all the actual correct senses of the word found in the corpus. Since the corpus was not sense tagged, this is tricky, so the next best thing was to use several clustering methods to extract as many correct sense clusters (as defined above) for each word as possible. The union of all these correct senses is used as the number of different senses each word has in the corpus. The overall recall of the system is the average recall over all words. The precision and recall scores are combined into an F measure:
F = 2 × precision × recall / (precision + recall)
In order to do the evaluation, 144 million words were parsed using Minipar in order to extract the grammatical relations between words. The words used for evaluation were those found in WordNet whose total pointwise mutual information with all their features in the corpus was greater than 250. 13403 words met the criteria, and the average number of features per word was 740.8. As already mentioned, the threshold in "is_correct_sense_cluster" was varied during the experiment, and the F measure was computed for different thresholds with different clustering algorithms. The other clustering algorithms were modified to just give the centroids of the clusters, and every word with a similarity greater than 0.18 to a centroid was then associated with that cluster, in order to allow the same word to be assigned to a different cluster for each sense. The system described here is "CBC", which stands for "Clustering By Committee"; it outperformed all the other clustering methods for all thresholds.
Wednesday, December 26, 2012
Summary of research paper "Automatic Retrieval and Clustering of Similar Words" by Dekang Lin
This is a summary of the research paper http://webdocs.cs.ualberta.ca/~lindek/papers/acl98.pdf. This paper describes a way to measure the similarity between words using their contexts. It also describes a way to evaluate the similarities by comparing with existing thesauruses.
The similarity between two words is calculated by checking how the words are used in a corpus. To check how a word is used, the grammatical relations between the word and the other words in the sentences where it occurs are used. For example, these are the grammatical relations for the word "dog" in the sentence "I have a brown dog":
(dog obj-of have): it is an object of the verb "have"
(dog adj-mod brown): it is modified by the adjective "brown"
(dog det a): it takes the determiner "a"
The way the relations are represented is called "dependency triples". By counting these dependency triples for every word in a corpus, the frequencies of each relation can be used to determine if two words are used in the same way, and by extension if they mean the same thing. This is called the Distributional Hypothesis, which says that words that appear in similar contexts are semantically similar. These dependency triples are collected using a broad coverage dependency parser.
In order to describe how these frequencies are used in calculations, the following notation is used: ||w,r,w'|| is the frequency of the dependency triple (w r w'), that is, of word "w" being related to word "w'" by the relation "r". For example, ||cell, subj-of, absorb|| is the number of times the word "cell" is used as a subject of the word "absorb" in the corpus. Asterisks are used as wild cards, that is, to say that it doesn't matter what fills that position. For example, ||cell, subj-of, *|| is the number of times the word "cell" is used as a subject of some word, ||cell, *, absorb|| is the number of times the word "cell" is somehow related to the word "absorb", ||*, subj-of, absorb|| is the number of times some word is used as a subject of the word "absorb" and ||*, *, *|| is the total number of dependency triples of any kind.
In order to determine how similar two words are based on the frequency of their relations, a formula is used from another paper by the same author, Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity (http://aclweb.org/anthology-new/P/P97/P97-1009.pdf), which says that
similarity(A, B) = log P(common(A, B)) / log P(describe(A, B))
This means that the similarity between two words A and B is the amount of information needed to describe what is in common between both words, divided by the amount of information needed to describe both words. The less information the words have in common, the less similar they are.
The amount of information needed to describe a word is the sum of the information in each of its dependency triples. The amount of information in a dependency triple (w, r, w') is -log(expected probability that a randomly selected dependency triple is (w, r, w')) - (-log(actual probability that a randomly selected dependency triple is (w, r, w'))), that is, log(actual probability / expected probability).
The actual probability of (w, r, w') is ||w, r, w'|| / ||*, *, *||.
The expected probability of (w, r, w') can be estimated using the probability that a randomly selected relation is "r", that a randomly selected word is "w" given that the relation is "r" and that another randomly selected word is "w'" given that the relation is "r". These probabilities are assumed to be independent and are simply multiplied together to find the probability of the whole dependency triple. This probability is P(r)P(w|r)P(w'|r) which equals (||*, r, *|| / ||*, *, *||) × (||w, r, *|| / ||*, r, *||) × (||*, r, w'|| / ||*, r, *||).
So the amount of information needed to describe a dependency triple is I(w,r,w') = log((||w,r,w'|| × ||*,r,*||) / (||w,r,*|| × ||*,r,w'||))
We can now finally define the function which gives the similarity between two words.
Let "T(w)" be a function which when given a word "w" will return the set of pairs "(r, w')" which make I(w,r,w') positive.
The similarity between 2 words, "w1" and "w2" is given by the function sim(w1, w2) = sum(I(w1, r, w) + I(w2, r, w) for (r,w) in (T(w1) ∩ T(w2))) / (sum(I(w1, r, w) for (r,w) in T(w1)) + sum(I(w2, r, w) for (r,w) in T(w2)))
With our function sim(w1, w2) we can now start measuring how similar two words are based on their relations to other words.
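Here is a sketch of I and sim in Python 3; the dependency triple counts below are invented for illustration:

import math

# Made-up counts ||w, r, w'||.
triples = {
    ("cell", "subj-of", "absorb"): 3,
    ("cell", "subj-of", "divide"): 2,
    ("molecule", "subj-of", "absorb"): 2,
    ("water", "subj-of", "flow"): 5,
}

def count(w, r, w2):
    # ||w, r, w'||, with "*" acting as a wild card.
    return sum(v for (a, b, c), v in triples.items()
               if (w == "*" or a == w) and (r == "*" or b == r) and (w2 == "*" or c == w2))

def I(w, r, w2):
    # I(w,r,w') = log((||w,r,w'|| × ||*,r,*||) / (||w,r,*|| × ||*,r,w'||))
    num = count(w, r, w2) * count("*", r, "*")
    den = count(w, r, "*") * count("*", r, w2)
    return math.log(num / den) if num and den else 0.0

def T(w):
    # The set of (r, w') pairs which make I(w, r, w') positive.
    return {(r, w2) for (a, r, w2) in triples if a == w and I(w, r, w2) > 0}

def sim(w1, w2):
    common = T(w1) & T(w2)
    num = sum(I(w1, r, w) + I(w2, r, w) for (r, w) in common)
    den = (sum(I(w1, r, w) for (r, w) in T(w1)) +
           sum(I(w2, r, w) for (r, w) in T(w2)))
    return num / den if den else 0.0

print(sim("cell", "molecule"))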
A 64 million word corpus was parsed using a broad coverage dependency parser which extracted 56.5 million dependency triples. For each noun, verb, adjective and adverb, the top 200 most similar words were found. The resulting thesaurus of similar words consists of these 200 most similar words for each word sorted in descending order of their similarity together with their similarity, for example:
brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04 ...
In order to determine the quality of the similarity measure, its values were compared to existing thesauruses Roget and WordNet in order to see if the similar words determined by the measure are the same as the similar words in the thesauruses. Other similarity measures were also used in order to compare them all together.
These other similarity measures were:
Cosine similarity:
sim_cos(w1, w2) = |T(w1) ∩ T(w2)| / sqrt(|T(w1)| × |T(w2)|)
The Hindle similarity, which is the same as that proposed in the paper NOUN CLASSIFICATION FROM PREDICATE ARGUMENT STRUCTURES, except that it does not use dependency triples which have negative information (the measure only uses dependency triples whose relation is subject-of or object-of):
sim_hindle(w1, w2) = sum(min(I(w1, r, w), I(w2, r, w)) for (r,w) in (T(w1) ∩ T(w2)) if r in {subj-of, obj-of})
and Hindle_r which is the same as Hindle except that all dependency relations are used:
sim_hindle_r(w1, w2) = sum(min(I(w1, r, w), I(w2, r, w)) for (r,w) in (T(w1) ∩ T(w2)))
In order to compare the similarities given by these measures with the thesauruses, the thesauruses needed to have a quantified similarity between their similar words. For this reason, a similarity measure which gives the similarity between two words according to each thesaurus was developed.
For WordNet, the similarity measure uses the super classes of the different senses of words. Every word in WordNet belongs to some category of words called a class, here is an example:
[Figure: a fragment of the WordNet class hierarchy, in which "hill" and "coast" share the superclass "geological formation", with the probability of a randomly selected word belonging to each class.]
Next to each class is the probability that a randomly selected word belongs to that class, estimated by counting the number of words in each class. In order to quantify the similarity between two words in WordNet, the most specific class which is common to both words is used, which in the case of "hill" and "coast" would be "geological formation". By finding this class for every pair of senses (meanings) of the two words, converting the probabilities of the classes into information, and taking the maximum over the sense pairs, the similarity of the two words is computed. Formally, this is as follows:
sim_WN(w1, w2) = max(2 log(P(most_specific_common_class(s1, s2))) / (log(P(s1)) + log(P(s2))) for s1 in senses(w1) for s2 in senses(w2))
For Roget, the Roget categories of the words are used. The more words the two categories have in common, the more similar the words are. This is measured as follows:
sim_Roget(w1, w2) = 2|RogetCategory(w1) ∩ RogetCategory(w2)| / (|RogetCategory(w1)| + |RogetCategory(w2)|)
Once these measures were defined, the thesauruses were transformed into a form similar to the generated ones. Here is an example:
brief (noun) according to "sim" thesaurus: affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04 brief (noun) according to "sim_WN" thesaurus: outline 0.96, instrument 0.84, summary 0.84, affidavit 0.80, deposition 0.80, law 0.77, survey 0.74, sketch 0.74, resume 0.74, argument 0.74 brief (noun) according to "sim_Roget" thesaurus: recital 0.77, saga 0.77, autobiography 0.77, anecdote 0.77, novel 0.77, novelist 0.77, tradition 0.70, historian 0.70, tale 0.64
In order to determine the similarity between each set of similar words, the following formula was used:
entries_similarity(entries1, entries2) = sum(entry1.similarity × entry2.similarity for entry1 in entries1 for entry2 in entries2 if entry1.word == entry2.word) / sqrt(sum(entry1.similarity^2 for entry1 in entries1) × sum(entry2.similarity^2 for entry2 in entries2))
For example, the similarity between the entries for "brief" according to the "sim" thesaurus and the "sim_WN" thesaurus is:
sum(0.13 × 0.80, 0.05 × 0.80) / sqrt(sum(0.13^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.04^2, 0.04^2, 0.04^2) × sum(0.96^2, 0.84^2, 0.84^2, 0.80^2, 0.80^2, 0.77^2, 0.74^2, 0.74^2, 0.74^2, 0.74^2)) = 0.297215762
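This computation can be reproduced with a short Python 3 sketch; the entry lists are the "brief" entries shown above:

import math

sim_entries = {"affidavit": 0.13, "petition": 0.05, "memorandum": 0.05, "motion": 0.05,
               "lawsuit": 0.05, "deposition": 0.05, "slight": 0.05, "prospectus": 0.04,
               "document": 0.04, "paper": 0.04}
wn_entries = {"outline": 0.96, "instrument": 0.84, "summary": 0.84, "affidavit": 0.80,
              "deposition": 0.80, "law": 0.77, "survey": 0.74, "sketch": 0.74,
              "resume": 0.74, "argument": 0.74}

def entries_similarity(e1, e2):
    # Cosine between the two entry lists, matching entries by word; only
    # "affidavit" and "deposition" appear in both lists here.
    dot = sum(s * e2[w] for w, s in e1.items() if w in e2)
    n1 = math.sqrt(sum(s * s for s in e1.values()))
    n2 = math.sqrt(sum(s * s for s in e2.values()))
    return dot / (n1 * n2)

print(entries_similarity(sim_entries, wn_entries))   # ≈ 0.2972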
Only words which occurred more than 100 times and which were found in all of the thesauruses were used in the evaluation. The average similarity between corresponding entries in the thesauruses was then calculated.
When compared to WordNet entries:
sim 0.212199
Hindle 0.204179
cosine 0.199402
Roget 0.178397
Hindle_r 0.164716
When compared to Roget entries:
WordNet 0.178397
sim 0.149045
Hindle 0.14663
cosine 0.135697
Hindle_r 0.115489
It can be seen that the "sim" thesuarus is always better than the other generated thesauruses, but WordNet is closer to Roget than the "sim" thesaurus.
Wednesday, December 12, 2012
An equation for enumerating integers
In my last post I described Cantor's way of proving that some infinite lists are bigger than others. I said that this is done by showing that some infinite lists, which are called countably infinite, can be paired up with the natural numbers from 0 onward. I shall now describe an equation that gives the integer paired up with a particular natural number.
The integers are (..., -2, -1, 0, 1, 2, ...)
We can pair up integers with natural numbers by pairing all the even natural numbers with positive integers and all the odd natural numbers with negative integers, like this:
0:0, 1:-1, 2:1, 3:-2, 4:2, ...
An equation which maps the natural numbers to the integers in this way is as follows:
y = -(x+1)/2*(x mod 2) + x/2*((x+1) mod 2)
"mod" is an operator called a modulo operator which gives the resulting remainder for a number divided by another. For example 4 mod 2 gives 0 and 5 mod 2 gives 1.
"x mod 2" will give 1 when x is odd and 0 when x is even.
"(x+1) mod 2" will give 1 when x is even and 0 when x is odd.
These are used in order to multiply them with terms that will deal with either the odd natural numbers or the even natural numbers.
"-(x+1)/2" will work when x is odd and will return -1 for 1, -2 for 3, -3 for 4, etc.
"x/2" will work when x is even and will return 1 for 2, 2 for 4, 3 for 6, etc.
These are used to pair up the negative integers with the odd natural numbers and the positive integers with the even natural numbers.
"-(x+1)/2*(x mod 2)" will give the negative integers for odd natural numbers and 0 for even natural numbers, as "-(x+1)/2" will be multiplied by 1 for odd naturals and by 0 for even naturals.
"x/2*((x+1) mod 2)" will give the positive integers for even natural numbers and 0 for odd natural numbers, as "x/2" will be multiplied by 1 for even naturals and by 0 for odd naturals.
By adding them together, since both terms will alternate between 0 and their signed integers, the sum will be an alternating sequence of positive and negative numbers for every natural number x. When x is 0, the result of the sum will also be 0 as both terms will be multiplied by 0.
The equation in Python 3 is as follows:
for x in range(20): print(x, ":", -(x+1)//2*(x % 2) + x//2*((x+1) % 2))
And here is the output:
0 : 0
1 : -1
2 : 1
3 : -2
4 : 2
5 : -3
6 : 3
7 : -4
8 : 4
9 : -5
10 : 5
11 : -6
12 : 6
13 : -7
14 : 7
15 : -8
16 : 8
17 : -9
18 : 9
19 : -10
Friday, November 16, 2012
Countable and uncountable infinities
I love this topic of mathematics. Can infinite lists be compared in size? Let's start from finite lists.
How can you tell that the list [a,b,c] is smaller than the list [w,x,y,z]? You can count them both and then compare the sizes, or you can check if every object in the first list can be paired with a different object in the second list. If we try, we'd end up with something like this:
[a:w, b:x, c:y, ?:z]
Since not every object in the first list can be paired up with the objects in the second, we know that the first is smaller. This is related to the "pigeon hole principle", which says that if you try to associate every object in a list with every object in a smaller list, then you'd have to reuse some objects in the smaller list.
So, since we cannot count the number of objects in infinite lists, we can only try to see what happens if we try to pair up the objects in the lists.
Let's start with the simplest case. Is the list of whole numbers starting from 0 [0,1,2,...] larger than the list of whole numbers starting from 1 [1,2,3,...]? Now remember that both lists are infinite in size, that is, there is no last number in either list, so you cannot reach the end of either list. So if we pair up the numbers by first, second, etc., we get this pairing:
[0:1, 1:2, 2:3, 3:4, ...]
So it is clear that every object can be paired up in both lists and so both lists are of equal size. You might not think that this makes sense, because clearly the first list contains all of the numbers in the second list. That's true, but it is also true that no number will ever be left out if we try to pair them up, so determining which list is bigger by checking which list contains which is not useful for infinite lists.
A more rigorous proof that both lists are of equal size is to write the equation which gives the number in the second list that is paired up with a particular number in the first list. In this case, if we call the number in the first list x and the number in the second list y, so that our pairing list would be [(x:y), (x:y), (x:y), ...], then y = x + 1.
What about the list of even numbers [0,2,4,...] and the list of whole numbers starting from 0 [0,1,2,...]? Well we can pair up the even numbers with their halves in the second list, like this:
[0:0, 1:2, 2:4, 3:6, ...]
Will any number be left out? Doesn't look like it will, given that there is no last number in either list.
In this case, the relationship between each pair is y = 2x.
What about the list of positive and negative whole numbers [...,-2,-1,0,1,2,...] and the list of whole numbers starting from 0 [0,1,2,...]? We can pair up the zeros together and then pair up the positive numbers with the even numbers and the negative numbers with the odd numbers, like this:
[0:0, 1:-1, 2:1, 3:-2, 4:2, ...]
Can you think of an equation which gives you which number in the second list is mapped with a particular number in the first? It requires the use of the modulo operator which gives the remainder that would result when two numbers are divided together. For example 3 mod 2 gives 1 and 4 mod 2 gives 0.
What about fractions [1/1, 1/2, 1/3, ..., 2/1, 2/2, 2/3, ..., ...] with whole numbers starting from 0 [0,1,2,...]? Well, the way we ordered the fractions makes it look impossible, but if we look at the fractions as a grid like this:
1/1 1/2 1/3 1/4 ...
2/1 2/2 2/3 2/4 ...
3/1 3/2 3/3 3/4 ...
4/1 4/2 4/3 4/4 ...
...
then we see that what we were doing was looking at the fractions row by row. But we can instead go along the diagonals. We start from the corner 1/1, then we take 2/1 and 1/2, then 3/1, 2/2 and 1/3, and we keep going from one diagonal to the next. We end up with this pairing:
[0:1/1, 1:2/1, 2:1/2, 3:3/1, 4:2/2, ...]
Can you think of an equation to determine which fraction is paired up with a particular whole number? It requires the use of "triangular numbers" in order to know which whole number starts a new diagonal in the grid.
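As a sketch in Python 3: the while loop finds the diagonal that the whole number n falls on by comparing n with the triangular numbers, and the position along that diagonal then gives the fraction:

def nth_fraction(n):
    # Diagonals 0..d-1 hold d*(d+1)/2 fractions in total (the d-th triangular
    # number), so find the diagonal d that n falls on.
    d = 0
    while (d + 1) * (d + 2) // 2 <= n:
        d += 1
    k = n - d * (d + 1) // 2          # position along diagonal d
    return str(d + 1 - k) + "/" + str(1 + k)

for n in range(6):
    print(n, ":", nth_fraction(n))   # 0 : 1/1, 1 : 2/1, 2 : 1/2, 3 : 3/1, 4 : 2/2, 5 : 1/3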
What about the list of all triples of numbers [(1,1,1), (1,1,2), ..., (1,2,1), (1,2,2), ..., (1,3,1), (1,3,2), ..., (2,1,1), (2,1,2), ...] and the list of all whole numbers starting from 0 [0,1,2,...]? Again, this is just a matter of changing the order just like the fractions. This time instead of a grid we have a cube of triples, which cannot be shown in text. So instead we need to order them in diagonal squares around the corner of the cube.
Can you think of an equation to determine which triple, quadruple or any size of number group will be paired with a particular whole number?
You may be thinking that any infinite list is of equal size to the whole numbers starting from zero. But what about the list of all numbers, including irrational numbers, that is, numbers whose digits after the decimal point never end and never repeat, such as pi?
Let's assume that there is such a pairing. For example:
[0:0.000000000..., 1:0.100000000..., 2:1.002040000..., ...]
Can you think of a number in the list of all numbers which will not be a part of the pairings? Since the list of pairings is infinite in size, and so is the number of digits in each number in the list, we have an infinite grid of digits, like this:
0:0.000000000...
1:0.100000000...
2:1.002040000...
...
Since we have this infinite grid of digits, we can create the following number: take the diagonal of the grid going from the top left corner downwards. If the next digit in the diagonal is a 0, the next digit in our number must be a 9, but if it is any other digit then the next digit in our number must be a 0. In this way our new number will be different from every number in the grid. Here's an example:
0:0.000000000...
1:0.100000000...
2:1.002040000...
...
First digit in the diagonal is a 0, so the first digit in our new number must be a 9. Second digit in the diagonal is a 1, so the second digit in our new number must be a 0, third digit must be a 9, etc. So our new number in this case will look like this:
9.09...
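The construction is mechanical enough to sketch in Python 3; the three rows are the ones from the example grid above:

rows = ["0.000000000", "0.100000000", "1.002040000"]

digits = []
for i, row in enumerate(rows):
    d = row.replace(".", "")[i]               # the i-th digit along the diagonal
    digits.append("9" if d == "0" else "0")   # differ from that digit

print(digits[0] + "." + "".join(digits[1:]))  # 9.09, as above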
This number will not be in the grid. Even if we included it, we can find another number which isn't in the grid using the same method. No matter how many new numbers we add to the grid, we can always find another number which isn't in the grid.
So there is no way we can pair up all the numbers with all the whole numbers starting from 0, or with all the whole numbers (positive and negative) or all the fractions. You can always find numbers which weren't paired up.
Infinite lists which can be paired up with all the whole numbers are called "countably infinite". These lists can be ordered in such a way that no object in the list will be left out if you check every object from the beginning. Infinite lists which cannot be paired up in this way are called "uncountably infinite". These lists cannot be ordered in such a way that you can reach any object by moving forward from the start.
This is a very interesting concept which can be extended to ask whether every number can be generated by a computer program, the way pi can. The number of computer programs is countable but the number of real numbers isn't, so no.
Want to learn more? Look up Georg Cantor, who came up with these concepts.
How can you tell that the list [a,b,c] is smaller than the list [w,x,y,z]? You can count them both and then compare the sizes, or you can check if every object in the first list can be compared with another different object in the second list. If we try, we'd end up with something like this:
[a:w, b:x, c:y, ?:z]
Since not every object in the first list can be paired up with the objects in the second, we know that the first is smaller. This is related to the "pigeon hole principle", which says that if you try to associate every object in a list with every object in a smaller list, then you'd have to reuse some objects in the smaller list.
So, since we cannot count the number of objects in infinite lists, we can only try to see what happens if we try to pair up the objects in the lists.
Let's start with the simplest case. Is the list of whole numbers starting from 0 [0,1,2,...] larger than the list of whole numbers starting from 1 [1,2,3,...]? Now remember than both lists are infinite in size, that is, there is no last number in either list, so you cannot reach the end in either list. So if we pair up the numbers by first, second, etc, we get this pairing:
[0:1, 1:2, 2:3, 3:4, ...]
So it is clear that every object can be paired up in both lists and so both lists are of equal size. You might not think that this makes sense because clearly the first list contains all of the numbers in the second list. That's true, but it is also true that no number will every be left out if we try to pair them up, so determining which list if bigger by checking which list contains which is not useful in infinite lists.
A more accurate proof for showing that both lists are of equal size is by writing the equation which gives the number in the second list that is paired up with a particular number in the first list. In this case, if we call the number in the first list y and the number in the second list x, such that our pairing list would be [(x:y), (x:y), (x:y), ...], then y = x+1.
What about the list of even numbers [0,2,4,...] and the list of whole numbers starting from 0 [0,1,2,...]? Well we can pair up the even numbers with their halves in the second list, like this:
[0:0, 1:2, 2:4, 3:6, ...]
Will any number be left out? Doesn't look like it will, given that there is no last number in either list.
In this case, the relationship between each pair is y = 2x.
What about the list of positive and negative whole numbers [...,-2,-1,0,1,2,...] and the list of whole numbers starting from 0 [0,1,2,...]? We can pair up the zeros together and then pair up the positive numbers with the even numbers and the negative numbers with the odd numbers, like this:
[0:0, 1:-1, 2:1, 3:-2, 4:2, ...]
Can you think of an equation which gives you the number in the second list that is paired with a particular number in the first? It requires the use of the modulo operator, which gives the remainder that results when one number is divided by another. For example 3 mod 2 gives 1 and 4 mod 2 gives 0.
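Here is a minimal sketch of such a pairing in Python (the function name is just for illustration):

```python
# Whole number x maps to integer y: even x (0, 2, 4, ...) go to
# 0, 1, 2, ... and odd x (1, 3, 5, ...) go to -1, -2, -3, ...,
# so x mod 2 decides the sign.
def whole_to_integer(x):
    if x % 2 == 0:
        return x // 2
    return -(x + 1) // 2

print([(x, whole_to_integer(x)) for x in range(5)])
# [(0, 0), (1, -1), (2, 1), (3, -2), (4, 2)]
```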
What about fractions [1/1, 1/2, 1/3, ..., 2/1, 2/2, 2/3, ..., ...] with whole numbers starting from 0 [0,1,2,...]? Well, the way we ordered the fractions makes it look impossible, but if we look at the fractions as a grid like this:
1/1 1/2 1/3 1/4 ...
2/1 2/2 2/3 2/4 ...
3/1 3/2 3/3 3/4 ...
4/1 4/2 4/3 4/4 ...
...
Then we see that what we were doing was looking at the fractions row by row. But we can instead look at the diagonals. We start from the corner 1/1, then we take 2/1 and 1/2, then 3/1, 2/2 and 1/3, and we keep on going from one diagonal to the next. We end up with this pairing:
[0:1/1, 1:2/1, 2:1/2, 3:3/1, 4:2/2, ...]
Can you think of an equation to determine which fraction is paired up with a particular whole number? It requires the use of "triangular numbers" in order to know which whole number starts a new diagonal in the grid.
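Here is a sketch of how the triangular numbers come in, following the diagonal walk above (the names are illustrative):

```python
# Diagonal d (counting from 0) starts at the triangular number d*(d+1)/2
# and holds the fractions whose numerator + denominator = d + 2.
def whole_to_fraction(n):
    d = 0
    while (d + 1) * (d + 2) // 2 <= n:  # find the diagonal containing n
        d += 1
    offset = n - d * (d + 1) // 2       # position along that diagonal
    return (d + 1 - offset, offset + 1) # numerator counts down, denominator up

print([whole_to_fraction(n) for n in range(5)])
# [(1, 1), (2, 1), (1, 2), (3, 1), (2, 2)]
```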
What about the list of all triples of numbers [(1,1,1), (1,1,2), ..., (1,2,1), (1,2,2), ..., (1,3,1), (1,3,2), ..., (2,1,1), (2,1,2), ...] and the list of all whole numbers starting from 0 [0,1,2,...]? Again, this is just a matter of changing the order just like the fractions. This time instead of a grid we have a cube of triples, which cannot be shown in text. So instead we need to order them in diagonal squares around the corner of the cube.
Can you think of an equation to determine which triple, quadruple or any size of number group will be paired with a particular whole number?
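One way to answer this, sketched below, is to nest a pairing of two numbers: pair the first two components, then pair the result with the third. Note that this uses the well-known Cantor pairing function rather than the cube ordering described above, so the resulting order is different.

```python
# Cantor's pairing function maps a pair of whole numbers to a single
# whole number; nesting it handles triples, quadruples and so on.
def pair(a, b):
    return (a + b) * (a + b + 1) // 2 + b

def triple_to_whole(a, b, c):
    return pair(pair(a, b), c)  # pair the pair with the third number
```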
You may be thinking that any infinite list is of equal size to the whole numbers starting from zero. But what about the list of all numbers, including irrational numbers, that is, numbers whose digits after the decimal point go on forever without repeating, such as pi?
Let's assume that there is such a pairing. For example:
[0:0.000000000..., 1:0.100000000..., 2:1.002040000..., ...]
Can you think of a number in the list of all numbers which will not be a part of the pairings? Since the list of pairings is infinite in size and the number of digits in each number in the list is also infinite, we have an infinite grid of digits, like this:
0:0.000000000...
1:0.100000000...
2:1.002040000...
...
Since we have this infinite grid of digits, we can create the following number: Take the diagonal of the grid going from the top left corner downwards. If the next digit in the diagonal is a 0, the next digit in our number must be a 9, but if it is any other digit then the next digit in our number must be a 0. In this way our new number will be different from every number in the grid. Here's an example:
0:0.000000000...
1:0.100000000...
2:1.002040000...
...
Keeping only the diagonal digits (# marks the digits we ignore):
0:0.#########...
1:#.1########...
2:#.#0#######...
...
First digit in the diagonal is a 0, so the first digit in our new number must be a 9. Second digit in the diagonal is a 1, so the second digit in our new number must be a 0, third digit must be a 9, etc. So our new number in this case will look like this:
9.09...
This number will not be in the grid. Even if we included it, we can find another number which isn't in the grid using the same method. No matter how many new numbers we add to the grid, we can always find another number which isn't in the grid.
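Here is a finite sketch of this construction (assuming the numbers are given as digit strings with the decimal point stripped; the names are illustrative):

```python
# Build a number that differs from the i-th number in the grid at its
# i-th digit: a 0 on the diagonal becomes a 9, anything else becomes a 0.
def diagonal_number(rows):
    return ''.join('9' if row[i] == '0' else '0'
                   for i, row in enumerate(rows))

rows = ['0000000000',   # 0.000000000... with the point stripped
        '0100000000',   # 0.100000000...
        '1002040000']   # 1.002040000...
print(diagonal_number(rows))  # '909', i.e. the 9.09... above
```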
So there is no way we can pair up all the numbers with all the whole numbers starting from 0, or with all the whole numbers (positive and negative) or all the fractions. You can always find numbers which weren't paired up.
Infinite lists which can be paired up with all the whole numbers are called "countably infinite". These lists have a way to be ordered such that no object in the list will be left out if you check every object from the beginning. Infinite lists which cannot be paired up in this way are called "uncountably infinite". These lists cannot be ordered in such a way that you can reach any given object by moving forward from the start.
This is a very interesting concept which can be extended to ask whether every number can be generated using a computer program, like pi can. The number of computer programs is countable but the number of numbers isn't, so the answer is no: there must be numbers which no program can generate.
Want to learn more? Look up Georg Cantor, who came up with these concepts.
Monday, October 8, 2012
Summary of research paper "Putting it Simply: a Context-Aware Approach to Lexical Simplification" by Or Biran, Samuel Brody and Noémie Elhadad
This is a summary of the research paper www.aclweb.org/anthology/P11-2087. The paper describes a way to lexically simplify text using unsupervised methods. Lexical simplification is the process of replacing difficult words with simpler words which mean approximately the same thing in context. The method described in this paper does not require a parallel corpus; it only requires collections of complex and simplified documents in order to compare the word counts in each.
The first thing that was done was to download the English Wikipedia and Simple English Wikipedia articles (3,266,245 articles and 60,100 articles respectively), after which the following preprocessing tasks were performed on them:
- Comments were removed
- HTML tags were removed
- Links were removed
- Tables and figures were removed (with associated text)
- Text was tokenized
- All words were converted to lower case
- Sentences were split
The last 3 steps were carried out using the Stanford NLP Package.
The system consists of first learning simplification rules (finding pairs of difficult/simpler words) and then applying them by finding which is the best word to replace a difficult word. We shall now look at each of these stages.
To learn the simplification rules, a 10-token window (10 tokens on each side) which crosses sentence boundaries (confirmed through private communication with the authors) was used to construct a context vector of each token (excluding stop words, numbers and punctuation) in the English Wikipedia corpus. All numbers were represented with an artificial token such as "<number>" which allows you to count all instances of any number next to a token. Only frequencies greater than 2 were kept in the context vector.
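A minimal sketch of this step, assuming the corpus is one flat list of tokens (so the window naturally crosses sentence boundaries); the names and the stop-word set are illustrative, not the authors' code:

```python
from collections import Counter

# Build a context vector (a bag of neighbouring tokens) for every
# content token, using a 10-token window on each side.
def build_context_vectors(tokens, stop_words, window=10):
    vectors = {}
    for i, token in enumerate(tokens):
        if token in stop_words or not token.isalpha():
            continue  # skip stop words, numbers and punctuation
        vector = vectors.setdefault(token, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            neighbour = tokens[j]
            if neighbour.isdigit():
                neighbour = '<number>'  # collapse all numbers into one token
            vector[neighbour] += 1
    # keep only context frequencies greater than 2
    return {token: Counter({w: f for w, f in vector.items() if f > 2})
            for token, vector in vectors.items()}
```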
In order to create a list of possible word substitutions, first a Cartesian product of all words in the English Wikipedia corpus was created which was then filtered in the following ways in order to keep only word pairs which are simplification replacements:
- All word pairs which are just morphological variants of each other were removed, by checking their lemmas with MorphAdorner.
- In order to catch morphological variants which are not handled by MorphAdorner, all word pairs where one word is a prefix of the other and the remainder is one of the suffixes "s", "es", "ed", "ly", "er" or "ing" were also removed.
- WordNet was used to remove all word pairs where the second word is not a synonym or hypernym of the first.
- In order to keep only word pairs where the second word is simpler than the first, lexical complexity is measured as the frequency of the word in the English Wikipedia corpus divided by its frequency in the Simple English Wikipedia corpus, with the ratio multiplied by the length of the word (as sketched below). The reasoning is that the more often a word appears in the Simple English Wikipedia compared to the English Wikipedia, and the shorter the word is, the simpler it is, and so the smaller its complexity measure. All word pairs where the second word is harder than the first were removed.
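A sketch of that complexity measure, assuming freq_en and freq_simple are word-count dictionaries for the two corpora (the names are illustrative):

```python
# complexity(w) = (frequency in English Wikipedia /
#                  frequency in Simple English Wikipedia) * word length;
# the smaller the value, the simpler the word.
def complexity(word, freq_en, freq_simple):
    return (freq_en[word] / freq_simple[word]) * len(word)

# A pair (difficult, simple) is kept only if
# complexity(simple, ...) < complexity(difficult, ...).
```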
Now we have the substitution list. In order to make the list more practical, each word pair's similarity in meaning is computed by using cosine similarity on their context vectors. This allows the choosing of the most similar simplified word to replace a difficult word with.
Also, each word pair is replaced with a list of all morphological variants of the two words which are consistent with each other. All nouns are replaced with plural and singular versions (so that if the word to replace is in the plural, then the simplified word will also be in the plural) and all verbs are replaced with all variants of tenses and persons. The variants keep the same similarity score as the original pair.
In order to simplify a sentence (which must have at least 7 content words), every word in the sentence which is not inside quotation marks or parentheses is checked for possible simplification.
For every word in the sentence, its context vector inside the sentence is computed just as described above. Now the word is checked for whether it is being used precisely because it is difficult, for example when defining a difficult word. An example given in the paper is "The history of the Han ethnic group is closely tied to that of China", in which the word "Han" should not be replaced with "Chinese" as that would make the sentence lose its meaning. In order to detect these cases, the word is checked for whether it is being used in the way it is commonly used. This is done by checking whether the cosine similarity of the word's context vector in the corpus with the word's context vector in the sentence is greater than 0.1; if so, the word is not simplified.
Next, for every word pair in the substitution list whose first word is the word to simplify, only those whose second word is not in the sentence already are considered. With the remaining substitution candidates, the simpler word is checked for whether it has the same sense as the word it will replace since the same word might have different meanings. To do this, a common context vector is created for both the difficult and simple word. This is done by comparing the value for each word in the corpus context vectors of both words and keeping only the smallest one. This would be the common context vector. Now the cosine similarity of the common context vector with the difficult word's context vector in the sentence is checked for whether it is less than 0.01 and if so then the simple word is not used. Out of all the remaining possible simple words to replace with, the one with the highest similarity is used.
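A sketch of this sense check, with context vectors as dictionaries from words to counts (the names are illustrative):

```python
import math

# The common context vector keeps, for each shared context word, the
# smaller of the two corpus counts.
def common_vector(vec_a, vec_b):
    return {w: min(vec_a[w], vec_b[w]) for w in vec_a if w in vec_b}

def cosine(vec_a, vec_b):
    dot = sum(count * vec_b.get(w, 0) for w, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The simple word is rejected when
# cosine(common_vector(difficult_vec, simple_vec), sentence_vec) < 0.01.
```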
In order to evaluate the simplification system, a baseline was used which just replaces words with their most frequent synonym (assumed to be simpler). Test sentences were found by searching through the English Wikipedia corpus for all sentences for which the system determined that exactly one word could be simplified, that this word had a more frequent synonym (so that the baseline could also simplify it) and that the word chosen by the system differed from the one chosen by the baseline. Out of these, the sentences were grouped according to the substitution pair that was used to simplify them and only one random sentence from each group was used in the evaluation. 65 such sentences were found and these were simplified both by the system and by the baseline, resulting in 130 complex/simplified sentence pairs.
Since the groups from which a random sentence was selected were of different sizes, some substitution pairs are applied more frequently than others. In order to see how the frequency of use of the pairs affects the performance of the system, the sentences were split into 3 "frequency bands" of roughly equal total frequency of use (group size): 46 in the "high" band, 44 in the "med" band and 40 in the "low" band.
In order to be evaluated, the complex/simplified sentence pairs (both system and baseline) were given to 3 human judges (native English speakers who are not the authors) who would determine the grammaticality of the simplified sentence (is it still grammatical?) as "bad", "ok" or "good", the meaning of the simplified sentence (does it still mean the same thing?) as "yes" or "no" and the simplification of the simplified sentence (is it simpler?) as "yes" or "no". No two annotators were given both a system and baseline complex/simplified pair and a small portion of the pairs were given to more than one judge in order to check if they agree with each other.
Here is a table showing all the results. Under grammaticality, the percentage in brackets is the percentage of "ok" ratings and the percentage outside the brackets is the percentage of "good" ratings.
| Type | Frequency | Grammaticality | Meaning | Simplification |
|---|---|---|---|---|
| Baseline | High | 63.33(+20)% | 46.67% | 50% |
| System | High | 76.67(+6.66)% | 63.33% | 73.33% |
| Baseline | Med | 75(+7.14)% | 67.86% | 42.86% |
| System | Med | 72.41(+17.25)% | 75.86% | 82.76% |
| Baseline | Low | 73.08(+11.54)% | 53.85% | 46.15% |
| System | Low | 85.19(+0)% | 48.15% | 70.37% |
Grammaticality is about the same for the system and the baseline because the form of the word is determined using a morphological engine rather than using context. Meaning preservation suffers in the low frequency band because there are fewer examples from which to build a quality context vector for determining when a particular word can be used in a particular context. In simplification, the system consistently outperforms the baseline.
Sunday, September 30, 2012
The Birthday Paradox
The birthday paradox concerns the probability of finding two people who share the same birthday in a group of randomly chosen people. The paradox is that it takes surprisingly few people to get a high probability. This is because you are looking for any two people with the same birthday rather than looking for one person with a particular birthday. The calculation assumes that every day of the year has an equal chance of being someone's birthday, which isn't quite the case since there are more birthdays during the summer months, probably because nine months before then are the autumn months which, I assume, is when couples start spending more time inside and it's not too cold yet.
The same calculation is also used to find the probability of a hash collision. A hash collision is when a hashing function gives the same hash value for different inputs. One case where this is a problem is when hashes are used to represent contracts which are stored at notaries in order to keep the notary from reading the contract. When the contract needs to be read, the contract is sent to the notary, who will then check whether the hash is still the same, meaning that it wasn't changed. Of course the contract could still be changed in a way that gives the same hash value, but while this is hard to do for a particular contract, finding some pair of contracts which give the same hash value is easier, and so it is important that the probability of this happening is taken into consideration.
So, assuming that birthdays are uniformly distributed across all days of the year and that a year has 365 days, the probability of 2 random people having the same birthday is 1/365 (because it's like asking what are the chances of the second person's birthday being on a particular date) whilst the probability of two people among 366 random people, or more, having the same birthday is 1 (because there are only 365 possible birthdays). But what is the probability of anything in between?
If we have "n" people, then the question is what is the probability of at least one of the those people having a birthday which is the same as another different person among the "n" people. In order to avoid checking for when 2 people share the same birthday, or 3 people, or 4 and so on, we will instead be checking what the probability of no one in the group sharing a birthday is and then subtract it from 1 in order to find what the probability of that not happening is. So we will find what the probability of no one sharing a birthday not happening is.
So what is the probability that no one among 3 people shares a birthday? The first person can have any birthday they want, so their birthday can be on any day of the year, giving them a probability of 365/365 of having the right birthday. The second person can have any birthday except that of the first, so they have a probability of 364/365 of having the right birthday. The third can have any birthday except those of the first and the second, so they have a probability of 363/365 of having the right birthday. All these events must happen and they are independent (the birthday of one person will not change the probability of any of the other people having a right birthday), so we just multiply them together. However, we want this whole event to not happen, so that we get the probability of finding at least two people who share a birthday:
P(3) = 1 - ( 365/365 x 364/365 x 363/365 )
P(3) = 1 - ( (365 x 364 x 363)/(365^3) )
P(3) ≈ 0.00548
What about for 4 people?
P(4) = 1 - ( 365/365 x 364/365 x 363/365 x 362/365 )
P(4) = 1 - ( (365 x 364 x 363 x 362)/(365^4) )
P(4) ≈ 0.01636
What about for "n" people?
P(n) = 1 - ( 365/365 x 364/365 x 363/365 x ... x (365 - n + 1)/365 )
P(n) = 1 - ( (365 x 364 x 363 x ... x (365 - n + 1))/(365^n) )
We can simplify this further by using factorials. Since 7! = 7x6x5x4x3x2x1 and 4! = 4x3x2x1, we can find what 7x6x5 is by calculating 7!/4!, so that the parts of 7! that we don't want get cancelled out by 4!. In general,
n x (n-1) x (n-2) x ... x (n-k+1) = n!/(n-k)!
So we can now use this to continue the simplification:
P(n) = 1 - ( (365 x 364 x 363 x ... x (365 - n + 1))/(365^n) )
P(n) = 1 - ( (365!/(365-n)!)/(365^n) )
We can further simplify this by using combinations. Since
n!/(k! x (n-k)!) = nCk,
we can turn any a!/(a-b)! into (b! x a!)/(b! x (a-b)!) by multiplying top and bottom by b!, and subsequently turn it into b! x (a!/(b! x (a-b)!)) which becomes b! x aCb. So now we can use this to continue the simplification:
P(n) = 1 - ( (365!/(365-n)!)/(365^n) )
P(n) = 1 - ( (n! x 365Cn)/(365^n) )
Now that we've simplified the probability formula, we can find that 23 is the smallest number of people for which the probability is at least 50% (P(23) ≈ 0.507). So in a crowd of 23 people you will probably find two who share the same birthday.
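A quick numerical check of the formula, computing the product incrementally to avoid the huge factorials (a minimal sketch; the function name is just for illustration):

```python
# P(n) = 1 - (365/365 x 364/365 x ... x (365 - n + 1)/365)
def birthday_probability(n):
    p_no_shared = 1.0
    for k in range(n):
        p_no_shared *= (365 - k) / 365
    return 1 - p_no_shared

print(round(birthday_probability(22), 3))  # 0.476
print(round(birthday_probability(23), 3))  # 0.507, the first value over 50%
```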