## Wednesday, December 26, 2012

### Summary of research paper "Automatic Retrieval and Clustering of Similar Words" by Dekang Lin

This is a summary of the research paper http://webdocs.cs.ualberta.ca/~lindek/papers/acl98.pdf. This paper describes a way to measure the similarity between words using their contexts. It also describes a way to evaluate the similarities by comparing with existing thesauruses.

The similarity between two words is calculated by checking how the words are used in a corpus. To characterise how a word is used, the grammatical relations between the word and the other words in the sentences where it occurs are used. For example, these are the grammatical relations for the word "dog" in the sentence "I have a brown dog":

(dog obj-of have): it is an object of the verb "have"
(dog det a): it takes the determiner "a"

The way the relations are represented is called "dependency triples". By counting these dependency triples for every word in a corpus, the frequencies of each relation can be used to determine whether two words are used in the same way and, by extension, whether they mean the same thing. This is called the Distributional Hypothesis, which says that words that appear in similar contexts are semantically similar. These dependency triples are collected using a broad coverage dependency parser.

In order to describe how these frequencies are used in calculations, the following notation is used: ||w,r,w'|| is the frequency of the dependency triple (w r w'), that is, of word "w" being related to word "w'" with the relation "r". For example, ||cell, subj-of, absorb|| is the number of times the word "cell" is used as a subject of the word "absorb" in the corpus. Asterisks are used as wild cards, that is, to say that it doesn't matter what takes the asterisk's place. For example, ||cell, subj-of, *|| is the number of times the word "cell" is used as a subject of some other word, ||cell, *, absorb|| is the number of times the word "cell" is somehow related to the word "absorb", ||*, subj-of, absorb|| is the number of times any word is used as a subject of the word "absorb" and ||*, *, *|| is the total number of dependency triples of any kind.
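This counting scheme can be sketched with a small Python snippet. The triples below are made-up toy data, not taken from the paper's corpus:

```python
from collections import Counter

# Hypothetical dependency triples extracted by a parser (toy data).
triples = [
    ("cell", "subj-of", "absorb"),
    ("cell", "subj-of", "divide"),
    ("molecule", "subj-of", "absorb"),
    ("cell", "obj-of", "observe"),
]

counts = Counter(triples)

def freq(w, r, wp):
    """||w, r, w'|| with "*" acting as a wildcard in any slot."""
    return sum(c for (tw, tr, twp), c in counts.items()
               if w in ("*", tw) and r in ("*", tr) and wp in ("*", twp))

print(freq("cell", "subj-of", "absorb"))  # ||cell, subj-of, absorb|| -> 1
print(freq("cell", "subj-of", "*"))       # ||cell, subj-of, *|| -> 2
print(freq("*", "*", "*"))                # total number of triples -> 4
```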

In order to determine how similar two words are based on the frequency of their relations, a formula from another paper by the same author called Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity (http://aclweb.org/anthology-new/P/P97/P97-1009.pdf) is used which says that

similarity(A, B) = log P(common(A, B)) / log P(describe(A, B))

This means that the similarity between two words A and B is the amount of information needed to describe what is common between both words divided by the amount of information needed to describe both words. The less information the words have in common, the less similar they are.

The amount of information needed to describe a word is the sum of the information in each of its dependency triples. The amount of information in a dependency triple (w, r, w') is -log(expected probability that a randomly selected dependency triple is (w, r, w')) - (-log(actual probability that a randomly selected dependency triple is (w, r, w'))).

The actual probability of (w, r, w') is ||w, r, w'|| / ||*, *, *||.

The expected probability of (w, r, w') can be estimated using the probability that a randomly selected relation is "r", that a randomly selected word is "w" given that the relation is "r" and that another randomly selected word is "w'" given that the relation is "r". These probabilities are assumed to be independent and are simply multiplied together to find the probability of the whole dependency triple. This probability is P(r)P(w|r)P(w'|r) which equals (||*, r, *|| / ||*, *, *||) × (||w, r, *|| / ||*, r, *||) × (||*, r, w'|| / ||*, r, *||).

So the amount of information needed to describe a dependency triple is I(w,r,w') = log((||w,r,w'|| × ||*,r,*||) / (||w,r,*|| × ||*,r,w'||))

We can now finally define the function which gives the similarity between two words.

Let "T(w)" be a function which when given a word "w" will return the set of pairs "(r, w')" which make I(w,r,w') positive.

The similarity between 2 words, "w1" and "w2" is given by the function sim(w1, w2) = sum(I(w1, r, w) + I(w2, r, w) for (r,w) in (T(w1) ∩ T(w2))) / (sum(I(w1, r, w) for (r,w) in T(w1)) + sum(I(w2, r, w) for (r,w) in T(w2)))

With our function sim(w1, w2) we can now start measuring how similar two words are based on their relations to other words.
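As a rough sketch of how the definitions above fit together, here is a toy Python implementation of I(w,r,w') and sim(w1,w2). The words and counts are invented for illustration only, not taken from the paper:

```python
import math
from collections import Counter

# Toy triple counts: (w, r, w') -> frequency (hypothetical numbers).
counts = Counter({
    ("cell", "subj-of", "absorb"): 4,
    ("cell", "subj-of", "divide"): 2,
    ("tissue", "subj-of", "absorb"): 2,
    ("stone", "subj-of", "fall"): 4,
})

def freq(w, r, wp):
    """||w, r, w'|| with "*" acting as a wildcard in any slot."""
    return sum(c for (tw, tr, twp), c in counts.items()
               if w in ("*", tw) and r in ("*", tr) and wp in ("*", twp))

def I(w, r, wp):
    # I(w,r,w') = log( (||w,r,w'|| * ||*,r,*||) / (||w,r,*|| * ||*,r,w'||) )
    num = freq(w, r, wp) * freq("*", r, "*")
    den = freq(w, r, "*") * freq("*", r, wp)
    return math.log(num / den) if num > 0 and den > 0 else 0.0

def T(w):
    """Set of (r, w') pairs for which I(w, r, w') is positive."""
    return {(r, wp) for (tw, r, wp) in counts if tw == w and I(w, r, wp) > 0}

def sim(w1, w2):
    common = T(w1) & T(w2)
    num = sum(I(w1, r, wp) + I(w2, r, wp) for (r, wp) in common)
    den = (sum(I(w1, r, wp) for (r, wp) in T(w1)) +
           sum(I(w2, r, wp) for (r, wp) in T(w2)))
    return num / den if den else 0.0

print(sim("cell", "tissue"))  # shares (subj-of, absorb), so between 0 and 1
print(sim("cell", "stone"))   # no shared triples, so 0.0
```

A word compared with itself gets similarity 1, since the common triples are all of its triples.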

A 64 million word corpus was parsed using a broad coverage dependency parser which extracted 56.5 million dependency triples. For each noun, verb, adjective and adverb, the top 200 most similar words were found. The resulting thesaurus of similar words consists of these 200 most similar words for each word sorted in descending order of their similarity together with their similarity, for example:

brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04 ...


In order to determine the quality of the similarity measure, its values were compared to existing thesauruses Roget and WordNet in order to see if the similar words determined by the measure are the same as the similar words in the thesauruses. Other similarity measures were also used in order to compare them all together.

These other similarity measures were:

Cosine similarity:
sim_cos(w1, w2) = |T(w1) ∩ T(w2)| / sqrt(|T(w1)| × |T(w2)|)

Hindle similarity, which is the same as that proposed in a paper called "Noun Classification from Predicate-Argument Structures", except that it does not use dependency triples which have negative information (the measure only uses dependency triples whose relation is subj-of or obj-of):
sim_hindle(w1, w2) = sum(min(I(w1, r, w), I(w2, r, w)) for (r,w) in (T(w1) ∩ T(w2)) if r in {subj-of, obj-of})

and Hindle_r which is the same as Hindle except that all dependency relations are used:
sim_hindle_r(w1, w2) = sum(min(I(w1, r, w), I(w2, r, w)) for (r,w) in (T(w1) ∩ T(w2)))
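The three alternative measures can be sketched over a small example. Here T(w) is represented as a dict mapping each (relation, word) pair to its information value I(w, r, w'); the values below are invented for illustration:

```python
import math

# Toy T(w) sets with hypothetical information values.
T1 = {("subj-of", "absorb"): 0.8, ("obj-of", "observe"): 0.5, ("det", "a"): 0.1}
T2 = {("subj-of", "absorb"): 0.6, ("obj-of", "observe"): 0.9, ("det", "the"): 0.2}

def sim_cos(T1, T2):
    # |T(w1) intersect T(w2)| / sqrt(|T(w1)| * |T(w2)|)
    shared = T1.keys() & T2.keys()
    return len(shared) / math.sqrt(len(T1) * len(T2))

def sim_hindle(T1, T2, relations=("subj-of", "obj-of")):
    # Sum of min informations, restricted to subj-of and obj-of relations.
    shared = T1.keys() & T2.keys()
    return sum(min(T1[k], T2[k]) for k in shared if k[0] in relations)

def sim_hindle_r(T1, T2):
    # Same as Hindle but over all dependency relations.
    shared = T1.keys() & T2.keys()
    return sum(min(T1[k], T2[k]) for k in shared)

print(sim_cos(T1, T2))       # 2 shared pairs out of 3 each -> 2/3
print(sim_hindle(T1, T2))    # min(0.8, 0.6) + min(0.5, 0.9) = 1.1
print(sim_hindle_r(T1, T2))  # same here: no shared pairs outside subj-of/obj-of
```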

In order to compare the similarities given by these measures with the thesauruses, the thesauruses needed to have a quantified similarity between their similar words. For this reason, a similarity measure which gives the similarity between two words according to each thesaurus was developed.

For WordNet, the similarity measure uses the super classes of the different senses of words. Every word sense in WordNet belongs to a hierarchy of increasingly general classes; for the paper's "hill" and "coast" example, the hierarchy looks something like this:

hill → natural elevation → geological formation → natural object → inanimate object → entity
coast → shore → geological formation → natural object → inanimate object → entity

Next to each class is the probability that a randomly selected word belongs to that class, found by counting the number of words in each class. In order to quantify the similarity between words in WordNet, the most specific class which is common to the two words is used, which in the case of "hill" and "coast" is "geological formation". By finding the probability of this class for every pair of senses (meanings) of the two words, converting the probabilities of those classes into information and taking the maximum of the informations, the similarity of the two words is computed. Formally, this is as follows:
sim_WN(w1, w2) = max(2 log(P(most_specific_common_class(s1, s2))) / (log(P(s1)) + log(P(s2))) for s1 in senses(w1) for s2 in senses(w2))

For Roget, the Roget categories of the words are used. The more words the two categories have in common, the more similar the words are. This is measured as follows:
sim_Roget(w1, w2) = 2|RogetCategory(w1) ∩ RogetCategory(w2)| / (|RogetCategory(w1)| + |RogetCategory(w2)|)

Once these measures were defined, the thesauruses were transformed into a form similar to the generated ones. Here is an example:

brief (noun) according to "sim" thesaurus: affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04

brief (noun) according to "sim_WN" thesaurus: outline 0.96, instrument 0.84, summary 0.84, affidavit 0.80, deposition 0.80, law 0.77, survey 0.74, sketch 0.74, resume 0.74, argument 0.74

brief (noun) according to "sim_Roget" thesaurus: recital 0.77, saga 0.77, autobiography 0.77, anecdote 0.77, novel 0.77, novelist 0.77, tradition 0.70, historian 0.70, tale 0.64


In order to determine the similarity between each set of similar words, the following formula was used:
entries_similarity(entries1, entries2) = sum(entry1.similarity × entry2.similarity for entry1 in entries1 for entry2 in entries2 if entry1.word == entry2.word) / sqrt(sum(entry1.similarity^2 for entry1 in entries1) × sum(entry2.similarity^2 for entry2 in entries2))

For example, the similarity between the entries for "brief" using "sim" thesaurus and using "sim_WN" thesaurus is:
sum(0.13 × 0.80, 0.05 × 0.80) / sqrt(sum(0.13^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.05^2, 0.04^2, 0.04^2, 0.04^2) × sum(0.96^2, 0.84^2, 0.84^2, 0.80^2, 0.80^2, 0.77^2, 0.74^2, 0.74^2, 0.74^2, 0.74^2)) = 0.297215762
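This calculation can be checked with a short Python sketch of entries_similarity, using the two "brief" entries listed above (only "affidavit" and "deposition" appear in both):

```python
import math

# The "brief" (noun) entries from the "sim" and "sim_WN" thesauruses above.
sim_entry = {"affidavit": 0.13, "petition": 0.05, "memorandum": 0.05,
             "motion": 0.05, "lawsuit": 0.05, "deposition": 0.05,
             "slight": 0.05, "prospectus": 0.04, "document": 0.04, "paper": 0.04}
wn_entry = {"outline": 0.96, "instrument": 0.84, "summary": 0.84,
            "affidavit": 0.80, "deposition": 0.80, "law": 0.77,
            "survey": 0.74, "sketch": 0.74, "resume": 0.74, "argument": 0.74}

def entries_similarity(e1, e2):
    # Cosine similarity over the two entries, matching words by name.
    shared = e1.keys() & e2.keys()
    num = sum(e1[w] * e2[w] for w in shared)
    den = math.sqrt(sum(v * v for v in e1.values()) *
                    sum(v * v for v in e2.values()))
    return num / den

print(round(entries_similarity(sim_entry, wn_entry), 6))  # 0.297216
```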

The thesauruses were compared and only words which occurred more than 100 times and which were found in all thesauruses were used to evaluate. The average similarity between entries in the thesauruses was calculated.

When compared to WordNet entries:
sim 0.212199
Hindle 0.204179
cosine 0.199402
Roget 0.178397
Hindle_r 0.164716

When compared to Roget entries:
WordNet 0.178397
sim 0.149045
Hindle 0.14663
cosine 0.135697
Hindle_r 0.115489

It can be seen that the "sim" thesaurus is always better than the other generated thesauruses, but WordNet is closer to Roget than the "sim" thesaurus is.

## Wednesday, December 12, 2012

### An equation for enumerating integers

In my last post I described Cantor's way of proving that some infinite lists are bigger than others. I said that this is done by showing that some infinite lists, which are called countably infinite, can be paired up with the natural numbers from 0 onward. I shall now describe an equation that gives the integer that is paired up with a particular natural number.

The integers are (..., -2, -1, 0, 1, 2, ...)
We can pair up integers with natural numbers by pairing all the even natural numbers with positive integers and all the odd natural numbers with negative integers, like this:
0:0, 1:-1, 2:1, 3:-2, 4:2, ...

An equation which maps the natural numbers to the integers in this way is as follows:
y = -(x+1)/2*(x mod 2) + x/2*((x+1) mod 2)

"mod" is an operator called a modulo operator which gives the resulting remainder for a number divided by another. For example 4 mod 2 gives 0 and 5 mod 2 gives 1.

"x mod 2" will give 1 when x is odd and 0 when x is even.
"(x+1) mod 2" will give 1 when x is even and 0 when x is odd.

These are used in order to multiply them with terms that will deal with either the odd natural numbers or the even natural numbers.

"-(x+1)/2" will work when x is odd and will return -1 for 1, -2 for 3, -3 for 5, etc.
"x/2" will work when x is even and will return 1 for 2, 2 for 4, 3 for 6, etc.

These are used to pair up the negative integers with the odd natural numbers and the positive integers with the even natural numbers.

"-(x+1)/2*(x mod 2)" will give the negative integers for odd natural numbers and 0 for even natural numbers, as "-(x+1)/2" will be multiplied by 1 for odd naturals and by 0 for even naturals.
"x/2*((x+1) mod 2)" will give the positive integers for even natural numbers and 0 for odd natural numbers, as "x/2" will be multiplied by 1 for even naturals and by 0 for odd naturals.

By adding them together, since both terms will alternate between 0 and their signed integers, the sum will be an alternating sequence of positive and negative numbers for every natural number x. When x is 0, the result of the sum will also be 0 as both terms will be multiplied by 0.

The equation in Python 3 is as follows:

for x in range(20):
    print(x, ":", -(x+1)//2*(x % 2) + x//2*((x+1) % 2))


And here is the output:
0 : 0
1 : -1
2 : 1
3 : -2
4 : 2
5 : -3
6 : 3
7 : -4
8 : 4
9 : -5
10 : 5
11 : -6
12 : 6
13 : -7
14 : 7
15 : -8
16 : 8
17 : -9
18 : 9
19 : -10


## Friday, November 16, 2012

### Countable and uncountable infinities

I love this topic of mathematics. Can infinite lists be compared in size? Let's start from finite lists.

How can you tell that the list [a,b,c] is smaller than the list [w,x,y,z]? You can count them both and then compare the sizes, or you can check whether every object in the first list can be paired with a different object in the second list. If we try, we'd end up with something like this:
[a:w, b:x, c:y, ?:z]

Since not every object in the first list can be paired up with the objects in the second, we know that the first is smaller. This is related to the "pigeon hole principle", which says that if you try to associate every object in a list with every object in a smaller list, then you'd have to reuse some objects in the smaller list.

So, since we cannot count the number of objects in infinite lists, we can only try to see what happens if we try to pair up the objects in the lists.

Let's start with the simplest case. Is the list of whole numbers starting from 0 [0,1,2,...] larger than the list of whole numbers starting from 1 [1,2,3,...]? Remember that both lists are infinite in size, that is, there is no last number in either list, so you cannot reach the end of either list. So if we pair up the numbers by first, second, etc., we get this pairing:
[0:1, 1:2, 2:3, 3:4, ...]

So it is clear that every object can be paired up in both lists and so both lists are of equal size. You might not think that this makes sense because clearly the first list contains all of the numbers in the second list. That's true, but it is also true that no number will ever be left out if we try to pair them up, so determining which list is bigger by checking which list contains which is not useful for infinite lists.

A more accurate proof for showing that both lists are of equal size is by writing the equation which gives the number in the second list that is paired up with a particular number in the first list. In this case, if we call the number in the first list y and the number in the second list x, such that our pairing list would be [(x:y), (x:y), (x:y), ...], then y = x+1.

What about the list of even numbers [0,2,4,...] and the list of whole numbers starting from 0 [0,1,2,...]? Well we can pair up the even numbers with their halves in the second list, like this:
[0:0, 1:2, 2:4, 3:6, ...]

Will any number be left out? Doesn't look like it will, given that there is no last number in either list.

In this case, the relationship between each pair is y = 2x.

What about the list of positive and negative whole numbers [...,-2,-1,0,1,2,...] and the list of whole numbers starting from 0 [0,1,2,...]? We can pair up the zeros together and then pair up the positive numbers with the even numbers and the negative numbers with the odd numbers, like this:
[0:0, 1:-1, 2:1, 3:-2, 4:2, ...]

Can you think of an equation which gives you which number in the second list is mapped with a particular number in the first? It requires the use of the modulo operator which gives the remainder that would result when two numbers are divided together. For example 3 mod 2 gives 1 and 4 mod 2 gives 0.

What about fractions [1/1, 1/2, 1/3, ..., 2/1, 2/2, 2/3, ..., ...] with whole numbers starting from 0 [0,1,2,...]? Well, the way we ordered the fractions makes it look impossible, but if we look at the fractions as a grid like this:
1/1 1/2 1/3 1/4 ...
2/1 2/2 2/3 2/4 ...
3/1 3/2 3/3 3/4 ...
4/1 4/2 4/3 4/4 ...
...

Then we see that what we were doing was looking at the fractions row by row. But we can instead look at the diagonals. We start from the corner 1/1, then we take 2/1 and 1/2, then 3/1, 2/2 and 1/3, and we keep on going from one diagonal to the next. We end up with this pairing:
[0:1/1, 1:2/1, 2:1/2, 3:3/1, 4:2/2, ...]

Can you think of an equation to determine which fraction is paired up with a particular whole number? It requires the use of "triangular numbers" in order to know which whole number starts a new diagonal in the grid.
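Here is one way to sketch this diagonal walk in Python, using triangular numbers to find which diagonal a given whole number falls on (the function name is my own):

```python
def fraction_for(n):
    """Return the fraction (numerator, denominator) paired with
    natural number n, walking the grid diagonal by diagonal from 1/1."""
    d = 0  # diagonal index; diagonal d holds d+1 fractions
    # Triangular number (d+1)(d+2)/2 is where diagonal d+1 starts.
    while (d + 1) * (d + 2) // 2 <= n:
        d += 1
    i = n - d * (d + 1) // 2  # position within diagonal d
    # Along a diagonal the numerator decreases and the denominator increases.
    return (d + 1 - i, 1 + i)

print([fraction_for(n) for n in range(5)])
# [(1, 1), (2, 1), (1, 2), (3, 1), (2, 2)]
```

This reproduces the pairing [0:1/1, 1:2/1, 2:1/2, 3:3/1, 4:2/2, ...] given above.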

What about the list of all triples of numbers [(1,1,1), (1,1,2), ..., (1,2,1), (1,2,2), ..., (1,3,1), (1,3,2), ..., (2,1,1), (2,1,2), ...] and the list of all whole numbers starting from 0 [0,1,2,...]? Again, this is just a matter of changing the order just like the fractions. This time instead of a grid we have a cube of triples, which cannot be shown in text. So instead we need to order them in diagonal squares around the corner of the cube.

Can you think of an equation to determine which triple, quadruple or any size of number group will be paired with a particular whole number?

You may be thinking that any infinite list is of equal size to the whole numbers starting from zero. But what about the list of all numbers, including irrational numbers, that is, numbers whose digits after the decimal point never end and never repeat, such as pi?

Let's assume that there is such a pairing. For example:
[0:0.000000000..., 1:0.100000000..., 2:1.002040000..., ...]
Can you think of a number in the list of all numbers which will not be a part of the pairings? Since the list of pairings is infinite in size and the number of digits in each number in the list is also infinite, we have an infinite grid, like this:
0:0.000000000...
1:0.100000000...
2:1.002040000...
...


Since we have this infinite grid of digits, we can create the following number: Take the diagonal of the grid going from the top left corner downwards. If the next digit in the diagonal is a 0, the next digit in our number must be a 9, but if it is any other digit then the next digit in our number must be a 0. In this way our new number will be different from every number in the grid. Here's an example:
0:0.000000000...
1:0.100000000...
2:1.002040000...
...


0:0.#########...
1:#.1########...
2:#.#0#######...
...


First digit in the diagonal is a 0, so the first digit in our new number must be a 9. Second digit in the diagonal is a 1, so the second digit in our new number must be a 0, third digit must be a 9, etc. So our new number in this case will look like this:
9.09...
This number will not be in the grid. Even if we included it, we can find another number which isn't in the grid using the same method. No matter how many new numbers we add to the grid, we can always find another number which isn't in the grid.
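This construction can be sketched in a few lines of Python, applied to the digits of the example grid (decimal points dropped, and the grid truncated to the digits shown):

```python
def diagonal_number(grid):
    """Given a list of digit strings (one number per row), build a digit
    string that differs from row i at digit position i: a 0 on the
    diagonal becomes a 9, anything else becomes a 0."""
    digits = []
    for i, row in enumerate(grid):
        digits.append("9" if row[i] == "0" else "0")
    return "".join(digits)

# The example grid's digits (hypothetical pairing from the text).
grid = ["0000000000",
        "0100000000",
        "1002040000"]
print(diagonal_number(grid))  # "909", matching the 9.09... in the text
```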

So there is no way we can pair up all the numbers with all the whole numbers starting from 0, or with all the whole numbers (positive and negative) or all the fractions. You can always find numbers which weren't paired up.

Infinite lists which can be paired up with all the whole numbers are called "countable infinite lists". These lists have a way to be ordered in such a way that no object in the list will be left out if you check every object in the list from the beginning. Infinite lists which cannot be paired up in this way are called "uncountable infinite lists". These lists cannot be ordered in such a way that you can reach any object by moving forward from the start.

This is a very interesting concept which can be extended to ask whether every number can be generated using computer programs, as pi can. The number of computer programs is countable but the number of numbers isn't, so no.

Want to learn more? Look up Georg Cantor, who came up with these concepts.

## Monday, October 8, 2012

### Summary of research paper "Putting it Simply: a Context-Aware Approach to Lexical Simplification" by Or Biran, Samuel Brody and Noémie Elhadad

This is a summary of the research paper www.aclweb.org/anthology/P11-2087. This paper describes a way to lexically simplify text using unsupervised methods. Lexical simplification is the process of replacing difficult words with simpler words which approximately mean the same thing in context. The method described in this paper does not require a parallel corpus; it only requires text from complex and simplified documents in order to compare the word counts in each.

The first thing that was done was to download Wikipedia and Simple English Wikipedia articles (3,266,245 articles and 60,100 articles respectively), and the following preprocessing tasks were performed on them:

• HTML tags were removed
• Tables and figures were removed (with associated text)
• Text was tokenized
• All words were converted to lower case
• Sentences were split

The last 3 steps were carried out using the Stanford NLP Package.

The system consists of first learning simplification rules (finding pairs of difficult/simpler words) and then applying them by finding which is the best word to replace a difficult word. We shall now look at each of these stages.

To learn the simplification rules, a 10-token window (10 tokens on each side) which crosses sentence boundaries (confirmed through private communication with the authors) was used to construct a context vector of each token (excluding stop words, numbers and punctuation) in the English Wikipedia corpus. All numbers were represented with an artificial token such as "<number>" which allows you to count all instances of any number next to a token. Only frequencies greater than 2 were kept in the context vector.

In order to create a list of possible word substitutions, first a Cartesian product of all words in the English Wikipedia corpus was created, which was then filtered in the following ways in order to keep only word pairs which are simplification replacements:
• All morphological variants were removed by checking their lemmas with MorphAdorner.
• In order to find morphological variants which are not handled by MorphAdorner, all word pairs where one word is a prefix of the other followed by one of the suffixes "s", "es", "ed", "ly", "er" or "ing" were also removed.
• WordNet was used to remove all word pairs where the first word is not a synonym or hypernym of the second.
• In order to keep only word pairs where the second word is simpler than the first, lexical complexity is measured as the frequency of the word in the English Wikipedia corpus divided by its frequency in the Simple English Wikipedia corpus, multiplied by the length of the word. The reasoning is that the more often a word appears in Simple English Wikipedia relative to English Wikipedia, and the shorter the word is, the simpler it is, and so the smaller its complexity measure. All word pairs where the second word is harder than the first are removed.
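The complexity measure in the last step can be sketched as follows; the words and corpus frequencies are made up for illustration:

```python
# Hypothetical corpus frequencies (invented numbers, not from the paper).
freq_english = {"utilize": 900, "use": 5000}
freq_simple = {"utilize": 30, "use": 4000}

def complexity(word):
    # (frequency in English Wikipedia / frequency in Simple English Wikipedia)
    # multiplied by word length: common-in-Simple, short words score lower.
    return freq_english[word] / freq_simple[word] * len(word)

pair = ("utilize", "use")  # (difficult word, candidate replacement)
print(complexity("utilize"))  # 900/30 * 7 = 210.0
print(complexity("use"))      # 5000/4000 * 3 = 3.75
# The pair is kept only because the second word scores as simpler.
print(complexity(pair[1]) < complexity(pair[0]))  # True
```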

Now we have the substitution list. In order to make the list more practical, each word pair's similarity in meaning is computed by using cosine similarity on their context vectors. This allows the choosing of the most similar simplified word to replace a difficult word with.

Also, each word pair is replaced with a list of all morphological variants of the two words which are consistent with each other. All nouns are replaced with plural and singular versions (so that if the word to replace is in the plural, then the simplified word will also be in the plural) and all verbs are replaced with all variants of tenses and persons. The variants keep the same similarity score as the original pairs.

In order to simplify a sentence (which must have at least 7 content words), every word in the sentence, which is not inside quotation marks or parenthesis, is checked for possible simplification.

For every word in the sentence, its context vector inside the sentence is computed just as described above. Now the word is checked for whether it is used precisely because it is difficult, for example when defining a difficult word. An example given in the paper is "The history of the Han ethnic group is closely tied to that of China" in which the word "Han" should not be replaced with "Chinese" as that would make the sentence lose its meaning. In order to detect these cases, the word is checked if it is used in a way that it is commonly used. This is done by checking if the cosine similarity of the word's context vector in the corpus with the word's context vector in the sentence is greater than 0.1 and if so then the word is not simplified.

Next, for every word pair in the substitution list whose first word is the word to simplify, only those whose second word is not in the sentence already are considered. With the remaining substitution candidates, the simpler word is checked for whether it has the same sense as the word it will replace since the same word might have different meanings. To do this, a common context vector is created for both the difficult and simple word. This is done by comparing the value for each word in the corpus context vectors of both words and keeping only the smallest one. This would be the common context vector. Now the cosine similarity of the common context vector with the difficult word's context vector in the sentence is checked for whether it is less than 0.01 and if so then the simple word is not used. Out of all the remaining possible simple words to replace with, the one with the highest similarity is used.
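The common context vector and the final cosine check can be sketched like this; the vectors and threshold usage follow the description above, but the words and weights are toy values:

```python
import math

# Toy corpus context vectors: context word -> weight (hypothetical values).
vec_difficult = {"court": 5.0, "file": 3.0, "judge": 2.0}
vec_simple = {"court": 2.0, "file": 4.0, "page": 1.0}

def common_vector(v1, v2):
    # Keep each shared context word with the smaller of the two values.
    return {w: min(v1[w], v2[w]) for w in v1.keys() & v2.keys()}

def cosine(v1, v2):
    shared = v1.keys() & v2.keys()
    num = sum(v1[w] * v2[w] for w in shared)
    den = math.sqrt(sum(x * x for x in v1.values()) *
                    sum(x * x for x in v2.values()))
    return num / den if den else 0.0

common = common_vector(vec_difficult, vec_simple)
print(common)  # only "court" and "file" are shared, each at its minimum value

# The candidate simple word is rejected if the cosine of the common vector
# with the difficult word's in-sentence context vector is below 0.01.
sentence_vector = {"court": 1.0, "file": 1.0}
print(cosine(common, sentence_vector) >= 0.01)  # True: candidate kept
```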

In order to evaluate the simplification system, a baseline was used which just replaces words with their most frequent synonym (assumed to be simpler). Test sentences were found for simplification by searching through the English Wikipedia corpus for all sentences that the system determined that only one word can be simplified, that this word had a more frequent synonym (to allow the baseline to simplify also) and that the replaced word by the system and the baseline were different. Out of these, the sentences were grouped according to the substitution pair that was used to simplify them and then only a random sentence from each group was used in the evaluation. 65 such sentences were found and these were simplified both by the system and by the baseline, resulting in 130 complex/simplified sentences.

Since the groups from which a random sentence was selected were of different sizes, different substitution pairs are applied more frequently than others. In order to see how the frequency of use of the pairs affects the performance of the system, the sentences were grouped into 3 "frequency bands" of roughly equal frequency of use (size of group): 46 in the "high" band, 44 in the "med" band and 40 in the "low".

In order to be evaluated, the complex/simplified sentence pairs (both system and baseline) were given to 3 human judges (native English speakers who are not the authors) who would determine the grammaticality of the simplified sentence (is it still grammatical?) as "bad", "ok" or "good", the meaning of the simplified sentence (does it still mean the same thing?) as "yes" or "no" and the simplification of the simplified sentence (is it simpler?) as "yes" or "no". No two annotators were given both a system and baseline complex/simplified pair and a small portion of the pairs were given to more than one judge in order to check if they agree with each other.

Here is a table showing all the results. Under grammaticality, the percentage in brackets is the percentage of "ok" ratings and the percentage outside the brackets is the percentage of "good" ratings.

| Type | Frequency | Grammaticality | Meaning | Simplification |
|---|---|---|---|---|
| Baseline | High | 63.33(+20)% | 46.67% | 50% |
| System | High | 76.67(+6.66)% | 63.33% | 73.33% |
| Baseline | Med | 75(+7.14)% | 67.86% | 42.86% |
| System | Med | 72.41(+17.25)% | 75.86% | 82.76% |
| Baseline | Low | 73.08(+11.54)% | 53.85% | 46.15% |
| System | Low | 85.19(+0)% | 48.15% | 70.37% |

Grammaticality is about the same for the system and the baseline because the form of the word is determined using a morphological engine rather than using context. Meaning preservation suffers in the low frequency band because there are fewer examples from which to build a quality context vector for determining when a particular word can be used in a particular context. Simplification consistently outperforms the baseline.

## Sunday, September 30, 2012

### The birthday paradox

The birthday paradox concerns the probability of finding two people who share the same birthday in a group of randomly chosen people. The paradox is that surprisingly few people are needed to get a high probability. This is because you are looking for any two people with the same birthday rather than for one person with a particular birthday. It assumes that every day of the year has an equal chance of being someone's birthday, which isn't quite the case since there are more birthdays during the summer months, probably because nine months before then are the autumn months which, I assume, is when couples start spending more time inside while it's not yet too cold.

The probability is also used to find the probability of a hash collision. A hash collision is when a hashing function gives the same hash value for different inputs. One case where this is a problem is when hashes are used to represent contracts which are stored at notaries in order to keep the notary from reading the contract. When the contract needs to be read, the contract is sent to the notary who will then check if the hash is still the same, meaning that it wasn't changed. Of course it could still be changed and give the same hash value, but while this is hard to do for a particular contract, finding a pair of contracts which give the same hash value is easier and so it is important that the probability of this happening is taken into consideration.

So, assuming that birthdays are uniformly distributed across all days of the year and that a year has 365 days, the probability of 2 random people having the same birthday is 1/365 (because it's like asking what are the chances of the second person's birthday being on a particular date) whilst the probability of two people among 366 random people, or more, having the same birthday is 1 (because there are only 365 possible birthdays). But what is the probability of anything in between?

If we have "n" people, then the question is: what is the probability that at least one of those people has the same birthday as another, different person among the "n" people? In order to avoid separately checking for when exactly 2 people share a birthday, or 3, or 4 and so on, we will instead find the probability that no one in the group shares a birthday and then subtract it from 1, which gives the probability of that not happening, that is, of at least one shared birthday.

So what is the probability that no one among 3 people shares a birthday? The first person can have any birthday they want so their birthday can be in any day of the year, giving them a probability of 365/365 of having the right birthday. The second person can have any birthday except the one of the first, so they have a probability of 364/365 of having the right birthday. The third can have any birthday except the ones of the first and the second, so they have a probability of 363/365 of having the right birthday. All these events must happen and they are independent (the birthday of one person will not change the probability of any of the other persons having a right birthday), so we just multiply them together. However we want this whole event to not happen so that we know the probability of finding at least two people who share a birthday, so:
P(3) = 1 - ( 365/365 x 364/365 x 363/365 )
P(3) = 1 - ( (365 x 364 x 363)/(365^3) )
P(3) ≈ 0.00820

P(4) = 1 - ( 365/365 x 364/365 x 363/365 x 362/365 )
P(4) = 1 - ( (365 x 364 x 363 x 362)/(365^4) )
P(4) ≈ 0.01636

P(n) = 1 - ( 365/365 x 364/365 x 363/365 x ... x (365 - n + 1)/365 )
P(n) = 1 - ( (365 x 364 x 363 x ... x (365 - n + 1))/(365^n) )

We can simplify this further by using factorials. Since 7! = 7x6x5x4x3x2x1 and 4! = 4x3x2x1, we can find what 7x6x5 is by calculating 7!/4!, so that parts in 7! that we don't want get cancelled out by 4!. In general,
n x (n-1) x (n-2) x ... x (n-k+1) = n!/(n-k)!
So we can now use this to continue the simplification:

P(n) = 1 - ( (365 x 364 x 363 x ... x (365 - n + 1))/(365^n) )
P(n) = 1 - ( (365!/(365-n)!)/(365^n) )

We can further simplify this by using combinations. Since
n!/(k! x (n-k)!) = nCk,
we can turn any a!/(a-b)! into (b! x a!)/(b! x (a-b)!) by multiplying top and bottom by b! and subsequently turn it into b! x (a!/(b! x (a-b)!)) which becomes b! x aCb. So now we can use this to continue the simplification:

P(n) = 1 - ( (365!/(365-n)!)/(365^n) )
P(n) = 1 - ( (n! x 365Cn)/(365^n) )

Now that we've simplified the probability formula, we can compute that 23 is the smallest number of people for which the probability exceeds 50%, since P(23) ≈ 0.507. So in a crowd of 23 people you will probably find two who share the same birthday.
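The whole derivation can be checked numerically. Here is a small Python sketch of the final formula (the function name is mine), using math.comb for 365Cn:

```python
from math import comb, factorial

def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday."""
    return 1 - (factorial(n) * comb(365, n)) / 365**n

print(round(p_shared_birthday(3), 5))   #0.0082
print(round(p_shared_birthday(23), 5))  #0.5073

#23 is the smallest group size for which the probability exceeds 50%.
assert p_shared_birthday(22) < 0.5 < p_shared_birthday(23)
```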

## Tuesday, August 21, 2012

### Transaction rollback not working on MySQL

Transactions in MySQL, as described in http://dev.mysql.com/doc/refman/5.0/en/commit.html, do not work on MyISAM tables. You need to use a transaction-safe storage engine such as InnoDB; otherwise you can roll back all you like and it will have no effect.

## Wednesday, July 4, 2012

### Java Trie

After receiving so many visits to my early C# Trie post, I am now releasing source code for a Java Trie, which works in exactly the same way as the C# Trie. The comments have been fixed (I discovered lots of incomplete comments) and some optimizations have been implemented. The code will just be pasted in this post rather than placed in a repository like the C# Trie was. More information can be found in the C# Trie post, but notice that IPrefixMatcher was renamed to PrefixMatcher and PrefixMatcher was renamed to DefaultPrefixMatcher.

Trie class
/*
* Copyright (c) 2012 Marc Tanti (www.marctanti.com)
*
* Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package trie;

/**
* Trie data structure which maps strings and string prefixes to values.
* In this implementation, only one PrefixMatcher can be used but it can be modified to use a list of concurrent prefix matchers.
* @param <V> Type of values to map to.
*/
public class Trie <V> {

/**
* The root of the trie.
*/
private TrieNode<V> root;

/**
* PrefixMatcher object for following the prefix of the strings in the trie to their associated values.
*/
private PrefixMatcher<V> prefixMatcher;

/**
* Create an empty trie with an empty root node and an unused prefix matcher.
*/
public Trie()
{
this.root = new TrieNode<V>(null, '\0');
this.prefixMatcher = new DefaultPrefixMatcher<V>(this.root);
}

/**
* Get the prefix matcher to search for strings with in a character by character basis.
* @return The prefix matcher. WARNING: There is only one prefix matcher per trie and calling this method will always return the same instance. The instance will retain its previous state.
*/
public PrefixMatcher<V> getPrefixMatcher() {
return this.prefixMatcher;
}

/**
* Put a new key-value pair, overwriting the existing value if the given key is already in use.
* @param key The full string which is associated with the value.
* @param value The value which is associated with the key.
*/
public void put(String key, V value)
{
TrieNode<V> node = this.root;
for (char c : key.toCharArray())
{
node = node.addChild(c);
}
node.setValue(value);
}

/**
* Remove the value that a key leads to and any redundant nodes which result from this deletion.
* Clears the current matching process.
* @param key Key of the value to remove.
*/
public void remove(String key)
{
TrieNode<V> node = this.root;
for (char c : key.toCharArray())
{
node = node.getChild(c);
if (node == null) {
return; //Key does not exist in the trie, so there is nothing to remove.
}
}
node.setValue(null);

//Remove all ancestor nodes which lead to this null value.
while (node != this.root && !node.isTerminater() && node.isLeaf())
{
char prevKey = node.getKey();
node = node.getParent();
node.removeChild(prevKey);
}

this.prefixMatcher.resetMatch();
}

}


TrieNode class
/*
* Copyright (c) 2012 Marc Tanti (www.marctanti.com)
*
* Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package trie;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

/**
* Node of the trie.
* Stores a link to multiple children of type TrieNode, each associated with a key of type Character.
* If the node terminates a string then it must also store a non-null value of type V.
* @param <V> The type of the value associated with a complete string.
*/
public class TrieNode <V> {

/**
* The value stored by this node. If not null then the node terminates a string.
*/
private V value;

/**
* The key which is associated with this node, that is, the character that led to this node.
*/
private char key;

/**
* The parent of this node.
*/
private TrieNode<V> parent;

/**
* The children of this node which can be found through their associated character.
*/
private HashMap<Character, TrieNode<V>> children;

/**
* Create an empty node with no children and null value.
* @param key The key which is associated with this node, that is, the character that led to this node.
* @param parent The parent node of this node.
*/
public TrieNode(TrieNode<V> parent, char key) {
this.key = key;
this.parent = parent;
this.value = null;
this.children = new HashMap<Character, TrieNode<V>>();
}

/**
* Get the value stored by this node. If the value is not null then the node terminates a string.
* @return The value stored by this node.
*/
public V getValue() {
return value;
}

/**
* Set the value stored by this node. If the value is not null then the node terminates a string.
* @param value The value stored by this node.
*/
public void setValue(V value) {
this.value = value;
}

/**
* Get the key which is associated with this node, that is, the character that led to this node.
* @return The key which is associated with this node.
*/
public char getKey() {
return key;
}

/**
* Get the parent of this node.
* @return The parent of this node.
*/
public TrieNode<V> getParent() {
return parent;
}

/**
* Get a child of this node which is associated with a key.
* @param key Key associated with the child of interest.
* @return The child or null if no child is associated with the given key.
*/
public TrieNode<V> getChild(char key) {
if (this.children.containsKey(key)) {
return this.children.get(key);
}
return null;
}

/**
* Check whether or not this node terminates a string and stores a value.
* @return Whether node stores a value.
*/
public boolean isTerminater() {
return this.value != null;
}

/**
* Get the number of children that this node has.
* @return The number of children.
*/
public int numChildren() {
return this.children.size();
}

/**
* Check whether or not this node has any children.
* @return True if node does not have children, false otherwise.
*/
public boolean isLeaf() {
return this.numChildren() == 0;
}

/**
* Check whether or not one of the children of this node is associated with the given key.
* @param key The key to check for.
* @return True if a child with given key exists, false otherwise.
*/
public boolean containsKey(char key) {
return this.children.containsKey(key);
}

/**
* Add an empty child node (associated with a key) to this node and return the node.
* @param key The key to associate with the empty child node.
* @return If the given key already exists then return the existing child node, otherwise return the new child node.
*/
public TrieNode<V> addChild(char key) {
if (this.children.containsKey(key)) {
return this.children.get(key);
} else {
TrieNode<V> newChild = new TrieNode<V>(this, key);
this.children.put(key, newChild);
return newChild;
}
}

/**
* Remove the child of a node associated with a key along with all its descendents.
* @param key The key associated with the child to remove.
*/
public void removeChild(char key) {
this.children.remove(key);
}

/**
* Get a list of values contained in this node and all its descendants.
* Since all the descendants share this node's string prefix, the values returned belong to all the strings which start with that prefix.
* @return A List of values.
*/
public List<V> prefixMatches() {
ArrayList<V> values = new ArrayList<V>();

for (TrieNode <V> child : this.children.values()) {
values.addAll(child.prefixMatches());
}

if (this.isTerminater()) {
values.add(this.value);
}

return values;
}
}


DefaultPrefixMatcher class
/*
* Copyright (c) 2012 Marc Tanti (www.marctanti.com)
*
* Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package trie;

import java.util.List;

/**
* This class is used to search through a trie for all the strings which have a given prefix. The prefix will be entered character by character.
* The object from this class is used as a state to mark where the currently entered prefix has arrived in the trie thus far.
* This class makes it possible to have a number of matcher objects all concurrently following prefixes in the trie.
* @param <V> The type of the value associated with a complete string.
*/
public class DefaultPrefixMatcher <V> implements PrefixMatcher<V> {

/**
* The root of the trie being searched.
*/
private TrieNode<V> root;

/**
* The trie node that the prefix entered thus far has led to.
*/
private TrieNode<V> currNode;

/**
* The full prefix entered thus far.
*/
private StringBuffer prefix;

/**
* Create a matcher, associating it to the trie to search in.
* @param root Root node of the trie which the matcher will search in.
*/
public DefaultPrefixMatcher(TrieNode<V> root)
{
this.root = root;
this.currNode = root;
this.prefix = new StringBuffer();
}

/**
* Get the prefix entered so far.
* @return The prefix entered so far.
*/
@Override
public String getPrefix() {
return this.prefix.toString();
}

/**
* Clear the currently entered prefix.
*/
@Override
public void resetMatch() {
this.currNode = this.root;
this.prefix = new StringBuffer();
}

/**
* Remove the last character of the currently entered prefix.
*/
@Override
public void backMatch() {
if (this.currNode != this.root)
{
this.currNode = this.currNode.getParent();
this.prefix.deleteCharAt(this.prefix.length() - 1);
}
}

/**
* Get the last character of the currently entered prefix.
* @return The last character of the currently entered prefix.
*/
@Override
public char lastMatch() {
return this.currNode.getKey();
}

/**
* Add another character to the end of the prefix if it leads to some existing strings in the trie.
* If no strings have a matching prefix, character will not be added.
* @param next Next character in the prefix.
* @return True if the new prefix was a prefix to some strings in the trie, false otherwise.
*/
@Override
public boolean nextMatch(char next) {
if (this.currNode.containsKey(next))
{
this.currNode = this.currNode.getChild(next);
this.prefix.append(next);
return true;
}
return false;
}

/**
* Get all the values associated with strings that share the currently entered prefix.
* @return The list of values.
*/
@Override
public List<V> getPrefixMatches() {
return this.currNode.prefixMatches();
}

/**
* Check if the currently entered prefix is a complete existing string in the trie.
* @return True if the currently entered prefix is an existing string, false otherwise.
*/
@Override
public boolean isExactMatch() {
return this.currNode.isTerminater();
}

/**
* Get the value associated with the complete string that equals the prefix entered thus far.
* @return The value if the current entered prefix is an existing string, null otherwise.
*/
@Override
public V getExactMatch() {
if (this.isExactMatch()) {
return this.currNode.getValue();
}
return null;
}

}


PrefixMatcher interface
/*
* Copyright (c) 2012 Marc Tanti (www.marctanti.com)
*
* Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package trie;

import java.util.List;

/**
* A simplified view of the concrete prefix matcher class for the user.
* This is used to search through a trie for all the strings which have a given prefix. The prefix will be entered character by character.
* @param <V> The type of the value associated with a complete string.
*/
public interface PrefixMatcher<V> {

/**
* Get the prefix entered so far.
* @return The prefix entered so far.
*/
String getPrefix();

/**
* Clear the currently entered prefix.
*/
void resetMatch();

/**
* Remove the last character of the currently entered prefix.
*/
void backMatch();

/**
* Get the last character of the currently entered prefix.
* @return The last character of the currently entered prefix.
*/
char lastMatch();

/**
* Add another character to the end of the prefix if it leads to some existing strings in the trie.
* If no strings have a matching prefix, character will not be added.
* @param next Next character in the prefix.
* @return True if the new prefix was a prefix to some strings in the trie, false otherwise.
*/
boolean nextMatch(char next);

/**
* Get all the values associated with strings that share the currently entered prefix.
* @return The list of values.
*/
List<V> getPrefixMatches();

/**
* Check if the currently entered prefix is a complete existing string in the trie.
* @return True if the currently entered prefix is an existing string, false otherwise.
*/
boolean isExactMatch();

/**
* Get the value associated with the complete string that equals the prefix entered thus far.
* @return The value if the currently entered prefix is an existing string, null otherwise.
*/
V getExactMatch();

}


Example
//Create trie
Trie<String> trie = new Trie<String>();

//Add some key-value pairs to the trie
trie.put("James", "1");
trie.put("Jake", "2");
trie.put("Fred", "3");

//Search the trie
trie.getPrefixMatcher().nextMatch('J'); //Prefix thus far: "J"
trie.getPrefixMatcher().getPrefixMatches(); //[1, 2] (order not guaranteed)
trie.getPrefixMatcher().isExactMatch(); //false
trie.getPrefixMatcher().nextMatch('a');
trie.getPrefixMatcher().nextMatch('m'); //Prefix thus far: "Jam"
trie.getPrefixMatcher().getPrefixMatches(); //[1]
trie.getPrefixMatcher().nextMatch('e');
trie.getPrefixMatcher().nextMatch('s'); //Prefix thus far: "James"
trie.getPrefixMatcher().isExactMatch(); //true
trie.getPrefixMatcher().getExactMatch(); //1

//Remove a string-value pair
trie.remove("James");


## Tuesday, June 26, 2012

### Summary of research paper "Simple EnglishWikipedia: A New Text Simplification Task" by William Coster and David Kauchak

This is a summary of the research paper http://www.aclweb.org/anthology-new/P/P11/P11-2117.pdf. This paper describes a way to automatically create a parallel corpus that pairs sentences with a simpler version of the sentences which will be useful for training text simplification systems. The sentences for the corpus come from articles in Wikipedia and Simple English Wikipedia. Simple English Wikipedia is a version of Wikipedia which is edited with the aim of being understandable by children and English learners by encouraging the use of simple vocabulary and grammar. By pairing up equivalent sentences in Wikipedia and Simple English Wikipedia, it was possible to build a 137,000 sentence parallel corpus.

In order to build such a corpus, both Wikipedia sites were downloaded, the articles were paired by title, all pairs that contained an unusable article (e.g. stubs, disambiguation pages, etc.) were removed, the equivalent paragraphs in the article pairs were paired up and finally the equivalent sentences in the paragraph pairs were paired up.

To align paragraphs in the article pairs, where a single paragraph in Simple English Wikipedia could correspond to multiple paragraphs in Wikipedia but not the other way round, the words in each paragraph from the Simple English Wikipedia article were used as a set of keywords to search for all corresponding paragraphs in the Wikipedia article. This was done by using TF-IDF and keeping only those paragraphs which resulted in a vector with cosine similarity to the Simple English Wikipedia paragraph of 0.5 or more. In this way, each paragraph in the Simple English Wikipedia article is paired up with a list of paragraphs in the Wikipedia article which are similar to it.
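The paper specifies TF-IDF with cosine similarity but not the exact implementation. As a rough, pure-Python illustration (the tokenized example documents are made up, and a real system would also normalise and stem the words), the two pieces look something like this:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (a dict) for each tokenized document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  #Document frequency of each word
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(w, 0.0) for (w, weight) in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

#Made-up tokenized paragraphs: the first two overlap heavily, the third does not.
docs = [["the", "cell", "absorbs", "light"],
        ["the", "cell", "absorbs", "energy"],
        ["dogs", "bark", "loudly"]]
v = tfidf_vectors(docs)
assert cosine(v[0], v[1]) > cosine(v[0], v[2])
```

In the paper's setting, a Wikipedia paragraph would be kept as a match for a Simple English Wikipedia paragraph only when this cosine score is 0.5 or more.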

Once the paragraph pairs were found in each article pair, the next step was to pair up the sentences in the paragraph pairs such that in each pair you get a Wikipedia sentence and its simplified counterpart from Simple English Wikipedia. Such a pairing of sentences is called an "alignment". This was harder than pairing the paragraphs because several sentences in Simple English Wikipedia could correspond to several sentences in Wikipedia. Instead of matching sentences one to one, the sentences in the paragraph pairs are aligned in many different ways and the alignment which gives the maximum sum of similarities over all sentence pairs is kept. The similarity of two sentences in a pair is again measured using TF-IDF with cosine similarity. The earlier paragraph pairing was important because it limits the number of different ways the sentences can be aligned.

Following some clarifications via email with the author, here is how the sentence alignment algorithm works:

Let "n" be the number of Wikipedia sentences in a paragraph pair
Let "m" be the number of Simple English Wikipedia sentences in a paragraph pair
Let "sim(i,j)" be a function which measures the similarity between the "i"th Wikipedia sentence and the "j"th Simple English Wikipedia sentence (using TF-IDF with cosine similarity) in a paragraph pair

I shall now introduce a new function "align(n,m)" which is not mentioned in the paper but which will help explain the whole sentence alignment concept.

The function "align(n,m)" gives the best sentence alignment such that the sum of the similarity between each sentence pair is maximized. In order to align the sentences, 6 operations are successively applied to the sentences in the paired paragraphs. The operations are:
• Skip the next sentence in the Wikipedia paragraph (it does not correspond to any of the sentences in the Simple English Wikipedia paragraph)
• Skip the next sentence in the Simple English Wikipedia paragraph (it does not correspond to any of the sentences in the Wikipedia paragraph)
• Pairing the next Wikipedia sentence to the next Simple English Wikipedia sentence
• Pairing the next Wikipedia sentence to the next two Simple English Wikipedia sentences
• Pairing the next two Wikipedia sentences to the next Simple English Wikipedia sentence
• Pairing the next two Wikipedia sentences to the next two Simple English Wikipedia sentences crossed up (for example, if the Wikipedia sentences are [a, b] and the Simple English Wikipedia sentences are [x, y] then the pairing would be [(a,y), (b,x)])

These operations are applied successively using dynamic programming in order to find the alignment with the maximum sum of similarities.

We shall first see the function "a(n,m)" which calculates this maximal similarity and then we shall see "align(n,m)". "a(n,m)" will also apply the operations using dynamic programming, but it will only calculate the sum of the similarities of the sentence pairs rather than the actual pairs. Starting with the number of sentences in each paragraph of the paragraph pair as its arguments, "a(n,m)" will recursively call itself on fewer and fewer sentences until it reaches the base case of having no sentences to align. In accordance with the paper, the "i"th sentence is not 0 based like in arrays but 1 based. So the first sentence is in position 1, the second is in position 2, etc.

This is "a(i,j)" in Python without using dynamic programming (for simplicity):

def a(i, j):
    if i == 0 or j == 0: #Base case: if there are no more sentences to align in the paragraphs then the similarity is 0
        return 0

    else: #Recursive case: if there are sentences to align then check what the maximum similarity achieved will be after applying each operation and return the maximum of the similarities
        alternatives = []

        #Max similarity after skipping next Simple English Wikipedia sentence
        alternatives.append(a(i, j-1) - skip_penalty)

        #Max similarity after skipping next Wikipedia sentence
        alternatives.append(a(i-1, j) - skip_penalty)

        #Max similarity after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
        alternatives.append(a(i-1, j-1) + sim(i, j))

        if j >= 2: #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
            #Max similarity after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
            alternatives.append(a(i-1, j-2) + sim(i, j-1) + sim(i, j))

        if i >= 2: #If there is more than one Wikipedia sentence then try pairing up two such sentences
            #Max similarity after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
            alternatives.append(a(i-2, j-1) + sim(i-1, j) + sim(i, j))

        if i >= 2 and j >= 2: #If there is more than one sentence in both paragraphs then try pairing up two of each
            #Max similarity after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
            alternatives.append(a(i-2, j-2) + sim(i-1, j) + sim(i, j-1))

        return max(alternatives) #Return the maximum of the similarities


"skip_penalty" is a constant (the authors set it to 0.0001) which is used to reduce the total similarity every time a sentence is skipped. This is to discourage skipping too many sentences.

Now "align(n,m)" would be very similar to "a(n,m)". Every "sim(x,y)" there is in "a(n,m)" represents a pairing of the "x"th Wikipedia sentence with the "y"th Simple English Wikipedia sentence. So in "align(n,m)", every time we compute "sim(x,y)" we would add the pair "(x,y)" to our alignment. We still need to know what the maximum similarity of the alignment will be after applying each operation so "align(n,m)" needs to return both the similarity and the alignment.

This is "align(i,j)" in Python without using dynamic programming (for simplicity):

def align(i, j):
    if i == 0 or j == 0: #Base case: if there are no more sentences to align in the paragraphs then the similarity is 0
        return (0, [])

    else: #Recursive case: if there are sentences to align then check what the maximum similarity achieved will be after applying each operation and return the maximum of the similarities
        alternatives = []

        #Best alignment after skipping next Simple English Wikipedia sentence
        (similarity, alignment) = align(i, j-1)
        alternatives.append((similarity - skip_penalty, alignment))

        #Best alignment after skipping next Wikipedia sentence
        (similarity, alignment) = align(i-1, j)
        alternatives.append((similarity - skip_penalty, alignment))

        #Best alignment after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
        (similarity, alignment) = align(i-1, j-1)
        alternatives.append((similarity + sim(i, j), alignment + [(i, j)]))

        if j >= 2: #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
            #Best alignment after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
            (similarity, alignment) = align(i-1, j-2)
            alternatives.append((similarity + sim(i, j-1) + sim(i, j), alignment + [(i, j-1), (i, j)]))

        if i >= 2: #If there is more than one Wikipedia sentence then try pairing up two such sentences
            #Best alignment after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
            (similarity, alignment) = align(i-2, j-1)
            alternatives.append((similarity + sim(i-1, j) + sim(i, j), alignment + [(i-1, j), (i, j)]))

        if i >= 2 and j >= 2: #If there is more than one sentence in both paragraphs then try pairing up two of each
            #Best alignment after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
            (similarity, alignment) = align(i-2, j-2)
            alternatives.append((similarity + sim(i-1, j) + sim(i, j-1), alignment + [(i-1, j), (i, j-1)]))

        return max(alternatives, key=lambda p:p[0]) #Return the alternative with the maximum similarity (the similarity is the first element of each similarity/alignment pair)


And here is the same algorithm using dynamic programming:
def align(i, j):
    memo = dict()

    for x in range(0, i+1):
        memo[(x, 0)] = (0.0, [])
    for y in range(0, j+1):
        memo[(0, y)] = (0.0, [])

    for y in range(1, j+1):
        for x in range(1, i+1):
            alternatives = []

            #Best alignment after skipping next Simple English Wikipedia sentence
            (similarity, alignment) = memo[(x, y-1)]
            alternatives.append((similarity - skip_penalty, alignment))

            #Best alignment after skipping next Wikipedia sentence
            (similarity, alignment) = memo[(x-1, y)]
            alternatives.append((similarity - skip_penalty, alignment))

            #Best alignment after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
            (similarity, alignment) = memo[(x-1, y-1)]
            alternatives.append((similarity + sim(x, y), alignment + [(x, y)]))

            if y >= 2: #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
                #Best alignment after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
                (similarity, alignment) = memo[(x-1, y-2)]
                alternatives.append((similarity + sim(x, y-1) + sim(x, y), alignment + [(x, y-1), (x, y)]))

            if x >= 2: #If there is more than one Wikipedia sentence then try pairing up two such sentences
                #Best alignment after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
                (similarity, alignment) = memo[(x-2, y-1)]
                alternatives.append((similarity + sim(x-1, y) + sim(x, y), alignment + [(x-1, y), (x, y)]))

            if x >= 2 and y >= 2: #If there is more than one sentence in both paragraphs then try pairing up two of each
                #Best alignment after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
                (similarity, alignment) = memo[(x-2, y-2)]
                alternatives.append((similarity + sim(x-1, y) + sim(x, y-1), alignment + [(x-1, y), (x, y-1)]))

            memo[(x, y)] = max(alternatives, key=lambda p:p[0]) #Keep the alternative with the maximum similarity (the similarity is the first element of each similarity/alignment pair)
    return memo[(i, j)]


This function will return the best alignment and its total similarity. The alignment will be a list of sentence position pairs and not the sentence pairs themselves.
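To make the output shape concrete, here is a condensed, self-contained variant of the same dynamic-programming idea, with "sim" and "skip_penalty" passed in as parameters and a made-up similarity function that simply rewards pairing sentence "i" with sentence "i":

```python
def align(n, m, sim, skip_penalty=0.0001):
    """Return (total similarity, alignment) for n Wikipedia and m Simple English Wikipedia sentences."""
    memo = {(x, 0): (0.0, []) for x in range(n + 1)}
    memo.update({(0, y): (0.0, []) for y in range(m + 1)})
    for y in range(1, m + 1):
        for x in range(1, n + 1):
            (s1, a1) = memo[(x, y - 1)]
            (s2, a2) = memo[(x - 1, y)]
            (s3, a3) = memo[(x - 1, y - 1)]
            alternatives = [
                (s1 - skip_penalty, a1),                 #Skip a Simple English Wikipedia sentence
                (s2 - skip_penalty, a2),                 #Skip a Wikipedia sentence
                (s3 + sim(x, y), a3 + [(x, y)]),         #One-to-one pairing
            ]
            if y >= 2: #One Wikipedia sentence paired with two Simple English Wikipedia sentences
                (s, a) = memo[(x - 1, y - 2)]
                alternatives.append((s + sim(x, y - 1) + sim(x, y), a + [(x, y - 1), (x, y)]))
            if x >= 2: #Two Wikipedia sentences paired with one Simple English Wikipedia sentence
                (s, a) = memo[(x - 2, y - 1)]
                alternatives.append((s + sim(x - 1, y) + sim(x, y), a + [(x - 1, y), (x, y)]))
            if x >= 2 and y >= 2: #Two-to-two crossed pairing
                (s, a) = memo[(x - 2, y - 2)]
                alternatives.append((s + sim(x - 1, y) + sim(x, y - 1), a + [(x - 1, y), (x, y - 1)]))
            memo[(x, y)] = max(alternatives, key=lambda p: p[0])
    return memo[(n, m)]

#Toy similarity: sentence i in one paragraph only matches sentence i in the other.
(score, alignment) = align(3, 3, lambda i, j: 1.0 if i == j else 0.0)
#score == 3.0, alignment == [(1, 1), (2, 2), (3, 3)]
```

With a real "sim" based on TF-IDF cosine similarity, the returned alignment is the list of sentence position pairs described above.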

Once the sentences in each paragraph were aligned, the sentence pairs were filtered: any sentence pair with a similarity below 0.5 was discarded. The process went from 10,000 article pairs to 75,000 paragraph pairs to 137,000 sentence pairs. Out of a sample of 100 sentence pairs, human judges determined that 91 of them were correct sentence simplifications. The corpus was also tested on a phrase-based translation system called Moses, which simplified test sentences with a statistically significant improvement.

The final parallel corpus is available at http://www.cs.pomona.edu/~dkauchak/simplification/.

## Saturday, June 23, 2012

### Summary of research paper "Readability Assessment for Text Simplification" by Sandra Aluisio, Lucia Specia, Caroline Gasperin and Carolina Scarton

This is a summary of the research paper http://aclweb.org/anthology-new/W/W10/W10-1001.pdf. This paper is about a program which classifies text into one of three reading difficulty levels: "rudimentary", "basic" and "advanced". These difficulty levels are defined by INAF (Portuguese document at http://www.ibope.com.br/ipm/relatorios/relatorio_inaf_2009.pdf), the National Indicator of Functional Literacy. The aim is to use this program to help authors detect parts of their writing which can be simplified. The program is designed to work with Portuguese rather than English.

The following list of features is extracted from the Portuguese text to assess using various tools and resources described in the paper (copied directly from the paper):
1. Number of words
2. Number of sentences
3. Number of paragraphs
4. Number of verbs
5. Number of nouns
8. Number of pronouns
9. Average number of words per sentence
10. Average number of sentences per paragraph
11. Average number of syllables per word
12. Flesch index for Portuguese
13. Incidence of content words
14. Incidence of functional words
15. Raw Frequency of content words
16. Minimal frequency of content words
17. Average number of verb hypernyms
18. Incidence of NPs
19. Number of NP modifiers
20. Number of words before the main verb
21. Number of high level constituents
22. Number of personal pronouns
23. Type-token ratio
24. Pronoun-NP ratio
25. Number of “e” (and)
26. Number of “ou” (or)
27. Number of “se” (if)
28. Number of negations
29. Number of logic operators
30. Number of connectives
31. Number of positive additive connectives
32. Number of negative additive connectives
33. Number of positive temporal connectives
34. Number of negative temporal connectives
35. Number of positive causal connectives
36. Number of negative causal connectives
37. Number of positive logic connectives
38. Number of negative logic connectives
39. Verb ambiguity ratio
40. Noun ambiguity ratio
43. Incidence of clauses
45. Incidence of apposition
46. Incidence of passive voice
47. Incidence of relative clauses
48. Incidence of coordination
49. Incidence of subordination
50. Out-of-vocabulary words
51. LM probability of unigrams
52. LM perplexity of unigrams
53. LM perplexity of unigrams, without line break
54. LM probability of bigrams
55. LM perplexity of bigrams
56. LM perplexity of bigrams, without line break
57. LM probability of trigrams
58. LM perplexity of trigrams
59. LM perplexity of trigrams, without line break

The features from 1 to 42 are the "Coh-Metrix-PORT" feature group and were derived from the Coh-Metrix-PORT tool, which is a Portuguese adaptation of Coh-Metrix.

The features from 43 to 49 are the "Syntactic" feature group and were added to analyse syntactic constructions which are useful for automatic simplification.

The features from 50 to 59 are the "Language model" feature group. They are derived from n-gram language models of Portuguese: the probability and perplexity of the input text's unigrams, bigrams and trigrams, together with a count of out-of-vocabulary words, that is, words which are not listed in the system's vocabulary.

The features from 1 to 3 and 9 to 11 are also called the "Basic" feature group because they require no linguistic knowledge. Feature 12 is also put into a feature group on its own called "Flesch".

Experiments were performed to check if it is possible to use machine learning techniques to learn to detect the INAF difficulty of reading levels and to discover which features are best to use for complexity detection.

Three types of machine learning algorithms in Weka were evaluated:
• "Standard classification" (SMO): A standard classifier (support vector machines) which assumes that there is no relation between the difficulty levels
• "Ordinal classification" (OrdinalClassClassifier): An ordinal classifier which assumes that the difficulty levels are ordered
• "Regression" (SMO-reg): A regressor which assumes that the difficulty levels are continuous

Manually simplified corpora were used as training data for the algorithms. The original texts were simplified into each of the difficulty levels. Using these simplified corpora it was possible to check whether the system could predict that the original texts are of the "advanced" difficulty level, the lightly simplified texts of the "basic" level and the heavily simplified texts of the "rudimentary" level.

In order to determine the importance of each feature described, the absolute Pearson correlation between each feature and the texts' difficulty levels was computed. The results for the best features below are copied from the paper (1 = highly correlated, 0 = not correlated):
• Words per sentence: 0.693
• Incidence of apposition: 0.688
• Incidence of clauses: 0.614
• Flesch index: 0.580
• Words before main verb: 0.516
• Sentences per paragraph: 0.509
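As an aside, the absolute Pearson correlation behind these scores is simple to compute. Here is a minimal Python sketch; the feature values and the 1-3 encoding of the difficulty levels below are hypothetical, just to illustrate the computation (the paper does not give its raw data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical feature values (average words per sentence) for six texts,
# with difficulty levels encoded as 1=rudimentary, 2=basic, 3=advanced
words_per_sentence = [9.1, 11.0, 14.2, 16.8, 21.5, 24.0]
levels = [1, 1, 2, 2, 3, 3]
score = abs(pearson(words_per_sentence, levels))  # close to 1: highly correlated
```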

In order to determine which machine learning algorithm is best at classifying texts into the correct difficulty levels using the features mentioned earlier, the Pearson correlation between the predicted and true difficulty levels was computed. The experiments were also run using each of the feature sub-groups mentioned above, to see how important each group is. Here are the results, summarized from the paper:

Standard classification

| Feature group | Pearson correlation |
| --- | --- |
| All | 0.84 |
| Language model | 0.25 |
| Basic | 0.76 |
| Syntactic | 0.82 |
| Coh-Metrix-PORT | 0.79 |
| Flesch | 0.52 |

Ordinal classification

| Feature group | Pearson correlation |
| --- | --- |
| All | 0.83 |
| Language model | 0.49 |
| Basic | 0.73 |
| Syntactic | 0.81 |
| Coh-Metrix-PORT | 0.8 |
| Flesch | 0.56 |

Regression

| Feature group | Pearson correlation |
| --- | --- |
| All | 0.8502 |
| Language model | 0.6245 |
| Basic | 0.7266 |
| Syntactic | 0.8063 |
| Coh-Metrix-PORT | 0.8051 |
| Flesch | 0.5772 |

It is clear that using all the features together is better than using any sub-group of them; among the sub-groups, the syntactic features, followed by the Coh-Metrix-PORT features, are the most useful, while the language model feature group is the worst.

The standard classification algorithm was chosen as the best because, although it is the simplest algorithm, its results are comparable to the other algorithms' results.

## Sunday, June 17, 2012

### Rummy and the most probable sequence in a probability tree

Rummy
Lately I have been trying to program a game of Rummy, a card game where you repeatedly replace one of your cards with the card on the deck until you have grouped your cards into sequences, where each sequence is 3 or more cards. The challenge in programming such a game is writing code which determines which of your cards should be discarded and replaced with the one on the deck.

The way this was done was by giving each card in the deck a probability of being available. Cards in the player's hand are given a probability of 1, and cards which are lost in the graveyard or taken by the opponent are given a probability of 0. Each remaining card is given a probability depending on how likely it is that the opponent holds it, given the cards they have taken.

Once each card has a probability of being obtained, we can calculate the probability of every legal sequence of cards in a group and of every possible arrangement of groups. The arrangement with the maximum probability, containing both cards in the player's hand and cards which are likely to be available, is the one to pursue. Any cards in the player's hand which are not part of the most likely arrangement can be discarded and replaced.

Examples of legal sequences of cards which can be used as a group in Rummy are
• ace of diamonds, ace of hearts, ace of spades
• two of hearts, two of clubs, two of diamonds
• two of hearts, two of clubs, two of diamonds, two of spades
• ace of diamonds, two of diamonds, three of diamonds
• etc.

A number of groups needs to be arranged such that the total number of cards equals 10 and no two groups have the same cards. The probability of a card group is the product of the probabilities of each card in the group. The aim is to find an arrangement of card groups that have the maximum probability of being obtained. Obviously since only cards that are already in the player's hand have a probability of 1, the more cards used which are already in the player's hand, the better.
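As a small illustration of this in Python (the card names and availability probabilities below are made up; only the "product of probabilities" rule comes from the discussion above):

```python
from math import prod  # Python 3.8+

def group_prob(group, card_prob):
    """Probability of obtaining a card group: product of its cards' probabilities."""
    return prod(card_prob[card] for card in group)

def arrangement_prob(groups, card_prob):
    """Probability of an arrangement of disjoint groups."""
    return prod(group_prob(group, card_prob) for group in groups)

# hypothetical availability probabilities: cards in hand have probability 1,
# cards taken by the opponent have probability 0
card_prob = {'2H': 1.0, '2C': 1.0, '2D': 0.5, '2S': 0.0}
# group_prob(['2H', '2C', '2D'], card_prob) == 0.5
# group_prob(['2H', '2C', '2S'], card_prob) == 0.0
```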

Since a group cannot contain any cards in another group, we need to treat the choosing of groups as a probability tree, where each node in the tree is a new legal group. As we choose more groups, we have less groups to choose from for the next choice. Since there are so many cards to choose from, an algorithm which quickly chooses the most probable sequence in the probability tree was needed.

The probability tree would look something like this:

Most probable sequence in a probability tree
I'll be expressing the algorithm using Python. I'll be using this Node class to represent a node in the probability tree:
class Node:
    def __init__(self, prob, children):
        self.prob = prob
        self.children = children

Where "prob" is the probability of choosing that node and "children" is a list of child nodes to follow next. Obviously in practice both of these properties can be dynamically generated with callback functions.

After having a tree, we need to represent a root path in the tree which is a list of nodes from the root to one of the leaves. We also need to calculate the probability of the path which is done using this function:
def path_prob(path):
    prob = 1.0
    for node in path:
        prob *= node.prob
    return prob

The probability of nodes in sequence in a probability tree is the product of each node's probability.

We'll start from a simple algorithm, one which generates the paths from the root to every leaf node.
def all_root_paths_nonacc(node):
    if node.children == []:
        yield [node]
    else:
        for child in node.children:
            for subpath in all_root_paths_nonacc(child):
                yield [node] + subpath

It's a naturally recursive generator function which doesn't use an accumulator. It starts from the root and attaches that root node to the front of every path from its children to a leaf. When a leaf is reached, a list consisting of only that leaf is returned.

Once we have this function, we can generate every root path in our probability tree and then find which path has the greatest probability using some code such as:
max(all_root_paths_nonacc(root), key=path_prob)


However, this would be inefficient. We need to be able to choose the most probable path during the recursion, in such a way that we can stop recursing part-way through a path in order to avoid wasting time. I'll explain how this works in a minute. First we need to know which path we are building during the recursion, and to do that we need to use an accumulator, like this:
def all_root_paths(node, prev_path=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        yield curr_path
    else:
        for child in node.children:
            for path in all_root_paths(child, curr_path):
                yield path

This is a modified version of the previous algorithm. It keeps an extra accumulator parameter which starts off as an empty list; the current node is appended to it every time a recursive call is made. That way, at any point in the recursion, we know which path is being built from the root node to the current node. When a leaf is reached, the path is returned.

This algorithm still returns all the root paths in the probability tree which we then have to choose from. Now we shall use an algorithm which returns the most probable path without returning any other path.
def naive_max_root_path(node, prev_path=[], curr_max_path=None):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_path is None or path_prob(curr_path) > path_prob(curr_max_path):
            return curr_path
        else:
            return curr_max_path
    else:
        for child in node.children:
            curr_max_path = naive_max_root_path(child, curr_path, curr_max_path)
        return curr_max_path

Again this is a modification of the previous algorithm. This time we keep two accumulators, one with the path that has been built up to now and one with the path which is the most probable up to now. Every time a full path is built (by reaching a leaf) a choice is made between returning the newly built path and the best current path. The returned path is then used as the current best path. Obviously if the current best path is "None" (null), then it means that we are still generating the first path and so this should be considered the best path.

Even if only the best path is returned, the algorithm checks each and every possible path there is, so this algorithm will not be faster than the other algorithms. We shall now instead use to our advantage the fact that when a probability is multiplied by another probability, the product will be smaller than or equal to both probabilities. What this means is that as we multiply the probabilities of the nodes in a path from first to last, the product will never grow but will either remain the same (if there is a probability sequence of 1s) or will become smaller. So if at any point the probability of a partially built path becomes smaller than the probability of the best path, we can immediately stop building that path and try some other path since its probability will only get even smaller.
def quick_max_root_path(node, prev_path=[], curr_max_path=None):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_path is None or path_prob(curr_path) > path_prob(curr_max_path):
            return curr_path
        else:
            return curr_max_path
    else:
        for child in node.children:
            if curr_max_path is None or path_prob(curr_path + [child]) > path_prob(curr_max_path):
                curr_max_path = quick_max_root_path(child, curr_path, curr_max_path)
        return curr_max_path

As you can see, the only difference between this and the previous algorithm is that a decision is also made in the recursive case when the path is still being built. This algorithm can probably be even faster by avoiding the calls to "path_prob" and passing the probabilities as parameters too. Then with every recursive call, the probability of the current path is the probability of the previous path times the probability of the new node.
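Here is one way that optimization might look, as a sketch (this version is mine, not from the original post): the running probabilities of the current path and of the best path are threaded through the recursion, so "path_prob" is never called.

```python
class Node:
    def __init__(self, prob, children):
        self.prob = prob
        self.children = children

def quick_max_root_path_fast(node, prev_path=[], prev_prob=1.0,
                             curr_max_path=None, curr_max_prob=0.0):
    """Same pruning as quick_max_root_path, but passes the running
    probability as a parameter instead of recomputing it each time."""
    curr_path = prev_path + [node]
    curr_prob = prev_prob * node.prob
    if node.children == []:
        if curr_max_path is None or curr_prob > curr_max_prob:
            return curr_path, curr_prob
        return curr_max_path, curr_max_prob
    for child in node.children:
        # prune: extending the path can only lower its probability
        if curr_max_path is None or curr_prob * child.prob > curr_max_prob:
            curr_max_path, curr_max_prob = quick_max_root_path_fast(
                child, curr_path, curr_prob, curr_max_path, curr_max_prob)
    return curr_max_path, curr_max_prob
```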

Many times, especially in the Rummy case, there is more than one best sequence in a probability tree, all of which have an equal probability. In such cases we might want an algorithm which returns all such paths and not just the first one. Here is a naive version of the algorithm which checks every possible root path:
def naive_max_root_paths(node, prev_path=[], curr_max_paths=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_paths == [] or path_prob(curr_path) > path_prob(curr_max_paths[0]):
            return [curr_path]
        elif path_prob(curr_path) == path_prob(curr_max_paths[0]):
            return curr_max_paths + [curr_path]
        else:
            return curr_max_paths
    else:
        for child in node.children:
            curr_max_paths = naive_max_root_paths(child, curr_path, curr_max_paths)
        return curr_max_paths

Obviously we now need to return a list and so the parameter which keeps the current best path needs to be a list of paths. As can be seen, there are now three cases to consider in the base case which are:
• If the current path is better than the best paths (all paths in "curr_max_paths" have an equal probability), return a list consisting of only the current path.
• If the current path is equal in probability to the best paths, return the list of best paths together with the current path.
• If the best paths are better than the current path, return the best paths only.
So the only difference between this and "naive_max_root_path" is that the path is considered if it is of equal probability to the best and added to a list if so.

We now use the previous trick to make the algorithm quick by skipping paths which are worse than the best mid-way through. However we now need to consider paths which are of equal probability to the best too in the recursive case.
def quick_max_root_paths(node, prev_path=[], curr_max_paths=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_paths == [] or path_prob(curr_path) > path_prob(curr_max_paths[0]):
            return [curr_path]
        elif path_prob(curr_path) == path_prob(curr_max_paths[0]):
            return curr_max_paths + [curr_path]
        else:
            return curr_max_paths
    else:
        for child in node.children:
            if curr_max_paths == [] or path_prob(curr_path + [child]) >= path_prob(curr_max_paths[0]):
                curr_max_paths = quick_max_root_paths(child, curr_path, curr_max_paths)
        return curr_max_paths
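As a quick sanity check of this final algorithm, here is a small self-contained example (repeating the Node class, path_prob and quick_max_root_paths from above) on a tree that has two equally probable best paths:

```python
class Node:
    def __init__(self, prob, children):
        self.prob = prob
        self.children = children

def path_prob(path):
    prob = 1.0
    for node in path:
        prob *= node.prob
    return prob

def quick_max_root_paths(node, prev_path=[], curr_max_paths=[]):
    curr_path = prev_path + [node]
    if node.children == []:
        if curr_max_paths == [] or path_prob(curr_path) > path_prob(curr_max_paths[0]):
            return [curr_path]
        elif path_prob(curr_path) == path_prob(curr_max_paths[0]):
            return curr_max_paths + [curr_path]
        else:
            return curr_max_paths
    else:
        for child in node.children:
            if curr_max_paths == [] or path_prob(curr_path + [child]) >= path_prob(curr_max_paths[0]):
                curr_max_paths = quick_max_root_paths(child, curr_path, curr_max_paths)
        return curr_max_paths

# two branches give probability 0.3, a third gives only 0.18
root = Node(1.0, [Node(0.5, [Node(0.6, [])]),
                  Node(0.6, [Node(0.5, [])]),
                  Node(0.2, [Node(0.9, [])])])
best = quick_max_root_paths(root)  # two paths, each with probability 0.3
```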


Conclusion
As already explained, a path in the probability tree corresponds to a choice of card groups in Rummy, and we need to find which arrangement of card groups is the most likely to be available and thus the one the player should aim for. Now that we can find the most probable arrangement of card groups, we know which cards can be discarded. And since, using the last algorithm, we know all the arrangements that share the best probability, we can even narrow the cards to discard down to those which are needed the least across all the alternative best arrangements.

## Friday, May 25, 2012

### Recurring numbers and finite state automata

Today we shall be talking about recurring numbers and we'll be explaining them using finite state automata.

A recurring number is generated by dividing two whole numbers to obtain a decimal whose digits after the point never end. For example, if we divide 1 by 3, we get 0.333333333... Remember that long division works digit by digit: we divide each digit of the dividend by the divisor, write the whole-number result as a digit of the answer and carry the remainder onto the next digit, like so:

1 can be written as 1.00000..., which is what we'll do.

\------------
3/1 . 0 0 0 0

1 divided by 3 is 0 remainder 1, so we write 0 above and put 1 as a carry on the next digit.

0
\-----------
3/1 . 10 0 0

10 divided by 3 is 3 remainder 1, so we write 3 above and put 1 as a carry on the next digit.

0 . 3
\------------
3/1 . 10 10 0

10 divided by 3 is 3 remainder 1, so we write 3 above and put 1 as a carry on the next digit.

0 . 3  3
\-------------
3/1 . 10 10 10

and so on.


But recurring numbers can sometimes be less simple, such as 1/7 which is 0.142857142857142857...:

1 can be written as 1.00000..., which is what we'll do.

\------------
7/1 . 0 0 0 0

0
\-----------
7/1 . 10 0 0

0 . 1
\------------
7/1 . 10 30 0

...

0 . 1  4  2  8  5  7
\-------------------------
7/1 . 10 30 20 60 40 50 10

...

0 . 1  4  2  8  5  7  1  4  2  8  5  7
\-------------------------------------------
7/1 . 10 30 20 60 40 50 10 30 20 60 40 50 10


So why does this happen? Why don't the digits keep changing in different ways, instead of repeating a sequence? It's because the next digit depends on the remainder the previous digit left, so if the previous digit leaves a remainder which was already left by an earlier digit, then the new digit and new remainder will be the same as the earlier ones. For example, in this case, the remainder was a 1, but the first remainder left was also a 1, so the sequence will repeat itself:

0 . 1  4  2  8  5  7
\-------------------------
7/1 . 10 30 20 60 40 50 10

0 . 1  4  2  8  5  7  1
\-------------------------
7/1 . 10 30 20 60 40 50 10

0 . 1  4  2  8  5  7  1  4
\---------------------------
7/1 . 10 30 20 60 40 50 10 30

...

0 . 1  4  2  8  5  7  1  4  2  8  5  7
\-------------------------------------------
7/1 . 10 30 20 60 40 50 10 30 20 60 40 50 10


More technically, this is because the digits are generated using a finite state, the remainder, and when you try to generate an infinite sequence using a finite state, eventually you must get repetition, since you must reuse a previously used state at some point in the infinite sequence.

We can draw this idea using a finite state automaton where the states are the remainders left by a digit and the transition gives the next digit. For example, this is what the finite state automaton for 1/7 is:

Remember, the numbers in the circles are remainders, which are used as carries in front of zeros, and the numbers on the arrows are digits in the result. So we start from a remainder of 1 which results from when we start dividing 1 by 7, and start jumping from remainder to remainder. Now we can either end up with a remainder of 0, in which case the result would terminate and will not be recurring, or it will jump to a remainder which has already been traversed, in which case it will enter an infinite loop, which results in a recurring number. As the transitions show, in the case of dividing by 7, the second case is inevitable, regardless of which non-zero remainder we start with.

Since a divisor can only leave one of a finite number of remainders, the result must either be a terminating number or a recurring number; it cannot keep changing forever. Furthermore, since dividing by a number n can only leave the remainders 0 to n-1 (x/2 can only leave 0 or 1 as remainders, x/3 can only leave 0, 1 or 2, etc.), and since a remainder of 0 would terminate the number, the length of the recurring part cannot be longer than one less than the denominator. For example, for all x, x/7 cannot have a recurring part longer than 6 digits. Likewise, for all x, x/100 cannot have a recurring part longer than 99 digits.
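This argument is easy to test in code. The sketch below (my own illustration, in Python) performs the long division while recording each remainder, and stops as soon as a remainder repeats or becomes 0:

```python
def decimal_expansion(num, den):
    """Digits of num/den after the point, by long division. Returns the digits
    and the index where the recurring part starts, or None if it terminates."""
    digits = []
    seen = {}  # remainder -> index of the digit it produced
    rem = num % den
    while rem != 0 and rem not in seen:
        seen[rem] = len(digits)
        rem *= 10            # carry the remainder onto the next zero
        digits.append(rem // den)
        rem %= den
    return digits, (None if rem == 0 else seen[rem])

# 1/7 gives digits [1, 4, 2, 8, 5, 7] recurring from index 0: 0.(142857)
# 1/6 gives digits [1, 6] recurring from index 1: 0.1(6)
# 1/8 gives digits [1, 2, 5] and None: 0.125 terminates
```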

Notice that the finite state automaton as used here only works for when we start adding zeros after the point during division. For example, if we divide the following:

\--------------------
7/1 . 1  2  3  0  0  0


then the finite state automaton can only be used once we start using the zeros at the end, since the numbers in the automaton assume that the remainders are carried onto zeros. But even if we divide another recurring number, that recurring number can be converted to a fraction, and the quotient of two fractions is another fraction, so the finite state automaton can also be used on recurring numbers. For example, if we divide the following:

\--------------------
7/0 . 3  3  3  3  3  3 ...


then we can convert 0.3333... into 1/3, the division can be converted to the fraction (1/3)/7 which can be simplified to 1/(3x7) = 1/21. Now we can use the finite state automaton on 1/21 and reason on when the result should start recurring.

Finally we see that if we divide an irrational number then the result could also be irrational since the state is not finite anymore but will be changing forever with every digit in the divided number.

This is the basis of the pumping lemma for finite state automata, which says that if the number of letters (or digits in this case) generated by a finite state automaton exceeds the number of states in the automaton, then some previously visited state must be reused, so the letters cannot keep changing arbitrarily forever but must follow a regular pattern from that point on. In this case we have an even stronger pumping lemma, which says that the digits will not only follow a regular pattern after the number of states is exceeded but will forever repeat the same sequence. This is because there is only one transition going out of each state, since only one remainder can be left when dividing a number, so once a state is revisited there is no other path to follow except the previously visited path.

## Tuesday, May 1, 2012

### Summary of research paper "Chargaff’s “Grammar of Biology”: New Fractal-like Rules" by Yamagishi & Herai

This is a summary of the research paper http://arxiv.org/ftp/arxiv/papers/1112/1112.1528.pdf. Although it has "grammar of biology" in its title, it's basically about patterns in the frequencies of the 4 nucleotides in DNA: Adenine, Thymine, Cytosine and Guanine.

In case you don't know about it, DNA is a chain of these 4 types of molecules called nucleotides. Each nucleotide is represented by a single letter, the first letter of its name. So "TATA" can be a part of a DNA sequence, for example. DNA is made of 2 strands twisted around each other into a double helix. The two strands must be "complementary", such that if the first nucleotide on one strand is an "A" then the other strand must start with a "T", if it is a "T" then the other must be an "A", if it is a "C" then the other must be a "G" and if it is a "G" then the other must be a "C", and the same goes for each nucleotide position in each strand.

Erwin Chargaff had discovered 2 simple patterns regarding the frequencies of the nucleotides in most organisms.

Chargaff's first parity rule says that if you count the number of each nucleotide on each strand, you'll find that the number of A nucleotides on one strand is equal to the number of T nucleotides on the other strand and so on for each complementary nucleotide. This is because of what was said in the previous paragraph.

The second parity rule says that the first rule also applies within a single strand, but only approximately. So the number of A nucleotides on a strand is approximately equal to the number of T nucleotides on the same strand, and so on for each complementary nucleotide. It is less well understood why this should happen. The problem is that if it were because the nucleotides are uniformly random, then not only would the numbers of A and T nucleotides and the numbers of G and C nucleotides be approximately equal, but the two groups would also be equal to each other, which is not the case. So the number of A nucleotides is approximately equal to the number of T nucleotides but not to the number of C or G nucleotides.

The second parity rule was later discovered to be a special case of a more general rule which says that on a given strand of DNA, the frequency of a sequence of nucleotides is approximately equal to the frequency of the reverse of the complement of that sequence. For example, the number of times "TAG" appears in a strand of DNA is approximately equal to the number of times "CTA" appears in the same strand.

The research paper being discussed adds some new rules about the frequencies of the nucleotides.

Let's start from some notation.
• Let w be a nucleotide sequence such as "ATCGAT", S be a DNA strand and S' be the other strand of the DNA.
• |p| is the length of the string p (could be a word or a strand). For example |"ATA"| = 3.
• F(w, S) is the fraction of the number of times w occurs in S over |S|-|w|+1. In other words, it is calculated using a sliding window and the number of times the word is found is divided by the number of times the window is moved. For example F("AA", "AATAAA") = 3/5. (This was confirmed in private conversation with the authors of the paper)
• C(w) is the complementary word of w nucleotide to nucleotide. For example C("ATCG") = "TAGC".
• R(w) is the reverse of w. For example R("ATCG") = "GCTA".
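These definitions are straightforward to transcribe into Python; the sketch below is a direct reading of the notation above:

```python
def C(w):
    """Complement of w, nucleotide by nucleotide."""
    comp = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(comp[n] for n in w)

def R(w):
    """Reverse of w."""
    return w[::-1]

def F(w, S):
    """Sliding-window frequency of w in S: occurrences over |S|-|w|+1 positions."""
    positions = len(S) - len(w) + 1
    count = sum(1 for i in range(positions) if S[i:i + len(w)] == w)
    return count / positions

# F("AA", "AATAAA") = 3/5, matching the example above
```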

Using this notation, Chargaff's rules can be represented as:
1: F(w, S) = F(C(w), S')
2: F(w, S) ≈ F(R(C(w)), S) (generalized version)

What the authors of the paper did was organize all possible words of a certain length into a table which they called a "Math table", such that each row holds a new word, its complement, its reverse and its reverse complement, with each word appearing only once in the whole table. Since each element of the table is unique, the words in the first column, which generate the other columns, are called a "generating set" or "G", and each word in "G" is labelled "g". According to the authors of the paper, there is no special rule for determining which words go into G.

For example here is a Math table for words of length 2:
| g | C(g) | R(g) | C(R(g)) |
| --- | --- | --- | --- |
| AA | TT | - | - |
| AT | TA | - | - |
| CC | GG | - | - |
| CG | GC | - | - |
| AC | TG | CA | GT |
| AG | TC | GA | CT |
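Since the authors say there is no special rule for choosing which words go into G, a sketch that builds such a table can simply take the words in alphabetical order and skip any word that has already appeared (this construction is my own reading of the paper's description, not code from the paper):

```python
from itertools import product

def math_table(n):
    """Arrange all length-n words into rows (g, C(g), R(g), C(R(g))),
    with each word appearing only once in the whole table."""
    comp = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

    def C(w):
        return ''.join(comp[c] for c in w)

    def R(w):
        return w[::-1]

    used, rows = set(), []
    for g in (''.join(p) for p in product('ACGT', repeat=n)):
        if g in used:
            continue  # already placed in an earlier row
        row = [g, C(g), R(g), C(R(g))]
        used.update(row)
        # blank out repeats within a row, as in the table above
        seen, display = set(), []
        for w in row:
            display.append(w if w not in seen else '-')
            seen.add(w)
        rows.append(display)
    return rows
```

For length-2 words this produces 6 rows covering all 16 words exactly once, matching the table above up to the order of the rows.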

By summing the frequencies of each column in the table it was determined that:
Eq1: SUM(F(g, S) for g in G) ≈ SUM(F(C(g), S) for g in G)
Eq2: SUM(F(R(g), S) for g in G) ≈ SUM(F(C(R(g)), S) for g in G)

Note: I'm not sure if the way G is chosen makes a difference or if the blanks in the Math table affect the summation.

The authors of the paper say that other identities were discovered using the Math table, but they do not mention them.

Since SUM(F(g, S) + F(C(g), S) + F(R(g), S) + F(C(R(g)), S) for g in G) = 100% (assuming that R(g) returns blank, and hence F(R(g), S) and F(C(R(g)), S) give 0, when R(g) is blank in the Math table),

SUM(F(g, S) + F(C(g), S) for g in G) + SUM(F(R(g), S) + F(C(R(g)), S) for g in G) = 100%

Using Eq1 and Eq2, we can turn this equation into

SUM(2F(g, S) for g in G) + SUM(2F(R(g), S) for g in G) ≈ 100%
2SUM(F(g, S) for g in G) + 2SUM(F(R(g), S) for g in G) ≈ 100%
Eq3: SUM(F(g, S) for g in G) + SUM(F(R(g), S) for g in G) ≈ 50%
Or
Eq4: SUM(F(C(g), S) for g in G) + SUM(F(C(R(g)), S) for g in G) ≈ 50%

Eq3 was confirmed on the DNA of a variety of different species.