In order to build such a corpus, both Wikipedia sites were downloaded and the articles were paired by title. All pairs containing an unusable article (e.g. stubs and disambiguation pages) were removed, then the equivalent paragraphs in each article pair were paired up, and finally the equivalent sentences in each paragraph pair were paired up.
Paragraphs were aligned under the assumption that a single paragraph in Simple English Wikipedia can correspond to multiple paragraphs in Wikipedia, but not the other way round. The words in each paragraph of the Simple English Wikipedia article were used as a set of keywords to search for corresponding paragraphs in the Wikipedia article: every paragraph was represented as a TF-IDF vector, and only those Wikipedia paragraphs whose vector had a cosine similarity of 0.5 or more to the Simple English Wikipedia paragraph's vector were kept. In this way, each paragraph in the Simple English Wikipedia article is paired up with a list of paragraphs in the Wikipedia article which are similar to it.
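To make this concrete, here is a minimal sketch of the paragraph pairing step. It assumes scikit-learn is available; the function name "pair_paragraphs" and the choice of vectorizer are my own, since the paper only specifies TF-IDF vectors, cosine similarity and the 0.5 threshold.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def pair_paragraphs(simple_paragraphs, normal_paragraphs, threshold=0.5):
        #Fit a single TF-IDF vocabulary over all paragraphs so that the vectors are comparable
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(simple_paragraphs + normal_paragraphs)
        simple_vecs = vectors[:len(simple_paragraphs)]
        normal_vecs = vectors[len(simple_paragraphs):]

        #similarities[i,j] is the cosine similarity between Simple English Wikipedia paragraph i and Wikipedia paragraph j
        similarities = cosine_similarity(simple_vecs, normal_vecs)

        #Pair each Simple English Wikipedia paragraph with every Wikipedia paragraph scoring at least the threshold
        return [
            (i, [j for (j, s) in enumerate(row) if s >= threshold])
            for (i, row) in enumerate(similarities)
        ]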
Once the paragraph pairs were found in each article pair, the next step was to pair up the sentences in the paragraph pairs such that each pair consists of a Wikipedia sentence and its simplified counterpart from Simple English Wikipedia. Such a pairing of sentences is called an "alignment". This is harder than pairing paragraphs because now multiple sentences in Simple English Wikipedia can correspond to multiple sentences in Wikipedia. Therefore, the sentences in the paragraph pairs are paired up in many different ways and the alignment which gives the maximum sum of similarities over its sentence pairs is used. The similarity of two sentences in a pair is again measured using TF-IDF with cosine similarity. The paragraph pairing was important because it limits the number of different ways in which the sentences can be aligned.
Following some clarifications via email with the author, here is how the sentence alignment algorithm works:
Let "n" be the number of Wikipedia sentences in a paragraph pair
Let "m" be the number of Simple English Wikipedia sentences in a paragraph pair
Let "sim(i,j)" be a function which measures the similarity between the "i"th Wikipedia sentence and the "j"th Simple English Wikipedia sentence (using TF-IDF with cosine similarity) in a paragraph pair
I shall now introduce a new function "align(n,m)" which is not mentioned in the paper but which will help explain the whole sentence alignment concept.
The function "align(n,m)" gives the best sentence alignment such that the sum of the similarity between each sentence pair is maximized. In order to align the sentences, 6 operations are successively applied to the sentences in the paired paragraphs. The operations are:
- Skip the next sentence in the Wikipedia paragraph (it does not correspond to any of the sentences in the Simple English Wikipedia paragraph)
- Skip the next sentence in the Simple English Wikipedia paragraph (it does not correspond to any of the sentences in the Wikipedia paragraph)
- Pair the next Wikipedia sentence with the next Simple English Wikipedia sentence
- Pair the next Wikipedia sentence with the next two Simple English Wikipedia sentences
- Pair the next two Wikipedia sentences with the next Simple English Wikipedia sentence
- Pair the next two Wikipedia sentences with the next two Simple English Wikipedia sentences, crossed up (for example, if the Wikipedia sentences are [a, b] and the Simple English Wikipedia sentences are [x, y] then the pairing would be [(a,y), (b,x)])
These operations are applied successively using dynamic programming in order to find the alignment with the maximum sum of similarities.
We shall first see the function "a(n,m)", which calculates this maximal similarity, and then we shall see "align(n,m)". "a(n,m)" will also apply the operations using dynamic programming, but it will only calculate the sum of the similarities of the sentence pairs rather than the actual pairs. Starting with the number of sentences in each paragraph of the paragraph pair as its arguments, "a(n,m)" will recursively call itself on fewer and fewer sentences until it reaches the base case of having no sentences left to align. In accordance with the paper, the "i"th sentence is not 0-based like in arrays but 1-based. So the first sentence is in position 1, the second is in position 2, etc.
This is "a(i,j)" in Python without using dynamic programming (for simplicity):
    def a(i, j):
        if i == 0 or j == 0:
            #Base case: if there are no more sentences to align in the paragraphs then the similarity is 0
            return 0
        else:
            #Recursive case: if there are sentences to align then check what the maximum similarity achieved will be after applying each operation and return the maximum of the similarities
            alternatives = []

            #Max similarity after skipping next Simple English Wikipedia sentence
            alternatives.append(a(i, j-1) - skip_penalty)

            #Max similarity after skipping next Wikipedia sentence
            alternatives.append(a(i-1, j) - skip_penalty)

            #Max similarity after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
            alternatives.append(a(i-1, j-1) + sim(i, j))

            if j >= 2:
                #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
                #Max similarity after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
                alternatives.append(a(i-1, j-2) + sim(i, j-1) + sim(i, j))

            if i >= 2:
                #If there is more than one Wikipedia sentence then try pairing up two such sentences
                #Max similarity after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
                alternatives.append(a(i-2, j-1) + sim(i-1, j) + sim(i, j))

            if i >= 2 and j >= 2:
                #If there is more than one sentence in both paragraphs then try pairing up two of each
                #Max similarity after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
                alternatives.append(a(i-2, j-2) + sim(i-1, j) + sim(i, j-1))

            #Return the maximum of the similarities
            return max(alternatives)
"skip_penalty" is a constant (the authors set it to 0.0001) which is used to reduce the total similarity every time a sentence is skipped. This is to discourage skipping too many sentences.
Now "align(n,m)" would be very similar to "a(n,m)". Every "sim(x,y)" there is in "a(n,m)" represents a pairing of the "x"th Wikipedia sentence with the "y"th Simple English Wikipedia sentence. So in "align(n,m)", every time we compute "sim(x,y)" we would add the pair "(x,y)" to our alignment. We still need to know what the maximum similarity of the alignment will be after applying each operation so "align(n,m)" needs to return both the similarity and the alignment.
This is "align(i,j)" in Python without using dynamic programming (for simplicity):
    def align(i, j):
        if i == 0 or j == 0:
            #Base case: if there are no more sentences to align in the paragraphs then the similarity is 0 and the alignment is empty
            return (0, [])
        else:
            #Recursive case: if there are sentences to align then check what the maximum similarity achieved will be after applying each operation and return the best of the alternatives
            alternatives = []

            #Best alignment after skipping next Simple English Wikipedia sentence
            (similarity, alignment) = align(i, j-1)
            alternatives.append((similarity - skip_penalty, alignment))

            #Best alignment after skipping next Wikipedia sentence
            (similarity, alignment) = align(i-1, j)
            alternatives.append((similarity - skip_penalty, alignment))

            #Best alignment after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
            (similarity, alignment) = align(i-1, j-1)
            alternatives.append((similarity + sim(i, j), alignment + [(i, j)]))

            if j >= 2:
                #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
                #Best alignment after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
                (similarity, alignment) = align(i-1, j-2)
                alternatives.append((similarity + sim(i, j-1) + sim(i, j), alignment + [(i, j-1), (i, j)]))

            if i >= 2:
                #If there is more than one Wikipedia sentence then try pairing up two such sentences
                #Best alignment after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
                (similarity, alignment) = align(i-2, j-1)
                alternatives.append((similarity + sim(i-1, j) + sim(i, j), alignment + [(i-1, j), (i, j)]))

            if i >= 2 and j >= 2:
                #If there is more than one sentence in both paragraphs then try pairing up two of each
                #Best alignment after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
                (similarity, alignment) = align(i-2, j-2)
                alternatives.append((similarity + sim(i-1, j) + sim(i, j-1), alignment + [(i-1, j), (i, j-1)]))

            #Return the alternative with the maximum similarity (the similarity is the first element of each similarity/alignment pair)
            return max(alternatives, key=lambda p:p[0])
And here is the same algorithm using dynamic programming:
    def align(i, j):
        memo = dict()

        #Base cases: when one of the paragraphs has no sentences left, the similarity is 0 and the alignment is empty
        for x in range(0, i+1):
            memo[(x, 0)] = (0.0, [])
        for y in range(0, j+1):
            memo[(0, y)] = (0.0, [])

        for y in range(1, j+1):
            for x in range(1, i+1):
                alternatives = []

                #Best alignment after skipping next Simple English Wikipedia sentence
                (similarity, alignment) = memo[(x, y-1)]
                alternatives.append((similarity - skip_penalty, alignment))

                #Best alignment after skipping next Wikipedia sentence
                (similarity, alignment) = memo[(x-1, y)]
                alternatives.append((similarity - skip_penalty, alignment))

                #Best alignment after pairing next Wikipedia sentence with next Simple English Wikipedia sentence
                (similarity, alignment) = memo[(x-1, y-1)]
                alternatives.append((similarity + sim(x, y), alignment + [(x, y)]))

                if y >= 2:
                    #If there is more than one Simple English Wikipedia sentence then try pairing up two such sentences
                    #Best alignment after pairing next Wikipedia sentence with next two Simple English Wikipedia sentences
                    (similarity, alignment) = memo[(x-1, y-2)]
                    alternatives.append((similarity + sim(x, y-1) + sim(x, y), alignment + [(x, y-1), (x, y)]))

                if x >= 2:
                    #If there is more than one Wikipedia sentence then try pairing up two such sentences
                    #Best alignment after pairing next two Wikipedia sentences with next Simple English Wikipedia sentence
                    (similarity, alignment) = memo[(x-2, y-1)]
                    alternatives.append((similarity + sim(x-1, y) + sim(x, y), alignment + [(x-1, y), (x, y)]))

                if x >= 2 and y >= 2:
                    #If there is more than one sentence in both paragraphs then try pairing up two of each
                    #Best alignment after pairing next two Wikipedia sentences with next two Simple English Wikipedia sentences crossed up
                    (similarity, alignment) = memo[(x-2, y-2)]
                    alternatives.append((similarity + sim(x-1, y) + sim(x, y-1), alignment + [(x-1, y), (x, y-1)]))

                #Store the best alternative for this subproblem (the similarity is the first element of each similarity/alignment pair)
                memo[(x,y)] = max(alternatives, key=lambda p:p[0])

        return memo[(i,j)]
This function returns both the total similarity and the best alignment. The alignment is a list of sentence position pairs rather than the sentence pairs themselves.
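For example, assuming the hypothetical "normal_sents" and "simple_sents" lists from the "sim" sketch above, the actual sentence pairs can be recovered from the returned positions like this:

    #Recover the actual sentence pairs from the 1-based position pairs
    (total_similarity, alignment) = align(len(normal_sents), len(simple_sents))
    sentence_pairs = [(normal_sents[x-1], simple_sents[y-1]) for (x, y) in alignment]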
Once the sentences in each paragraph pair were aligned, the sentence pairs were filtered: any pair with a similarity of less than 0.5 was discarded. In all, about 10,000 article pairs yielded 75,000 paragraph pairs, which in turn yielded 137,000 aligned sentence pairs. Out of a sample of 100 sentence pairs, human judges determined that 91 of them were correct sentence simplifications. The sentence pairs were also used to train Moses, a phrase-based translation system, which then simplified test sentences with a statistically significant improvement.
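In code, this final filtering step might look as follows (again assuming the "sim" function and the alignment from the sketches above):

    #Discard aligned sentence position pairs whose similarity is below 0.5
    filtered_alignment = [(x, y) for (x, y) in alignment if sim(x, y) >= 0.5]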
The final parallel corpus is available at http://www.cs.pomona.edu/~dkauchak/simplification/.