Geeky is Awesome: Summary of research paper "Readability Assessment for Text Simplification" by Sandra Aluisio, Lucia Specia, Caroline Gasperin and Carolina Scarton

This is a summary of the research paper http://aclweb.org/anthology-new/W/W10/W10-1001.pdf. The paper describes a program which classifies a text into one of three reading difficulty levels: "rudimentary", "basic" and "advanced". These levels are defined by INAF, the National Indicator of Functional Literacy (Portuguese document at http://www.ibope.com.br/ipm/relatorios/relatorio_inaf_2009.pdf). The aim is to use the program to help authors detect parts of their writing which can be simplified. The program is designed to work with Portuguese rather than English.

The following features are extracted from the Portuguese text being assessed, using various tools and resources described in the paper (the list is copied directly from the paper):

1. Number of words
2. Number of sentences
3. Number of paragraphs
4. Number of verbs
5. Number of nouns
6. Number of adjectives
7. Number of adverbs
8. Number of pronouns
9. Average number of words per sentence
10. Average number of sentences per paragraph
11. Average number of syllables per word
12. Flesch index for Portuguese
13. Incidence of content words
14. Incidence of functional words
15. Raw Frequency of content words
16. Minimal frequency of content words
17. Average number of verb hypernyms
18. Incidence of NPs
19. Number of NP modifiers
20. Number of words before the main verb
21. Number of high level constituents
22. Number of personal pronouns
23. Type-token ratio
24. Pronoun-NP ratio
25. Number of “e” (and)
26. Number of “ou” (or)
27. Number of “se” (if)
28. Number of negations
29. Number of logic operators
30. Number of connectives
31. Number of positive additive connectives
32. Number of negative additive connectives
33. Number of positive temporal connectives
34. Number of negative temporal connectives
35. Number of positive causal connectives
36. Number of negative causal connectives
37. Number of positive logic connectives
38. Number of negative logic connectives
39. Verb ambiguity ratio
40. Noun ambiguity ratio
41. Adverb ambiguity ratio
42. Adjective ambiguity ratio
43. Incidence of clauses
44. Incidence of adverbial phrases
45. Incidence of apposition
46. Incidence of passive voice
47. Incidence of relative clauses
48. Incidence of coordination
49. Incidence of subordination
50. Out-of-vocabulary words
51. LM probability of unigrams
52. LM perplexity of unigrams
53. LM perplexity of unigrams, without line break
54. LM probability of bigrams
55. LM perplexity of bigrams
56. LM perplexity of bigrams, without line break
57. LM probability of trigrams
58. LM perplexity of trigrams
59. LM perplexity of trigrams, without line break
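
To give a feel for the simplest features in the list (the counts and averages that need no linguistic resources), here is a minimal Python sketch. It is not the paper's implementation: the function name `basic_features`, the naive sentence and paragraph splitting rules and the sample text are all assumptions made for illustration, and the type-token ratio is computed over all words, which may differ from how the tool used in the paper defines it.

```python
import re

def basic_features(text):
    """Count-based features that need no linguistic resources (illustrative only)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for p in paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p.strip()) if s]
    # Words are runs of letters, including accented Portuguese letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / len(sentences),
        "sentences_per_paragraph": len(sentences) / len(paragraphs),
        "type_token_ratio": len(set(words)) / len(words),
    }

# Made-up sample text: two paragraphs, three sentences.
sample = "O gato dorme. O gato come peixe.\n\nA menina lê um livro."
features = basic_features(sample)
for name, value in features.items():
    print(name, value)
```

The sketch assumes non-empty input; a real tool would also need a proper tokenizer and sentence splitter for Portuguese.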

The features from 1 to 42 are the "Coh-Metrix-PORT" feature group and were derived from the Coh-Metrix-PORT tool, which is a Portuguese adaptation of Coh-Metrix.

The features from 43 to 49 are the "Syntactic" feature group and were added to analyse syntactic constructions which are useful for automatic simplification.

The features from 50 to 59 are the "Language model" feature group. They are derived by comparing n-grams in the input text with n-gram frequencies in the language (Portuguese), and consist of n-gram probabilities and perplexities together with a score for any words which are not listed in the system's vocabulary.
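
As a rough illustration of what language model probability, perplexity and out-of-vocabulary counts mean, here is a toy unigram model in Python. This summary does not say which language modelling tool or smoothing method the paper used, so the add-one smoothing, the function names `train_unigram` and `score`, and the tiny reference text are all assumptions.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Build an add-one smoothed unigram model from reference tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob, set(counts)

def score(tokens, prob, vocab):
    """Return (log probability, perplexity, out-of-vocabulary count) for a text."""
    log_prob = sum(math.log(prob(w)) for w in tokens)
    perplexity = math.exp(-log_prob / len(tokens))
    oov = sum(1 for w in tokens if w not in vocab)
    return log_prob, perplexity, oov

# Made-up reference text standing in for a large Portuguese corpus.
prob, vocab = train_unigram("o gato come o peixe".split())

seen = score(["o", "gato"], prob, vocab)
unseen = score(["o", "cachorro"], prob, vocab)
print(seen)
print(unseen)
```

A text containing words the model has not seen gets a higher perplexity, which is why these scores can hint at how unusual the vocabulary of a text is.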

The features from 1 to 3 and 9 to 11 are also called the "Basic" feature group because they require no linguistic knowledge. Feature 12 is also put into a feature group on its own called "Flesch".
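
The grouping by feature number can be written down as a small lookup table. This is only a sketch of the numbering described here; the names `FEATURE_GROUPS` and `select` are made up, and the paper does not necessarily store features this way.

```python
# Feature numbers are 1-based, as in the groups described in the text.
FEATURE_GROUPS = {
    "Coh-Metrix-PORT": list(range(1, 43)),   # features 1 to 42
    "Syntactic": list(range(43, 50)),        # features 43 to 49
    "Language model": list(range(50, 60)),   # features 50 to 59
    "Basic": [1, 2, 3, 9, 10, 11],
    "Flesch": [12],
}

def select(feature_vector, group):
    """Pick one group's values out of a full 59-element feature vector."""
    return [feature_vector[i - 1] for i in FEATURE_GROUPS[group]]

# Dummy vector whose values equal the feature numbers, to show the selection.
vector = list(range(1, 60))
print(select(vector, "Basic"))
```

Note that "Basic" and "Flesch" are subsets of the Coh-Metrix-PORT group rather than additional features.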

Experiments were performed to check whether machine learning techniques can learn to detect the INAF reading difficulty levels, and to discover which features are best for detecting complexity.

Three types of machine learning algorithms in Weka were evaluated:

"Standard classification" (SMO): A standard classifier (support vector machines) which assumes that there is no relation between the difficulty levels
"Ordinal classification" (OrdinalClassClassifier): An ordinal classifier which assumes that the difficulty levels are ordered
"Regression" (SMO-reg): A regressor which assumes that the difficulty levels are continuous

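The difference between the three views of the difficulty levels can be sketched in Python by looking at how the target is encoded in each case. The ordinal part follows Frank and Hall's decomposition into "is the text harder than level k?" questions, which is the approach Weka's ordinal meta-classifier is documented as using; the paper summary does not go into these details, so treat the functions below (`ordinal_targets`, `combine`, `regression_target`) and the example probabilities as illustrative assumptions.

```python
LEVELS = ["rudimentary", "basic", "advanced"]

def ordinal_targets(level):
    """Encode a level as K-1 binary answers to 'is the text harder than level k?'."""
    rank = LEVELS.index(level)
    return [int(rank > k) for k in range(len(LEVELS) - 1)]

def combine(p_greater):
    """Turn estimates of P(level > k) into one probability per level."""
    probs = [1.0 - p_greater[0]]
    for k in range(1, len(p_greater)):
        probs.append(p_greater[k - 1] - p_greater[k])
    probs.append(p_greater[-1])
    return probs

def regression_target(level):
    """For regression the level is treated as a point on a continuous scale."""
    return float(LEVELS.index(level))

print(ordinal_targets("basic"))       # binary targets for the two sub-problems
print(regression_target("advanced"))  # numeric target for the regressor

# Made-up outputs of the two binary classifiers for one text:
# P(harder than rudimentary) = 0.9, P(harder than basic) = 0.2.
probs = combine([0.9, 0.2])
predicted = LEVELS[probs.index(max(probs))]
print(probs, predicted)
```

A standard classifier, by contrast, would simply treat the three level names as unrelated labels.
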
Manually simplified corpora were used as training data for the algorithms. Each original text was manually simplified both lightly and heavily, giving one version per difficulty level. Using these corpora it was possible to check whether the system could predict that the original texts are of "advanced" difficulty level, the lightly simplified texts are of "basic" difficulty level and the heavily simplified texts are of "rudimentary" difficulty level.

In order to determine the importance of each feature, the absolute Pearson correlation between the feature's values and the difficulty levels of the texts was computed. The results for the best features, copied from the paper, are below (1 = highly correlated, 0 = not correlated):

Words per sentence: 0.693
Incidence of apposition: 0.688
Incidence of clauses: 0.614
Flesch index: 0.580
Words before main verb: 0.516
Sentences per paragraph: 0.509
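
For reference, this is how an absolute Pearson correlation between one feature and the difficulty level could be computed in Python. The feature values and levels below are made up for illustration and are not data from the paper; levels are coded as 0 = rudimentary, 1 = basic, 2 = advanced, which is also an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: average words per sentence for six texts and their levels.
words_per_sentence = [8.0, 10.0, 14.0, 15.0, 21.0, 24.0]
level = [0, 0, 1, 1, 2, 2]

# The absolute value is used so that features which fall as difficulty
# rises count as just as informative as features which rise with it.
importance = abs(pearson(words_per_sentence, level))
print(round(importance, 3))
```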

In order to determine which machine learning algorithm is best at classifying the texts into the correct difficulty levels using the features mentioned earlier, the Pearson correlation between each algorithm's predictions and the true levels was measured. The experiments were also run using each of the feature sub-groups mentioned above on its own, to see how important each group was. Here are the results, summarized from the paper:

Standard classification

Feature group	Pearson correlation
All	0.84
Language model	0.25
Basic	0.76
Syntactic	0.82
Coh-Metrix-PORT	0.79
Flesch	0.52

Ordinal classification

Feature group	Pearson correlation
All	0.83
Language model	0.49
Basic	0.73
Syntactic	0.81
Coh-Metrix-PORT	0.8
Flesch	0.56

Regression

Feature group	Pearson correlation
All	0.8502
Language model	0.6245
Basic	0.7266
Syntactic	0.8063
Coh-Metrix-PORT	0.8051
Flesch	0.5772

It is clear that it is better to use all the features together than any sub-group of them. Among the sub-groups, the syntactic features, followed by the Coh-Metrix-PORT features, are the most useful. The language model feature group was the worst for standard and ordinal classification, although with regression the Flesch group scored lower still.
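
The comparison can be checked mechanically by sorting the feature groups for each learner. The numbers in this Python sketch are the ones from the three tables above; the names `RESULTS` and `ranking` are made up for the example.

```python
# Pearson correlations copied from the three result tables.
RESULTS = {
    "Standard classification": {
        "All": 0.84, "Language model": 0.25, "Basic": 0.76,
        "Syntactic": 0.82, "Coh-Metrix-PORT": 0.79, "Flesch": 0.52,
    },
    "Ordinal classification": {
        "All": 0.83, "Language model": 0.49, "Basic": 0.73,
        "Syntactic": 0.81, "Coh-Metrix-PORT": 0.8, "Flesch": 0.56,
    },
    "Regression": {
        "All": 0.8502, "Language model": 0.6245, "Basic": 0.7266,
        "Syntactic": 0.8063, "Coh-Metrix-PORT": 0.8051, "Flesch": 0.5772,
    },
}

def ranking(scores):
    """Feature groups ordered from highest to lowest correlation."""
    return sorted(scores, key=scores.get, reverse=True)

for algorithm, scores in RESULTS.items():
    print(algorithm, "->", ", ".join(ranking(scores)))
```

Sorting this way makes it easy to read off the best and worst feature group for each learner separately.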

The standard classification algorithm was chosen as the best because, although it is the simplest algorithm, its results are comparable to those of the other algorithms.

Saturday, June 23, 2012