The following features are extracted from the Portuguese text being assessed, using the tools and resources described in the paper (the list is copied directly from the paper):
- Number of words
- Number of sentences
- Number of paragraphs
- Number of verbs
- Number of nouns
- Number of adjectives
- Number of adverbs
- Number of pronouns
- Average number of words per sentence
- Average number of sentences per paragraph
- Average number of syllables per word
- Flesch index for Portuguese
- Incidence of content words
- Incidence of functional words
- Raw Frequency of content words
- Minimal frequency of content words
- Average number of verb hypernyms
- Incidence of NPs
- Number of NP modifiers
- Number of words before the main verb
- Number of high level constituents
- Number of personal pronouns
- Type-token ratio
- Pronoun-NP ratio
- Number of “e” (and)
- Number of “ou” (or)
- Number of “se” (if)
- Number of negations
- Number of logic operators
- Number of connectives
- Number of positive additive connectives
- Number of negative additive connectives
- Number of positive temporal connectives
- Number of negative temporal connectives
- Number of positive causal connectives
- Number of negative causal connectives
- Number of positive logic connectives
- Number of negative logic connectives
- Verb ambiguity ratio
- Noun ambiguity ratio
- Adverb ambiguity ratio
- Adjective ambiguity ratio
- Incidence of clauses
- Incidence of adverbial phrases
- Incidence of apposition
- Incidence of passive voice
- Incidence of relative clauses
- Incidence of coordination
- Incidence of subordination
- Out-of-vocabulary words
- LM probability of unigrams
- LM perplexity of unigrams
- LM perplexity of unigrams, without line break
- LM probability of bigrams
- LM perplexity of bigrams
- LM perplexity of bigrams, without line break
- LM probability of trigrams
- LM perplexity of trigrams
- LM perplexity of trigrams, without line break
Numbering the list above from 1, features 1 to 42 form the "Coh-Metrix-PORT" feature group and were derived from the Coh-Metrix-PORT tool, which is a Portuguese adaptation of Coh-Metrix.
The features from 43 to 49 are the "Syntactic" feature group and were added to analyse syntactic constructions which are useful for automatic simplification.
The features from 50 to 59 are the "Language model" feature group. They are derived by comparing the n-grams in the input text with n-gram frequencies in the language (Portuguese), and consist of n-gram probabilities, perplexities and a score for the words which are not listed in the system's vocabulary.
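To make the language model features more concrete, here is a minimal sketch of a unigram model that produces a log-probability, a perplexity and an out-of-vocabulary count. It is only an illustration: the function names, the add-one smoothing and the toy training sentence are my own assumptions, not the toolkit or settings used in the paper.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return (counts, total, vocabulary) from a list of training tokens."""
    counts = Counter(tokens)
    return counts, len(tokens), set(counts)

def unigram_features(tokens, counts, total, vocab):
    """Log-probability, perplexity and OOV count of `tokens` under the model.

    Assumption: add-one smoothing with one extra slot for unseen words.
    """
    v = len(vocab) + 1  # +1 reserves probability mass for unseen words
    log_prob = 0.0
    oov = 0
    for tok in tokens:
        if tok not in vocab:
            oov += 1
        # Counter returns 0 for unseen tokens, so they get probability 1/(total+v).
        log_prob += math.log2((counts[tok] + 1) / (total + v))
    perplexity = 2 ** (-log_prob / len(tokens))
    return {"log_prob": log_prob, "perplexity": perplexity, "oov": oov}

# Toy example (invented data, for illustration only).
counts, total, vocab = train_unigram("o gato come o peixe".split())
features = unigram_features("o gato".split(), counts, total, vocab)
```

Bigram and trigram versions follow the same pattern, conditioning each word on the one or two words before it.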
The features from 1 to 3 and 9 to 11 are also called the "Basic" feature group because they require no linguistic knowledge. Feature 12 forms a feature group of its own, called "Flesch".
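As a rough illustration of why the "Basic" features need no linguistic knowledge, most of them can be computed with simple string operations. The sketch below is an assumption on my part (the paper does not give code): it splits paragraphs on blank lines and sentences on end punctuation, and it leaves out syllables per word, which needs a syllabification routine.

```python
import re

def basic_features(text):
    """Count words, sentences and paragraphs and derive two averages.

    Assumptions: paragraphs are separated by blank lines and sentences
    end with '.', '!' or '?' followed by whitespace.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"\w+", text.lower())  # \w matches accented letters too
    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "sentences_per_paragraph": len(sentences) / max(len(paragraphs), 1),
    }

# Toy example (invented text, for illustration only).
feats = basic_features("O gato dorme. O cão late.\n\nA casa é azul.")
```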
Experiments were performed to check whether machine learning techniques can learn to detect the INAF reading difficulty levels, and to discover which features are most useful for complexity detection.
Three types of machine learning algorithms in Weka were evaluated:
- "Standard classification" (SMO): A standard classifier (support vector machines) which assumes that there is no relation between the difficulty levels
- "Ordinal classification" (OrdinalClassClassifier): An ordinal classifier which assumes that the difficulty levels are ordered
- "Regression" (SMO-reg): A regressor which assumes that the difficulty levels are continuous
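The difference between the three setups lies mainly in how the difficulty level is presented to the learner. The sketch below shows one way to encode the targets in each case. The ordinal part follows the usual decomposition into "is the level above threshold k?" binary problems, which is my understanding of how Weka's OrdinalClassClassifier works. The numeric coding 0, 1, 2 and all function names are assumptions for illustration.

```python
LEVELS = ["rudimentary", "basic", "advanced"]  # ordered from easiest to hardest

def nominal_target(level):
    """Standard classification: the label is an unordered category."""
    return level

def regression_target(level):
    """Regression: the label is a number on a continuous scale (assumed 0..2)."""
    return float(LEVELS.index(level))

def ordinal_targets(level):
    """Ordinal classification: K-1 binary targets, one per threshold."""
    rank = LEVELS.index(level)
    return [int(rank > k) for k in range(len(LEVELS) - 1)]

def combine_ordinal(p_greater):
    """Turn estimates of P(level > k) into one probability per level."""
    probs = [1.0 - p_greater[0]]
    for k in range(1, len(p_greater)):
        probs.append(p_greater[k - 1] - p_greater[k])
    probs.append(p_greater[-1])
    return probs

# Example: the binary models say P(> rudimentary) = 0.9 and P(> basic) = 0.3.
level_probs = combine_ordinal([0.9, 0.3])
predicted = LEVELS[level_probs.index(max(level_probs))]
```

In this example most of the probability mass ends up on the middle level, so the predicted level is "basic".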
Manually simplified corpora were used as training data for the algorithms. The original corpora were simplified to each of the lower difficulty levels. Using these simplified corpora it was possible to check whether the system could predict that the original corpora are of "advanced" difficulty level, the lightly simplified corpora are of "basic" difficulty level and the heavily simplified corpora are of "rudimentary" difficulty level.
To determine the importance of each feature, the absolute Pearson correlation between the feature's values and the corresponding difficulty levels is computed. The results for the best features are copied below from the paper (1 = highly correlated, 0 = not correlated):
- Words per sentence: 0.693
- Incidence of apposition: 0.688
- Incidence of clauses: 0.614
- Flesch index: 0.580
- Words before main verb: 0.516
- Sentences per paragraph: 0.509
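A ranking like the one above can be reproduced with a few lines of code once each text has a feature value and a numeric difficulty level. The following is a minimal sketch under my own assumptions: the levels are coded as 0, 1, 2, and the feature values are invented toy numbers, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def rank_features(feature_table, levels):
    """Sort features by absolute correlation with the difficulty level."""
    scores = {name: abs(pearson(values, levels))
              for name, values in feature_table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data (invented): three texts at levels 0, 1 and 2.
levels = [0, 1, 2]
table = {
    "words_per_sentence": [10, 15, 20],  # grows with difficulty
    "unrelated_feature": [5, 1, 5],      # no linear relation to the level
}
ranking = rank_features(table, levels)
```

The absolute value is taken because a feature that decreases as difficulty increases is just as informative as one that increases.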
To determine which machine learning algorithm is best at classifying the corpora into the correct difficulty levels using the features mentioned earlier, the Pearson correlation between the predicted levels and the true levels was measured. The experiments were also run on each of the feature sub-groups mentioned above to see how important each group was. Here are the results, summarized from the paper:
Standard classification
Feature group | Pearson correlation |
---|---|
All | 0.84 |
Language model | 0.25 |
Basic | 0.76 |
Syntactic | 0.82 |
Coh-Metrix-PORT | 0.79 |
Flesch | 0.52 |
Ordinal classification
Feature group | Pearson correlation |
---|---|
All | 0.83 |
Language model | 0.49 |
Basic | 0.73 |
Syntactic | 0.81 |
Coh-Metrix-PORT | 0.80 |
Flesch | 0.56 |
Regression
Feature group | Pearson correlation |
---|---|
All | 0.8502 |
Language model | 0.6245 |
Basic | 0.7266 |
Syntactic | 0.8063 |
Coh-Metrix-PORT | 0.8051 |
Flesch | 0.5772 |
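To compare the three tables side by side, the numbers can be put in a small dictionary and the strongest and weakest feature group picked out for each algorithm. This is just a convenience sketch. The values are the ones in the tables above and the helper name is my own.

```python
RESULTS = {
    "Standard classification": {
        "All": 0.84, "Language model": 0.25, "Basic": 0.76,
        "Syntactic": 0.82, "Coh-Metrix-PORT": 0.79, "Flesch": 0.52,
    },
    "Ordinal classification": {
        "All": 0.83, "Language model": 0.49, "Basic": 0.73,
        "Syntactic": 0.81, "Coh-Metrix-PORT": 0.80, "Flesch": 0.56,
    },
    "Regression": {
        "All": 0.8502, "Language model": 0.6245, "Basic": 0.7266,
        "Syntactic": 0.8063, "Coh-Metrix-PORT": 0.8051, "Flesch": 0.5772,
    },
}

def best_and_worst(scores):
    """Return the feature groups with the highest and lowest correlation."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst

for algorithm, scores in RESULTS.items():
    best, worst = best_and_worst(scores)
    print(f"{algorithm}: best = {best}, worst = {worst}")
```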
Using all the features together gives better results than using any sub-group of them. Among the sub-groups, the syntactic features are the most useful, followed closely by the Coh-Metrix-PORT features. The language model feature group was the weakest under both classification algorithms, although under regression the Flesch group scored lower (0.5772 against 0.6245).
The standard classification algorithm was chosen as the best because, although it is the simplest algorithm, its results are comparable to those of the other algorithms.