Author: Maksim Eremeev (me@maksimeremeev.com)
Research: Konstantin Vorontsov, Maksim Eremeev
Papers:
RANLP paper, Overview in Russian
Interactive Demo: TextComplexity.net
This is a framework for testing and experimenting with complexity measures, and for building and saving models fitted on various reference collections. The library provides efficient parallel processing of reference collections.

Requirements:
- python >= 3.6
- numpy
- nltk
- pymorphy2
- multiprocessing
The framework supports Python 3.6+. To build and install it, run:
```bash
python setup.py build
python setup.py install
```
- Implement the basic `ComplexityModel` class
- Parallelization of the `fit` method
- Letter, Syllable, and Word Tokenizers for Russian
- Distance-based ComplexityFunction
- Morphological and Lexical complexity models
- Counter-based ComplexityFunction
- Counter-based models
- Adaptation of morphological models for English
- Syntax models based on UdPipe
- Making preprocessing more flexible
- `setup.py` and testing on Ubuntu, OSX
- Publishing the Open-Source framework

==== You are here ====

- Publishing the `ComplexityPipeline` implementation to fit the aggregated complexity model
- Publishing of distributions for all proposed models and validation data
- Enhancement of model weights ... (TBD)
- `complexity` - main module to import
- `tokenizers` - implementations of the most common tokenizers
- `functions` - implementations of the most common complexity functions
- `data` - all data used for experimenting
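Assuming the package exposes `ComplexityModel` from the main `complexity` module (an assumption; check the actual layout of your installation), a typical session might start with:

```python
# Hypothetical import -- adjust to the actual module layout of your installation.
from complexity import ComplexityModel
```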
`ComplexityModel` uses a reference collection to build empirical distributions. The reference collection has to be provided in a strictly fixed format:

- Each document of the collection must be saved in a separate `.txt` file. The name of the file does not matter.
- All files containing documents of the reference collection must be stored in a single directory.
- There must be no empty `.txt` files.
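For illustration, here is a minimal sketch of preparing a reference collection in this layout (the directory name and sample documents are made up for the example):

```python
from pathlib import Path

# Hypothetical documents; in practice these are the texts of your reference corpus.
documents = [
    "First document of the reference collection.",
    "Second document of the reference collection.",
]

corpus_dir = Path("reference_corpus")  # any single directory works
corpus_dir.mkdir(exist_ok=True)

for i, text in enumerate(documents):
    if not text.strip():
        continue  # skip empty texts: empty .txt files are not allowed
    (corpus_dir / f"doc_{i}.txt").write_text(text, encoding="utf-8")
```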
A complexity model is a combination of two entities: a Tokenizer and a ComplexityFunction. Both are passed into the constructor of the model.

A Tokenizer is an instance of any class that implements a `tokenize` method. `tokenize(text)` takes a single argument, `text`, a string corresponding to a single document, and returns the list of tokens in the order they appear in the given text. If the text should be preprocessed in some way, the preprocessing steps have to be implemented in the `tokenize` method.
Example:
```python
class Tokenizer:
    def tokenize(self, text):
        return text.split()
```
A ComplexityFunction is an instance of a class with a single required method, `complexity`. `complexity(tokens)` takes the output of the `tokenize` method, i.e. the list of tokens in the order they appear in the source text, and returns a list of complexity scores, one per token, in the same order.
Example:

```python
class ComplexityFunction:
    def complexity(self, tokens):
        return [len(token) for token in tokens]
```
Init
`ComplexityModel` init options:

- `tokenizer` - Tokenizer instance
- `complexity_function` - ComplexityFunction instance
- `alphabet` - `'full'` if the alphabet consists of more than one token, `'reduced'` otherwise. Default: `'full'`

Returns: model instance
Example:
```python
tokenizer = Tokenizer()
complexity_function = ComplexityFunction()
cm = ComplexityModel(tokenizer, complexity_function, alphabet='reduced')
```
Fit
`fit(reference_corpus, n_jobs=4, use_preproc=True, use_stem=True, use_lemm=False, check_length=True, check_stopwords=True)`

- `reference_corpus` - path to the directory with the documents of the reference collection. Each document must be stored in a separate `*.txt` file.
- `n_jobs` - number of processes used to process the collection. Default: 4
- `use_preproc` - flag indicating whether to preprocess the reference collection documents before tokenizing. Default: True
- `use_stem` - flag indicating whether to use stemming when preprocessing the reference collection documents. Default: True
- `use_lemm` - flag indicating whether to use lemmatization when preprocessing the reference collection documents. Default: False
- `check_length` - flag indicating whether to filter out all words shorter than 3 symbols when preprocessing the reference collection documents. Default: True
- `check_stopwords` - flag indicating whether to filter out stopwords when preprocessing the reference collection documents. Default: True

Returns nothing.
`fit` uses `multiprocessing` to process the documents of the reference collection in parallel.
Example:
```python
cm.fit('/wikipedia', n_jobs=10, use_preproc=False, use_stem=False, use_lemm=False, check_length=False, check_stopwords=False)
```
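For intuition, here is a minimal sketch of how such parallel processing over the reference collection could be organized with `multiprocessing.Pool`. This is an illustration only, not the library's actual implementation; `process_document` and `process_collection` are hypothetical helpers:

```python
import glob
from multiprocessing import Pool

def process_document(path):
    # Hypothetical per-document work: read and tokenize one file
    # (a stand-in for the tokenization and complexity scoring done in fit).
    with open(path, encoding="utf-8") as f:
        return f.read().split()

def process_collection(reference_corpus, n_jobs=4):
    # Process every .txt document of the reference collection in parallel.
    paths = glob.glob(f"{reference_corpus}/*.txt")
    with Pool(n_jobs) as pool:
        return pool.map(process_document, paths)
```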
Predict
`predict(texts, gamma=0.95, weights='mean', p=1, use_preproc=True, use_stem=True, use_lemm=False, check_length=True, check_stopwords=True, exp_weights=False, weights_min_shift=False, normalize=False, return_token_complexities=False)`

- `texts` - texts to estimate complexity scores for
- `gamma` - quantile indicator. Default: 0.95
- `weights` - type of weights to use when computing the score. One of the following options: `'mean'`, `'total'`, `'excessive'`, `'excessive_mean'`. Default: `'mean'`
- `p` - power of the weights. Default: 1
- `use_preproc` - flag indicating whether to preprocess the text before tokenizing. Must align with the same parameter value used for fitting. Default: True
- `use_stem` - flag indicating whether to use stemming when preprocessing the text. Must align with the same parameter value used for fitting. Default: True
- `use_lemm` - flag indicating whether to use lemmatization when preprocessing the text. Must align with the same parameter value used for fitting. Default: False
- `check_length` - flag indicating whether to filter out the words shorter than 3 symbols when preprocessing the text. Must align with the same parameter value used for fitting. Default: True
- `check_stopwords` - flag indicating whether to filter out the stopwords when preprocessing the text. Must align with the same parameter value used for fitting. Default: True
- `exp_weights` - flag indicating whether to apply an exponential transformation to the weights. Default: False
- `weights_min_shift` - flag indicating whether to subtract the minimum value from the weights. Default: False
- `normalize` - flag indicating whether to normalize the weights. Default: False
- `return_token_complexities` - flag indicating whether to return per-token complexity scores along with the overall text complexity score. Default: False

Returns a list of scores for the texts provided.
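For example, a minimal usage sketch for a fitted model (the sample text below is made up):

```python
texts = ["An example sentence whose complexity we want to estimate."]
scores = cm.predict(texts, gamma=0.95, weights='mean')
print(scores)  # one complexity score per input text
```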
All of the following models are described in the papers listed above.
- `models/letters` - distance-based morphological model
  - tokens: letters
  - complexity: distance
- `models/lexical-distance` - distance-based lexical model
  - tokens: words
  - complexity: distance
- `models/lexical-counter` - counter-based lexical model
  - tokens: words
  - complexity: number of occurrences in the reference collection
- `models/lexical-length` - counter-based lexical model
  - tokens: words
  - complexity: length of the word
- `models/ru-syllab` - distance-based morphological model for Russian
  - tokens: syllables
  - complexity: distance
- `models/ru-syllab-sorted` - distance-based morphological model for Russian
  - tokens: sorted syllables
  - complexity: distance
- `models/en-syllab` - distance-based morphological model for English
  - tokens: syllables
  - complexity: distance
- `models/en-syllab-sorted` - distance-based morphological model for English
  - tokens: sorted syllables
  - complexity: distance
- `models/syntax-length` - counter-based syntactic model
  - tokens: sentences
  - complexity: maximum length of the syntactic dependency
- `models/syntax-pos` - distance-based syntactic model
  - tokens: syntagms
  - complexity: distance
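As an illustration, the `models/lexical-length` configuration can be approximated with the interfaces described above. This is a sketch, not the shipped model; it assumes `ComplexityModel` has been imported, and the reference corpus path is a placeholder:

```python
class WordTokenizer:
    def tokenize(self, text):
        # Word tokens, as in the lexical models.
        return text.split()

class LengthComplexity:
    def complexity(self, tokens):
        # Complexity of a word is its length, as in models/lexical-length.
        return [len(token) for token in tokens]

cm = ComplexityModel(WordTokenizer(), LengthComplexity())
cm.fit('reference_corpus')  # placeholder path to the reference collection
print(cm.predict(["A sample text to score."]))
```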
```bibtex
@inproceedings{eremeev19ranlp,
    title={Lexical Quantile-Based Text Complexity Measure},
    author={M. A. Eremeev and Konstantin Vorontsov},
    booktitle={RANLP},
    year={2019}
}
```