Analysis of feature importance for genre identification through data transformation
In this task, I analyse how important different linguistic features are for automatic (web) genre identification (AGI) by comparing the performance of machine learning models trained on various text representations. This approach makes it possible to see to what extent lexical, grammatical and other features contribute to the identification of genre.
I perform text classification with the linear model fastText. For the experiments, I use the Slovene Web genre identification corpus GINCO 1.0, which consists of 1,002 texts manually annotated with 24 genre labels.
I train and test the fastText model on the following text representations:
- baseline: plain text as extracted from the web during the creation of web corpora (used in previous experiments)
- pre-processed: lower-cased, punctuation removed, numbers removed (a sketch follows below the list)
- reduced to lemmas
- transformed into part-of-speech tags: part-of-speech tags (upos) and morphosyntactic descriptors (MSD)
- transformed into syntactic dependencies
- reduced: consisting only of the words belonging to a certain word class, e.g. only nouns, only verbs, only adjectives, etc.
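As a rough illustration of the pre-processing step above, here is a minimal sketch; the exact implementation is in the notebooks, and this simplified version only handles ASCII punctuation:

```python
import string


def preprocess(text, lowercase=True, remove_punct=True, remove_numbers=True):
    """Simplified pre-processing used for the 'pre-processed' setups."""
    if lowercase:
        text = text.lower()
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = "".join(ch for ch in text if not ch.isdigit())
    return " ".join(text.split())  # normalise whitespace
```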
The setups are compared based on micro and macro F1 scores, which measure the models' performance at the instance level and the label level respectively, and on confusion matrices.
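A minimal sketch of this evaluation, assuming y_true and y_pred are lists of genre labels (the actual evaluation code is in the notebooks):

```python
from sklearn.metrics import confusion_matrix, f1_score


def evaluate(y_true, y_pred, labels):
    """Micro F1 (instance level), macro F1 (label level) and the confusion matrix."""
    return {
        "micro_f1": f1_score(y_true, y_pred, labels=labels, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, labels=labels, average="macro"),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=labels),
    }
```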
The fastText model was used because it achieves the best results on this task when compared with other common classifiers. The comparison was performed on the baseline text, with the other classifiers trained on a TF-IDF representation. In the table below, the classifiers are ordered by macro F1 score.
| Model | Micro F1 | Macro F1 |
|---|---|---|
| Dummy Classifier - Most Frequent | 0.241 | 0.078 |
| Dummy Classifier - Stratified | 0.27 | 0.221 |
| Support Vector Machine (SVC) | 0.489 | 0.333 |
| Multinomial Naive Bayes classifier | 0.518 | 0.342 |
| Decision Tree | 0.34 | 0.35 |
| Logistic Regression | 0.518 | 0.383 |
| Random Forest classifier | 0.511 | 0.408 |
| Complement Naive Bayes classifier | 0.539 | 0.416 |
| fastText | 0.56 | 0.589 |
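For orientation, a minimal sketch of the two kinds of setups compared in the table: fastText trained directly on the labelled text versus a scikit-learn classifier on TF-IDF features. The file name, the choice of logistic regression as the example classifier, and the train_texts/train_labels/test_texts variables are illustrative; the actual training code and hyperparameters are in the notebooks.

```python
import fasttext
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# fastText expects one document per line, prefixed with the label, e.g. "__label__News ..."
ft_model = fasttext.train_supervised(input="train.txt", epoch=350)
predicted_labels, probabilities = ft_model.predict("Some web text to classify ...")

# The other classifiers are trained on TF-IDF vectors of the same texts
tfidf_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
tfidf_clf.fit(train_texts, train_labels)        # lists of raw texts and genre labels
tfidf_predictions = tfidf_clf.predict(test_texts)
```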
See the notebook 1-Preparing_Data_Hyperparameter_Search.ipynb, where I searched for the best hyperparameters for the task, and 2-Language-Processing-of-GINCO.ipynb, where I linguistically preprocessed the data with the CLASSLA pipeline.
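A minimal sketch of how the linguistic representations can be obtained with CLASSLA, assuming the standard Slovene models; the exact processing is in 2-Language-Processing-of-GINCO.ipynb:

```python
import classla

# classla.download("sl")  # run once to download the Slovene models
nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse")


def linguistic_representations(text):
    """Token-level representations of a text: lemmas, UPOS, MSD (xpos) and dependency relations."""
    doc = nlp(text)
    words = [w for sentence in doc.sentences for w in sentence.words]
    return {
        "lemmas": " ".join(w.lemma for w in words),
        "upos": " ".join(w.upos for w in words),
        "msd": " ".join(w.xpos for w in words),
        "dependencies": " ".join(w.deprel for w in words),
    }
```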
Data:
- GINCO corpus with the "keep" texts (reason: this gives more text than using only the deduplicated paragraphs, while certain manually annotated duplicates are omitted because they can be unrepresentative of the genre type)
- reduced label set: starting from the downsampled 12-label set, labels with too few instances, the fuzzy labels (Other, List of Summaries/Excerpts) and texts marked as Hard were discarded --> 5 labels, 688 texts
- original stratified train-dev-test split (60:20:20): 410:141:137
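The split is the original one provided with the corpus; for reference, a comparable stratified 60:20:20 split could be produced with scikit-learn roughly like this (df and the "label" column are illustrative names):

```python
from sklearn.model_selection import train_test_split

# df: one row per text, with the genre in the "label" column
train_df, rest_df = train_test_split(df, test_size=0.4, stratify=df["label"], random_state=42)
dev_df, test_df = train_test_split(rest_df, test_size=0.5, stratify=rest_df["label"], random_state=42)
```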
Preliminary experiments:
- Optimising fastText: hyperparameter search on the dev split --> average micro and macro F1 scores of 0.625 +/- 0.0036 and 0.618 +/- 0.003
- Experiments on the number of epochs --> 350 epochs used (a sketch follows below the list)
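A rough sketch of the epoch experiments: train on the train split with different epoch values and score on the dev split. The epoch values, file name, and dev_texts/dev_labels variables are illustrative (dev_labels are assumed to carry the __label__ prefix); the actual search is in 1-Preparing_Data_Hyperparameter_Search.ipynb.

```python
import fasttext
from sklearn.metrics import f1_score

best_epoch, best_macro = None, 0.0
for epoch in [50, 100, 200, 350, 500]:
    model = fasttext.train_supervised(input="train.txt", epoch=epoch)
    predictions = [model.predict(text)[0][0] for text in dev_texts]
    macro = f1_score(dev_labels, predictions, average="macro")
    if macro > best_macro:
        best_epoch, best_macro = epoch, macro
print(best_epoch, best_macro)
```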
Main conclusions:
- the best textual representation is syntactic dependencies.
- some genre labels favour lexical representations, while others, such as Forum, are better classified with grammatical representations
For more figures regarding the results, see the results folder. The notebook for analysing the results is 6-Result_Analysis.ipynb.
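A sketch of plotting a per-label confusion matrix with scikit-learn (not necessarily how the figures in the results folder were produced; y_true, y_pred and LABELS are assumed to come from the evaluation above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_true, y_pred, labels=LABELS, xticks_rotation=45)
plt.tight_layout()
plt.savefig("results/confusion-matrix.png")
```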
Results (details):
- baseline text: micro F1: 0.56 +/- 0.0, macro F1: 0.589 +/- 0.0
- lower-cased: micro F1: 0.553 +/- 0.0045, macro F1: 0.587 +/- 0.009 - slightly lower results
- punctuation removed: micro F1: 0.58 +/- 0.0028, macro F1: 0.616 +/- 0.0024 - improved results, especially for Forum (see the graph)
- numbers removed: micro F1: 0.583 +/- 0.0028, macro F1: 0.595 +/- 0.0025 - slight improvement, except in Forum, where it is worse
- lower-cased, punctuation removed, numbers removed: micro F1: 0.56 +/- 0.0, macro F1: 0.598 +/- 0.0 - no improvement at the micro level and only a very slight improvement at the macro level - an improvement for Forum, otherwise mostly none
- lower-cased, punctuation removed, numbers removed, stopwords removed: micro F1: 0.596 +/- 0.0, macro F1: 0.597 +/- 0.00029 - improvement, more in micro than macro
- lemmas: micro F1: 0.597 +/- 0.0053, macro F1: 0.601 +/- 0.0035 - a significant improvement over the baseline, especially for Information/Explanation and Promotion; no change for News, worse for Forum and Opinion
- part-of-speech tags (upos): micro F1: 0.54 +/- 0.0053, macro F1: 0.547 +/- 0.0056 - a decrease overall, but an increase for News and Opinion
- morphosyntactic descriptors (MSD): micro F1: 0.563 +/- 0.0072, macro F1: 0.536 +/- 0.019 - an increase in micro, a decrease in macro; improvement for News and Information, high variation for Forum
- syntactic dependencies: micro F1: 0.61 +/- 0.0, macro F1: 0.639 +/- 0.00044 - the best results, a large improvement, especially for News, Forum and Opinion; a decrease for Promotion
Reduced features (lemmas of the selected PoS classes kept, all other words replaced with O; see the sketch after this list):
- only open-class words, i.e. stopwords removed (ADP, AUX, CCONJ, SCONJ, DET, NUM, PART and PRON): micro F1: 0.563 +/- 0.0072, macro F1: 0.535 +/- 0.015 - a decrease in macro, a slight increase in micro - stopwords do not have a big impact - a large decrease for Forum
- only stopwords: micro F1: 0.526 +/- 0.0053, macro F1: 0.559 +/- 0.0067 - a decrease, especially in micro - the same result for Forum, a decrease for News and Information
- only classes that denote subjectivity (ADJ, ADV, PART): micro F1: 0.468 +/- 0.009, macro F1: 0.408 +/- 0.019 - much lower results, a large decrease for Forum, but slightly better results for Opinion
- only PROPN, NOUN and VERB: micro F1: 0.496 +/- 0.0078, macro F1: 0.439 +/- 0.015 - a decrease overall; an increase for Information/Explanation, a decrease for the other labels, especially large for Forum and Opinion
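A minimal sketch of building these reduced-feature representations from the CLASSLA output: the lemmas of the selected PoS classes are kept and every other token is replaced with O (doc is a CLASSLA document as in the sketch above; the set of kept tags changes per setup):

```python
def reduce_to_pos(doc, keep_tags):
    """Keep the lemmas of words whose UPOS tag is in keep_tags, replace the rest with 'O'."""
    tokens = []
    for sentence in doc.sentences:
        for word in sentence.words:
            tokens.append(word.lemma if word.upos in keep_tags else "O")
    return " ".join(tokens)

# e.g. the subjectivity setup:
# reduce_to_pos(nlp(text), keep_tags={"ADJ", "ADV", "PART"})
```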
Alternative representations without context (window = 1) - only very slight differences:
- baseline text: micro F1: 0.559 +/- 0.0028, macro F1: 0.588 +/- 0.002 - practically the same as the original baseline
- lemmas: micro F1: 0.597 +/- 0.0053, macro F1: 0.602 +/- 0.0039
- part-of-speech tags (upos): micro F1: 0.546 +/- 0.0078, macro F1: 0.555 +/- 0.012
- morphosyntactic descriptors (MSD): micro F1: 0.566 +/- 0.0069, macro F1: 0.539 +/- 0.014
- syntactic dependencies: micro F1: 0.609 +/- 0.0028, macro F1: 0.637 +/- 0.0026
Additional experiment - on all 12 labels (primary_level_3), all 1002 texts:
- baseline: micro F1: 0.425 +/- 0.0043, macro F1: 0.273 +/- 0.005
- dependencies: micro F1: 0.48 +/- 0.0018, macro F1: 0.337 +/- 0.018 - improved results
To compare fastText's performance with that of Transformer models, I trained and tested the base-sized XLM-RoBERTa model on the baseline text.
During the hyperparameter search, I searched for the optimal number of epochs, which turned out to be 13. The hyperparameters I used are the following:
```python
args = {
    "overwrite_output_dir": True,
    "num_train_epochs": 13,
    "train_batch_size": 8,
    "learning_rate": 1e-5,
    "labels_list": LABELS,
    "max_seq_length": 512,
    "save_steps": -1,
    # Only the trained model will be saved - to prevent filling all of the space
    "save_model_every_epoch": False,
    "wandb_project": "GINCO-hyperparameter-search",
    "silent": True,
}
```
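These args are passed to simpletransformers; a minimal sketch of the training step, assuming train_df is a dataframe with the usual text and labels columns and LABELS holds the 5 genre labels:

```python
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta", "xlm-roberta-base",
    num_labels=len(LABELS),
    args=args,
    use_cuda=True,
)
model.train_model(train_df)
```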
The trained model was saved as a Wandb artifact and can be loaded as follows:
```python
import wandb
from simpletransformers.classification import ClassificationModel

run = wandb.init()
# Load the saved model
artifact = run.use_artifact('tajak/GINCO-hyperparameter-search/GINCO-5-labels-classifier:v0', type='model')
artifact_dir = artifact.download()

# Loading a local save
model = ClassificationModel("xlmroberta", "artifacts/GINCO-5-labels-classifier:v0")
```
Results on the dev split: macro F1: 0.82, micro F1: 0.818
Results on the test split: macro F1: 0.813, micro F1: 0.816