Code and data for my thesis "Automatic toxic comment detection in social media for Russian"

alla-g/toxicity-detection-thesis

Automatic Toxic Comment Detection in Social Media for Russian

NRU HSE, Fundamental and computational linguistics, Moscow 2022

All data collected and used in the thesis are provided in the corresponding folders and files (see the repository structure below).
Code for replicating one of the models is also provided in this fork.

Links to the trained single- and multitask BERT models:

vk data, 1 task
https://drive.google.com/uc?id=1barEeEUgEUXHHkYN-l-s2i8AZYEyTtYp
vk data, 2 tasks
https://drive.google.com/uc?id=1--iwGBQHBUXXktC9kqHllnmPzwN9wRYz
several source data, 1 task
https://drive.google.com/uc?id=1gJ1IPzpaVG81EzyyF7l9m67L_IbH4uZQ
several source data, 2 tasks
https://drive.google.com/uc?id=1Xu-4-3kYv8HCU2j7zgx84FZm778lzIKk

If the links become unavailable, feel free to contact me at alla.s.gorbunova@gmail.com
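
The links above use Google Drive's `uc?id=` form, which the `gdown` package can fetch from Python as well as from the command line. A minimal sketch, assuming `gdown` is installed; the dictionary keys, the helper names and the default output file name are illustrative and not part of this repository:

```python
# File IDs copied from the links above; the short keys are illustrative labels.
MODEL_IDS = {
    "vk_1task": "1barEeEUgEUXHHkYN-l-s2i8AZYEyTtYp",
    "vk_2tasks": "1--iwGBQHBUXXktC9kqHllnmPzwN9wRYz",
    "multi_source_1task": "1gJ1IPzpaVG81EzyyF7l9m67L_IbH4uZQ",
    "multi_source_2tasks": "1Xu-4-3kYv8HCU2j7zgx84FZm778lzIKk",
}


def model_url(name):
    """Build the Google Drive download URL for one of the models."""
    return f"https://drive.google.com/uc?id={MODEL_IDS[name]}"


def download_model(name, output="model.pt"):
    """Download the chosen model (requires `pip install gdown`).

    The output file name is an assumption; adjust it to the file
    that the link actually serves.
    """
    import gdown  # imported lazily so model_url works without gdown

    return gdown.download(model_url(name), output, quiet=False)
```

For example, `download_model("vk_2tasks")` would fetch the two-task model trained on vk data into the current directory.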

A quick example of running inference with the multitask models in Colab:

!gdown https://drive.google.com/link_from_above

# inferPipeline comes from the multi-task-NLP repository,
# which must be available on the Python path
from infer_pipeline import inferPipeline

pipe = inferPipeline(modelPath='sample_dir/model.pt',
                     maxSeqLen=128)
# each text goes in its own list
samples = [['every text is in'], ['separate list']]
# for predicting on one task:
output = pipe.infer(samples, ['ToxicityDetection'])
# for predicting on both tasks:
output = pipe.infer(samples, ['ToxicityDetection', 'DistortionDetection'])

For more details, please refer to the multi-task-NLP documentation.
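
The inference example passes each text in its own single-element list. When the comments are available as a flat list of strings, a small helper can wrap them into that shape. This is a sketch only: `to_samples` is a hypothetical name and is not part of multi-task-NLP, and dropping empty strings is a choice made here for illustration.

```python
def to_samples(texts):
    """Wrap each non-empty text in its own list, trimming outer whitespace."""
    return [[text.strip()] for text in texts if text.strip()]


comments = ["first comment", "  second comment ", ""]
samples = to_samples(comments)
print(samples)  # [['first comment'], ['second comment']]
```

The resulting `samples` list can then be passed as the first argument of `pipe.infer`, with the list of task names as the second.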

Repository structure:

├── hypothesis_testing_data  # data needed to test the hypothesis  
│   ├── uncorrected_data_NEW.tsv  # uncorrected test comments  
│   ├── corrected_data_NEW.tsv  # test comments with manual correction  
│   └── preprocessed_data_NEW.tsv  # test comments preprocessed automatically  
│  
├── preprocessing_data  # data needed for preprocessing approach  
│   ├── bad_wordlist.txt  # list of offensive, obscene and otherwise toxic words  
│   └── replacement.json  # rules for replacing cyrillic letters  
│  
├── toxicity_corpus  # folder for publishing collected distorted toxicity data  
│   ├── DATASTATEMENT.md  # data statement for the corpus  
│   └── distorted_toxicity.tsv  # corpus file  
│      
├── training_data  # train and val data and task files for training neural networks  
│   ├── ...     
│  
├── Testing models.ipynb  # notebook for first experiment  
├── Approach 1 - preprocessing.ipynb  # notebook for first approach of second experiment  
├── Approach 2 - MT BERT.ipynb  # notebook for second approach of second experiment  
├── parsing and preparing data.ipynb # code for getting and structuring data  
├── corpus analysis.ipynb # code for counting some corpus statistics  
└── README.md
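
The preprocessing approach relies on `preprocessing_data/replacement.json`, described above as rules for replacing cyrillic letters. The sketch below shows one way such rules could be applied to a comment. It assumes the file is a flat JSON object that maps a substitute string to the letter it stands for; the actual file layout may differ, and `load_rules`, `apply_rules` and the example rules are hypothetical. The notebook `Approach 1 - preprocessing.ipynb` remains the reference for the real procedure.

```python
import json


def load_rules(path):
    """Read replacement rules from a JSON file (assumed: flat str -> str object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def apply_rules(text, rules):
    """Replace every occurrence of each rule key with its value.

    Longer keys are applied first so that multi-character substitutes
    are not broken up by single-character rules.
    """
    for src in sorted(rules, key=len, reverse=True):
        text = text.replace(src, rules[src])
    return text


# Hypothetical rules for illustration only, not taken from replacement.json.
example_rules = {"0": "о", "@": "а", "3": "з"}
print(apply_rules("к0з@", example_rules))  # коза
```

Because the rules are applied one after another, a rule whose output matches another rule's key would be rewritten again; a single-pass regular expression would avoid that if it matters for the data.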
