NRU HSE, Fundamental and computational linguistics, Moscow 2022
All collected and utilized data is provided in the corresponding folders and files (see detailed structure below).
Code for replicating one of the models is also provided in this fork.
vk data, 1 task
vk data, 2 tasks
data from several sources, 1 task
data from several sources, 2 tasks
In case links become unavailable, feel free to contact me on
from infer_pipeline import inferPipeline  # module from the multi-task-NLP repository

pipe = inferPipeline(modelPath = 'sample_dir/',
                     maxSeqLen = 128)
# for predicting on one task:
output = pipe.infer([['every text is in'], ['separate list']],
                    ['ToxicityDetection'])
# for predicting on both tasks:
output = pipe.infer([['every text is in'], ['separate list']],
                    ['ToxicityDetection', 'DistortionDetection'])
For more details, please refer to the multi-task-NLP documentation.
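The inference call above expects every text wrapped in its own list. As a minimal sketch, a flat list of comments could be brought into that shape as follows; `to_samples` is a hypothetical helper name and is not part of multi-task-NLP.

```python
from typing import Iterable, List


def to_samples(texts: Iterable[str]) -> List[List[str]]:
    """Wrap each text in its own single-element list."""
    return [[text] for text in texts]


# Flat list of comments, e.g. read from one of the .tsv files.
comments = ['every text is in', 'separate list']
samples = to_samples(comments)
# `samples` now has the nested shape used as the first argument of pipe.infer.
```

The result can be passed to `pipe.infer` together with the list of task names.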
├── hypothesis_testing_data # data needed to test the hypothesis
│ ├── uncorrected_data_NEW.tsv # uncorrected test comments
│ ├── corrected_data_NEW.tsv # test comments with manual correction
│ └── preprocessed_data_NEW.tsv # test comments preprocessed automatically
├── preprocessing_data # data needed for preprocessing approach
│ ├── bad_wordlist.txt # list of offensive, obscene and otherwise toxic words
│ └── replacement.json # rules for replacing Cyrillic letters
├── toxicity_corpus # folder for publishing collected distorted toxicity data
│ ├── # data statement for the corpus
│ └── distorted_toxicity.tsv # corpus file
├── training_data # train and val data and task files for training neural networks
│ ├── ...
├── Testing models.ipynb # notebook for first experiment
├── Approach 1 - preprocessing.ipynb # notebook for first approach of second experiment
├── Approach 2 - MT BERT.ipynb # notebook for second approach of second experiment
├── parsing and preparing data.ipynb # code for getting and structuring data
├── corpus analysis.ipynb # code for counting some corpus statistics
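
The `preprocessing_data` folder above holds a word list and a set of letter replacement rules. The following is only a rough sketch of how such files might be combined, not the code from the notebooks. It assumes that the rules map a substituted symbol to the Cyrillic letter it stands for and that the word list contains one word per line; the actual formats of `replacement.json` and `bad_wordlist.txt` may differ. The inline sample rules, the sample word, and the function names are hypothetical.

```python
import json
import re

# Assumed rule format: substituted symbol -> Cyrillic letter (illustrative values).
RULES_JSON = '{"@": "а", "0": "о"}'
# Stand-in for the contents of bad_wordlist.txt (one mild example word).
BAD_WORDS = {"дурак"}


def restore(text, rules):
    """Lowercase the text and replace every character that has a rule."""
    return "".join(rules.get(ch, ch) for ch in text.lower())


def contains_bad_word(text, rules, bad_words):
    """Restore the text, split it into word tokens, and look each one up."""
    tokens = re.findall(r"\w+", restore(text, rules))
    return any(token in bad_words for token in tokens)


rules = json.loads(RULES_JSON)
restored = restore("Ты дур@к", rules)
flagged = contains_bad_word("Ты дур@к", rules, BAD_WORDS)
```

A character-level mapping like this also rewrites symbols in undistorted text (for example, digits in numbers), so a real pipeline would likely need extra checks.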