This repository contains the code, data, and experiments for the Multilingual Backpack Language Model, a project aimed at extending Backpack LMs to multilingual settings. Backpack LMs provide a flexible interface for interpretability and control in language modeling by explicitly encoding multiple senses for words. This work explores training Backpack LMs on parallel French-English corpora to efficiently handle polysemy in multilingual contexts.
Backpack LMs learn multiple sense vectors per word, allowing explicit modeling of polysemous words. Previously evaluated only in monolingual settings (English and Chinese), the architecture is extended here to multilingual modeling by training on both English and French using the Europarl and MultiUN datasets. The multilingual Backpack LM encodes word meanings across both languages efficiently, achieving lower perplexity and higher cloze accuracy than baseline GPT-2 models.
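Conceptually, the output at each position is a weighted sum of the sense vectors of the words in the context, and the Transformer only produces the weights. Below is a minimal PyTorch sketch of that output rule; the shapes, names, and uniform weights are illustrative only and do not reflect this repository's actual implementation.

```python
import torch

# Illustrative sizes (not the repo's configuration): vocab V, model dim d,
# k sense vectors per word, sequence length T.
V, d, k, T = 10_000, 64, 16, 8

sense_emb = torch.nn.Embedding(V, k * d)        # k sense vectors per vocabulary item
unembed = torch.nn.Linear(d, V, bias=False)     # output projection to the vocabulary

tokens = torch.randint(0, V, (1, T))            # a toy input sequence
senses = sense_emb(tokens).view(1, T, k, d)     # (batch, T, k, d) sense vectors

# In a real Backpack, a Transformer over the context produces these weights;
# uniform weights are used here just to show the output rule.
alpha = torch.full((1, T, T, k), 1.0 / (T * k))  # weight of sense l of word j at position i

# Output at position i = sum over context words j and senses l of alpha * sense vector.
out = torch.einsum('bijk,bjkd->bid', alpha, senses)
logits = unembed(out)                           # per-position next-token scores
print(logits.shape)                             # torch.Size([1, 8, 10000])
```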
- Clone this repository.
git clone https://github.com/clemsadand/multilingual-backpack-lm.git
cd multilingual-backpack-lm/working_dir
- You need to install NVIDIA drivers. Run:
cd bkp_install
bash bkp_nvidia.sh
- You may need to install Anaconda or Miniconda. To install it, run:
cd bkp_install
bash anaconda.sh
- You need to create a virtual environment with Python 3.10.
- With conda:
conda create --name bkp python=3.10
conda activate bkp # to activate
- Without conda:
python3.10 -m venv bkp
source bkp/bin/activate # to activate
- To install the required packages, run:
pip install numpy==1.23.5
pip install language_tool_python PyMultiDictionary tqdm wandb gdown tiktoken dataclasses datasets
pip install torch==2.0.1
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
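After installing, you can quickly check that PyTorch sees the GPU (a sanity check added here, not part of the original instructions):

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True if the NVIDIA driver and CUDA wheels match
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```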
The multilingual Backpack LM is trained on the following datasets:
- Europarl: Parallel French-English corpus from the European Parliament proceedings.
- MultiUN: Parallel corpus extracted from United Nations documents. To download these datasets, run:
cd data
bash get_data.sh
We trained a custom BPE tokenizer with a 10K-token vocabulary on Europarl and MultiUN.
- To tokenize these datasets and preprocess them for training, run the following (a sketch of what these prepare scripts do is given after this list):
cd data
python3.10 europarl/prepare.py
python3.10 multiun/prepare.py
- To use the already tokenized and preprocessed data, download Europarl and MultiUN and place them in data/europarl and data/multiun, respectively.
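For reference, codebases derived from nanoGPT (as nano-BackpackLM is) typically have prepare scripts that encode the raw text with the tokenizer and write each split as a flat binary file of uint16 token ids. The following is a hypothetical sketch under that assumption; the corpus file name, the placeholder gpt2 encoding, and the split ratios are all illustrative.

```python
import numpy as np
import tiktoken

# Placeholder encoding: the project trains its own 10K-token BPE tokenizer instead.
enc = tiktoken.get_encoding("gpt2")

# Hypothetical merged corpus file.
with open("europarl/europarl_fr_en.txt", encoding="utf-8") as f:
    text = f.read()

ids = enc.encode_ordinary(text)
n = len(ids)
splits = {
    "train": ids[: int(0.9 * n)],
    "val": ids[int(0.9 * n): int(0.95 * n)],
    "test": ids[int(0.95 * n):],
}

for name, split_ids in splits.items():
    # uint16 is enough because the vocabulary only has ~10K tokens.
    np.array(split_ids, dtype=np.uint16).tofile(f"europarl/{name}.bin")
```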
The figure above presents the data-preprocessing workflow. The diagram illustrates the steps from merging the bilingual corpora (Europarl and MultiUN) to training a customized BPE tokenizer, and includes the processes for tokenizing the Europarl French corpus and splitting the tokenized data into train, validation, and test sets for further processing.
We save the checkpoints of the different models trained on Europarl and MultiUN to Google Drive.
Model | Parameters | Number of sense vectors
---|---|---
Mini-GPT2 | 14M | - |
Mini-Backpack-16 | 19M | 16 |
Small-GPT2 | 93M | - |
Small-Backpack-16 | 112M | 16 |
To train a Backpack LM or a GPT-2 baseline, follow these steps:
- Configure the training setup:
- Modify the configuration file in config/ to set up the training parameters (e.g., model_name, wandb_log, learning_rate, device). A sketch of what such a config might contain follows these steps.
- Train the model:
- Start training with the following command:
python3.10 train.py config/train_small_16.py --out_dir=out-bkp-small-16 --model_name=backpack-lm
- Resume training with the following command:
python3.10 train.py config/train_small_16.py --out_dir=out-bkp-small-16 --model_name=backpack-lm --init_from=resume
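For orientation, here is a hypothetical sketch of what a config file such as config/train_small_16.py might contain in a nanoGPT-style setup; all values, and the variable names beyond those listed above, are illustrative rather than the repository's defaults.

```python
# Hypothetical training configuration (illustrative values only)
model_name = 'backpack-lm'   # or 'gpt2' for the baseline
out_dir = 'out-bkp-small-16'

wandb_log = True             # log metrics to Weights & Biases
wandb_project = 'multilingual-backpack-lm'

# model size
n_layer = 12
n_head = 12
n_embd = 768
num_senses = 16              # number of sense vectors per token

# optimization
batch_size = 32
learning_rate = 6e-4
max_iters = 100_000

device = 'cuda'              # 'cpu' to debug without a GPU
```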
The evaluation includes both intrinsic and extrinsic metrics:
- Perplexity: Assesses the model’s ability to predict held-out text.
python3.10 perplexity_per_lang.py config/train_mini_16.py --model_name=backpack-lm --out_dir=out-bkp-mini-16 --device=cuda
- Cloze task: Measures the model’s accuracy in filling in missing words.
python3.10 cloze_test.py --model_name=backpack-lm --out_dir=out-bkp-small-16 --device=cuda
- Sense visualization: Analyzes the learned sense vectors for word representation.
python3.10 sense_visualisation.py --model_name=backpack-lm --out_dir=out-bkp-small-16 --device=cuda
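Perplexity is the exponentiated average negative log-likelihood of the held-out tokens. A minimal sketch of that computation is shown below; it assumes the model returns logits directly and omits the per-language bookkeeping that perplexity_per_lang.py performs.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids, block_size, device="cuda"):
    """Average cross-entropy over non-overlapping blocks, then exponentiate."""
    losses = []
    for start in range(0, len(token_ids) - block_size - 1, block_size):
        x = token_ids[start:start + block_size].unsqueeze(0).to(device)
        y = token_ids[start + 1:start + 1 + block_size].unsqueeze(0).to(device)
        logits = model(x)  # assumes logits of shape (1, block_size, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))
```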
This research marks the first application of Backpack LMs in multilingual settings, specifically training them on English and French corpora simultaneously.
The models efficiently learn word meanings without encoding language-specific sense vectors, allowing them to handle polysemous words effectively.
The Backpack LM (112M parameters) achieved lower perplexity than a baseline GPT-2 (93M parameters) and slightly outperformed it in top-1 accuracy on a cloze task.
We found that the multilingual Backpack LMs learn different aspects of word meaning in different senses, and these senses appear to serve the same function in both languages most of the time, suggesting that the senses are language-independent. For example, sense 4 encodes different grammatical forms, with related nouns and adverbs in both languages for almost all words.
Sense 4 (English words) | ||
---|---|---|
rights | law | quick |
rights | law | quick |
Universal | law | quick |
constitutions | jur | faster |
right | Arrest | fast |
Covenant | judges (juges) | quickest |
Sense 4 (French words) | ||
---|---|---|
equality (égalité) | job (emploi) | necessary (nécessaire) |
equality (égalité) | job (emploi) | necessary (nécessaire) |
males (masculins) | job (emploi) | indispensable (indispensables) |
discriminations | employment | necessary (nécessaire) |
inequality (inégalité) | unemployed (chômeurs) | indispensable (indispensable) |
feminine (féminin) | job (emploi) | essential (primordiales) |
The study found that the sense distributions learned by the Backpack LMs do not vary significantly across languages, suggesting that these models can effectively share sense vectors between languages without losing semantic accuracy.
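Tables like the ones above can be produced the way the original Backpack paper visualizes senses: project a word's l-th sense vector through the output (softmax) embedding and list the highest-scoring vocabulary items. A rough sketch, with hypothetical accessor names (sense_vectors, lm_head) standing in for whatever the repository's model class actually exposes:

```python
import torch

@torch.no_grad()
def top_words_for_sense(model, tokenizer, word, sense_idx, k=6):
    """Project one sense vector of `word` through the output embedding
    and return the k vocabulary items it scores highest."""
    token_id = tokenizer.encode(word)[0]          # assumes the word maps to a single token
    # hypothetical accessor returning (1, 1, num_senses, d) sense vectors
    senses = model.sense_vectors(torch.tensor([[token_id]]))
    vec = senses[0, 0, sense_idx]                 # the sense vector to inspect
    scores = model.lm_head.weight @ vec           # assumes a linear output head of shape (vocab, d)
    top_ids = scores.topk(k).indices.tolist()
    return [tokenizer.decode([i]) for i in top_ids]
```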
This implementation is based on the GitHub repository nano-BackpackLM.
- Backpack Language Models, by John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang (2023).
- Character-level Chinese Backpack Language Models, by Hao Sun and John Hewitt (2023).