This repository contains the code, data, and experiments for the Multilingual Backpack Language Model, a project aimed at extending Backpack LMs to multilingual settings. Backpack LMs provide a flexible interface for interpretability and control in language modeling by explicitly encoding multiple senses for words. This work explores training Backpack LMs on parallel French-English corpora to efficiently handle polysemy in multilingual contexts.
Backpack LMs learn multiple sense vectors per word, allowing explicit modeling of polysemous words. Previously evaluated only in monolingual settings (English and Chinese), the architecture is extended here to multilingual modeling by training on both English and French using the Europarl and MultiUN datasets. The multilingual Backpack LM encodes word meanings across both languages efficiently, achieving lower perplexity and higher cloze accuracy than baseline GPT-2 models.
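Conceptually, the output at each position is a weighted sum of the sense vectors of the words in the context, and the Transformer only produces the weights. Below is a minimal PyTorch sketch of that output rule; the shapes, names, and uniform weights are illustrative only and do not reflect this repository's actual implementation.

```python
import torch

# Illustrative sizes (not the repo's configuration): vocab V, model dim d,
# k sense vectors per word, sequence length T.
V, d, k, T = 10_000, 64, 16, 8

sense_emb = torch.nn.Embedding(V, k * d)        # k sense vectors per vocabulary item
unembed = torch.nn.Linear(d, V, bias=False)     # output projection to the vocabulary

tokens = torch.randint(0, V, (1, T))            # a toy input sequence
senses = sense_emb(tokens).view(1, T, k, d)     # (batch, T, k, d) sense vectors

# In a real Backpack, a Transformer over the context produces these weights;
# uniform weights are used here just to show the output rule.
alpha = torch.full((1, T, T, k), 1.0 / (T * k))  # weight of sense l of word j at position i

# Output at position i = sum over context words j and senses l of alpha * sense vector.
out = torch.einsum('bijk,bjkd->bid', alpha, senses)
logits = unembed(out)                           # per-position next-token scores
print(logits.shape)                             # torch.Size([1, 8, 10000])
```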
- Clone this repository.
git clone https://github.com/clemsadand/multilingual-backpack-lm.git
cd multilingual-backpack-lm/working_dir
- You need to install NVIDIA drivers. Run:
cd bkp_install
bash bkp_nvidia.sh
- You may need to install Anaconda or Miniconda. To install it, run:
cd bkp_install
bash anaconda.sh
- You need to create a virtual environment with Python 3.10.
- With conda:
conda create --name bkp python=3.10
conda activate bkp # to activate
- Without conda:
python3.10 -m venv bkp
source bkp/bin/activate # to activate
- To install the required packages, run:
pip install numpy==1.23.5
pip install language_tool_python PyMultiDictionary tqdm wandb gdown tiktoken dataclasses datasets
pip install torch==2.0.1
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
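After installing, you can quickly check that PyTorch sees the GPU (a sanity check added here, not part of the original instructions):

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True if the NVIDIA driver and CUDA wheels match
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```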
The multilingual Backpack LM is trained on the following datasets:
- Europarl: Parallel French-English corpus from the European Parliament proceedings.
- MultiUN: Parallel corpus extracted from United Nations documents. To download these datasets, run:
cd data
bash get_data.sh
We trained a custom BPE tokenizer with a 10K-token vocabulary on Europarl and MultiUN.
- To tokenize these datasets and preprocess them for training, run the following (a sketch of what these prepare scripts do is given after this list):
cd data
python3.10 europarl/prepare.py
python3.10 multiun/prepare.py
- To use the already tokenized and preprocessed data, download Europarl and MultiUN and place them in data/europarl and data/multiun, respectively.
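For reference, codebases derived from nanoGPT (as nano-BackpackLM is) typically have prepare scripts that encode the raw text with the tokenizer and write each split as a flat binary file of uint16 token ids. The following is a hypothetical sketch under that assumption; the corpus file name, the placeholder gpt2 encoding, and the split ratios are all illustrative.

```python
import numpy as np
import tiktoken

# Placeholder encoding: the project trains its own 10K-token BPE tokenizer instead.
enc = tiktoken.get_encoding("gpt2")

# Hypothetical merged corpus file.
with open("europarl/europarl_fr_en.txt", encoding="utf-8") as f:
    text = f.read()

ids = enc.encode_ordinary(text)
n = len(ids)
splits = {
    "train": ids[: int(0.9 * n)],
    "val": ids[int(0.9 * n): int(0.95 * n)],
    "test": ids[int(0.95 * n):],
}

for name, split_ids in splits.items():
    # uint16 is enough because the vocabulary only has ~10K tokens.
    np.array(split_ids, dtype=np.uint16).tofile(f"europarl/{name}.bin")
```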
The figure above presents the data-preprocessing workflow. The diagram illustrates the steps from merging the bilingual corpora (Europarl and MultiUN) to training a customized BPE tokenizer, and includes the processes for tokenizing the Europarl French corpus and splitting the tokenized data into train, validation, and test sets for further processing.
We save the checkpoints of the different models trained on Europarl and MultiUN to Google Drive.
Model | Parameters | Number of sense vectors
---|---|---
Mini-GPT2 | 14M | - |
Mini-Backpack-16 | 19M | 16 |
Small-GPT2 | 93M | - |
Small-Backpack-16 | 112M | 16 |
To train a Backpack LM or a GPT-2 baseline, follow these steps:
- Configure the training setup:
- Modify the configuration file in config/ to set up the training parameters (e.g., model_name, wandb_log, learning_rate, device). A sketch of what such a config might contain follows these steps.
- Train the model:
- Start training with the following command:
python3.10 train.py config/train_small_16.py --out_dir=out-bkp-small-16 --model_name=backpack-lm
- Resume training with the following command:
python3.10 train.py config/train_small_16.py --out_dir=out-bkp-small-16 --model_name=backpack-lm --init_from=resume
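For orientation, here is a hypothetical sketch of what a config file such as config/train_small_16.py might contain in a nanoGPT-style setup; all values, and the variable names beyond those listed above, are illustrative rather than the repository's defaults.

```python
# Hypothetical training configuration (illustrative values only)
model_name = 'backpack-lm'   # or 'gpt2' for the baseline
out_dir = 'out-bkp-small-16'

wandb_log = True             # log metrics to Weights & Biases
wandb_project = 'multilingual-backpack-lm'

# model size
n_layer = 12
n_head = 12
n_embd = 768
num_senses = 16              # number of sense vectors per token

# optimization
batch_size = 32
learning_rate = 6e-4
max_iters = 100_000

device = 'cuda'              # 'cpu' to debug without a GPU
```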
The evaluation includes both intrinsic and extrinsic metrics:
- Perplexity: Assesses the model’s ability to predict held-out text.
python3.10 perplexity_per_lang.py config/train_mini_16.py --model_name=backpack-lm --out_dir=out-bkp-mini-16 --device=cuda
- Cloze task: Measures the model’s accuracy in filling in missing words.
python3.10 cloze_test.py --model_name=backpack-lm --out_dir=out-bkp-small-16 --device=cuda
- Sense visualization: Analyzes the learned sense vectors for word representation.
python3.10 sense_visualisation.py --model_name=backpack-lm --out_dir=out-bkp-small-16 --device=cuda
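Perplexity is the exponentiated average negative log-likelihood of the held-out tokens. A minimal sketch of that computation is shown below; it assumes the model returns logits directly and omits the per-language bookkeeping that perplexity_per_lang.py performs.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids, block_size, device="cuda"):
    """Average cross-entropy over non-overlapping blocks, then exponentiate."""
    losses = []
    for start in range(0, len(token_ids) - block_size - 1, block_size):
        x = token_ids[start:start + block_size].unsqueeze(0).to(device)
        y = token_ids[start + 1:start + 1 + block_size].unsqueeze(0).to(device)
        logits = model(x)  # assumes logits of shape (1, block_size, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        losses.append(loss.item())
    return math.exp(sum(losses) / len(losses))
```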
This research marks the first application of Backpack LMs in multilingual settings, specifically training them on English and French corpora simultaneously.
The models efficiently learn word meanings without encoding language-specific sense vectors, allowing them to handle polysemous words effectively.
The Backpack LM (112M parameters) achieved lower perplexity than a baseline GPT-2 (93M parameters) and slightly outperformed it in top-1 accuracy on a cloze task.
We found that the multilingual Backpack LMs learn different aspects of word meaning in different senses, and these senses appear to serve the same function in both languages most of the time, suggesting that the senses are language-independent. For example, sense 4 encodes different grammatical forms, with related nouns and adverbs in both languages for almost all words.
Sense 4 (English words) | ||
---|---|---|
rights | law | quick |
rights | law | quick |
Universal | law | quick |
constitutions | jur | faster |
right | Arrest | fast |
Covenant | judges (juges) | quickest |
Sense 4 (French words) | ||
---|---|---|
equality (égalité) | job (emploi) | necessary (nécessaire) |
equality (égalité) | job (emploi) | necessary (nécessaire) |
males (masculins) | job (emploi) | indispensable (indispensables) |
discriminations | employment | necessary (nécessaire) |
inequality (inégalité) | unemployed (chômeurs) | indispensable (indispensable) |
feminine (féminin) | job (emploi) | essential (primordiales) |
The study found that the sense distributions learned by the Backpack LMs do not vary significantly across languages, suggesting that these models can effectively share sense vectors between languages without losing semantic accuracy.
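Tables like the ones above can be produced the way the original Backpack paper visualizes senses: project a word's l-th sense vector through the output (softmax) embedding and list the highest-scoring vocabulary items. A rough sketch, with hypothetical accessor names (sense_vectors, lm_head) standing in for whatever the repository's model class actually exposes:

```python
import torch

@torch.no_grad()
def top_words_for_sense(model, tokenizer, word, sense_idx, k=6):
    """Project one sense vector of `word` through the output embedding
    and return the k vocabulary items it scores highest."""
    token_id = tokenizer.encode(word)[0]          # assumes the word maps to a single token
    # hypothetical accessor returning (1, 1, num_senses, d) sense vectors
    senses = model.sense_vectors(torch.tensor([[token_id]]))
    vec = senses[0, 0, sense_idx]                 # the sense vector to inspect
    scores = model.lm_head.weight @ vec           # assumes a linear output head of shape (vocab, d)
    top_ids = scores.topk(k).indices.tolist()
    return [tokenizer.decode([i]) for i in top_ids]
```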
This implementation is based on the GitHub repository nano-BackpackLM.
- Backpack Language Models, by John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang (2023).
- Character-level Chinese Backpack Language Models, by Hao Sun and John Hewitt (2023).