> [!WARNING]
> This work is currently in progress.
This repository contains a pipeline for better ASR training that solves two tasks: (1) removing incorrect audio samples from ASR datasets via language identification (LID) filtering, and (2) normalizing text samples.
Authors:
- Yehor Smoliakov: @egorsmkv on GitHub, and egorsmkv@gmail.com for private discussions.
- Use https://huggingface.co/facebook/mms-lid-126 to detect the language of audio samples.
- Use https://github.com/pemistahl/lingua-py to detect the language of text samples.
- Use https://huggingface.co/skypro1111/mbart-large-50-verbalization for text normalization (converting numerals and abbreviations to their textual representation, e.g. "$5" -> "five dollars").
- We use the Ukrainian subset of YODAS2 in our command examples.
- We patch the YODAS2 dataset builder script to download only a part of the dataset.
Create a virtual environment and install the dependencies:

```bash
uv venv --python 3.12
source .venv/bin/activate

uv pip install -r requirements.txt

# in development mode
uv pip install -r requirements-dev.txt
```
- Generate a bash file to download the required files from YODAS2:

```bash
python generate_commands.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --max_files 10 > download_dataset.sh
```
- Download the dataset:

```bash
bash download_dataset.sh
```
- Convert the dataset to the `datasets` format:

Copy the `yodas2_dsbuilder.py` file into your `dataset_dir` directory and rename it after the directory. In the following example, the `dataset_dir` is `uk_yodas2`, so the script must be renamed to `uk_yodas2.py`.

Then convert the dataset; this will unarchive the files and generate metadata:
```bash
python convert_dataset.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --max_files 10 --cache_dir cache-yodas2-uk000
```
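After conversion, the dataset should be loadable with the standard `datasets` API. A minimal sketch; depending on your `datasets` version you may need to pass `trust_remote_code=True` for script-based datasets:

```python
from datasets import load_dataset

# The directory and builder script names must match (uk_yodas2/uk_yodas2.py),
# as described above.
ds = load_dataset("./uk_yodas2", "uk000", cache_dir="cache-yodas2-uk000",
                  trust_remote_code=True)
print(ds)
```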
- Extract utterances:

```bash
python extract_utterances.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --batch_size 128 > data/uk000.jsonl
```
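Each line of the resulting JSONL file is one utterance record. To inspect the schema of the first record (the field names depend on what `extract_utterances.py` emits):

```python
import json

# Pretty-print the first utterance record to see its fields.
with open("data/uk000.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())
print(json.dumps(record, ensure_ascii=False, indent=2))
```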
- Text LID:

```bash
python text_lid.py --file data/uk000.jsonl --to data/uk000_+tlid.jsonl
```
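Under the hood, text LID with lingua-py looks roughly like the following sketch (not the actual script; restricting the candidate languages is optional but speeds detection up):

```python
from lingua import Language, LanguageDetectorBuilder

# Restrict detection to a few plausible languages;
# LanguageDetectorBuilder.from_all_languages() covers the general case.
detector = LanguageDetectorBuilder.from_languages(
    Language.UKRAINIAN, Language.RUSSIAN, Language.ENGLISH
).build()

text = "Привіт, як справи?"
print(detector.detect_language_of(text))  # Language.UKRAINIAN

# Per-language confidence values, sorted from most to least likely.
for confidence in detector.compute_language_confidence_values(text):
    print(confidence.language.name, round(confidence.value, 4))
```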
- Filter by language:

```bash
python filter_by_language.py --file data/uk000_+tlid.jsonl --to data/uk000_+only_uk.jsonl --language uk --score 0.95
```
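`filter_by_language.py` presumably keeps only records whose detected language and confidence pass the given thresholds. A minimal sketch; the `language` and `score` field names are hypothetical, so check the actual keys in the JSONL produced by `text_lid.py`:

```python
import json

LANGUAGE = "uk"
MIN_SCORE = 0.95

# Copy through only the records detected as Ukrainian with high confidence.
with open("data/uk000_+tlid.jsonl", encoding="utf-8") as src, \
     open("data/uk000_+only_uk.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        if record["language"] == LANGUAGE and record["score"] >= MIN_SCORE:
            dst.write(line)
```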
- Audio LID:

```bash
python audio_lid.py --dataset_dir `pwd`/uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --batch_size 16 --model_id facebook/mms-lid-126 --file data/uk000_+tlid.jsonl --to data/uk000_+tlid_+alid.jsonl --device cuda:0
```
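Per the facebook/mms-lid-126 model card, classifying a single audio sample with transformers looks roughly like this sketch (`sample.wav` is a placeholder; the audio must be 16 kHz mono):

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id = "facebook/mms-lid-126"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id).to(device)
model.eval()

# Load a 16 kHz mono waveform; resample beforehand if needed.
waveform, sampling_rate = sf.read("sample.wav")

inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
lang_id = int(probs.argmax())
print(model.config.id2label[lang_id], float(probs[lang_id]))
```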
- Normalize utterances:

```bash
python normalize_utterances.py --file data/uk000.jsonl --to data/uk000_normalized.jsonl --batch_size 8 --device cuda:0
```
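The verbalization model is a standard MBART-50 seq2seq model. A sketch of normalizing a single utterance, assuming the `<verbalization>:` input prefix from the model card (verify the exact prompt format there):

```python
import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

model_id = "skypro1111/mbart-large-50-verbalization"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.src_lang = "uk_UA"  # MBART-50 language code for Ukrainian
model = MBartForConditionalGeneration.from_pretrained(model_id).to(device)
model.eval()

# The "<verbalization>:" task prefix is taken from the model card.
text = "<verbalization>:Ця пісня зібрала 1000000 переглядів."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=1024, num_beams=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```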
- Go to the `examples/` directory.

- Run inference on audio samples with the different variants of the MMS LID model to see their outputs:

```bash
python audio_lid.py --model_id facebook/mms-lid-126 --dataset_dir `pwd`/../uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --device cuda:0 > ../mms-checkpoints-test/mms-lid-126.txt
```
- Run inference on text samples with lingua-py to see their detected language:

```bash
python text_lid.py --dataset_dir `pwd`/../uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000
```
- Run inference on text samples with the MBART model for text normalization:

```bash
python normalize_utterances.py
```
- Calculate the duration of the dataset:

```bash
python count_durations.py --dataset_dir `pwd`/../uk_yodas2 --subset uk000 --cache_dir ../cache-yodas2-uk000 --batch_size 128
```
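`count_durations.py` presumably sums the per-sample audio lengths. A sketch under the assumption that the converted dataset has a standard `datasets` audio column named `audio` and a `train` split:

```python
from datasets import load_dataset

ds = load_dataset("./uk_yodas2", "uk000", cache_dir="cache-yodas2-uk000",
                  trust_remote_code=True)

total_seconds = 0.0
# The "audio" column and "train" split names are assumptions; adjust
# them to whatever the builder script actually defines.
for sample in ds["train"]:
    audio = sample["audio"]
    total_seconds += len(audio["array"]) / audio["sampling_rate"]

print(f"{total_seconds / 3600:.2f} hours")
```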
Check and format the code with Ruff:

```bash
ruff check
ruff format
```
MMS has these models for the LID task:
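- https://huggingface.co/facebook/mms-lid-126 (126 languages)
- https://huggingface.co/facebook/mms-lid-256 (256 languages)
- https://huggingface.co/facebook/mms-lid-512 (512 languages)
- https://huggingface.co/facebook/mms-lid-1024 (1024 languages)
- https://huggingface.co/facebook/mms-lid-2048 (2048 languages)
- https://huggingface.co/facebook/mms-lid-4017 (4017 languages)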