This is a T5 base model for the Czech language, created as a smaller version of the google/mt5-base model. To make it, I retained only the Czech and some of the English embeddings from the original multilingual model (the pruning step is sketched below the list).
- The parameter count was reduced from 582M to 244M.
- By keeping the top 20K Czech and 10K English tokens, the SentencePiece vocabulary was shrunk from 250K to 30K tokens.
- The model size on disk was reduced from 2.2GB to 0.9GB.
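The snippet below is a minimal sketch of that embedding-pruning step, not the exact script used for cst5-base; the toy corpus and `n_keep` are placeholders. Rebuilding the matching 30K SentencePiece vocabulary (remapping piece IDs) is also required and is covered in David Dale's post referenced at the end.

```python
from collections import Counter

import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# Count token frequencies over the target-language corpora (two toy
# sentences here; the real model used large Czech and English corpora).
counts = Counter()
for text in ["Praha je hlavní město České republiky.",
             "This is an English sentence."]:
    counts.update(tok(text).input_ids)

# Keep pad/eos/unk (IDs 0-2 in mT5) plus the most frequent pieces; the
# real model kept ~30K tokens. The sentinel tokens (<extra_id_*>) should
# also be kept if you plan to continue span-corruption pretraining.
n_keep = 30_000
kept_ids = sorted({0, 1, 2} | {i for i, _ in counts.most_common(n_keep)})
idx = torch.tensor(kept_ids)

# Slice the shared input embedding matrix down to the kept rows; T5's
# set_input_embeddings propagates the new table to encoder and decoder.
new_shared = torch.nn.Embedding.from_pretrained(
    model.shared.weight.data[idx].clone(), freeze=False)
model.set_input_embeddings(new_shared)

# mT5 keeps a separate (untied) lm_head, so prune its rows the same way.
new_head = torch.nn.Linear(model.config.d_model, len(kept_ids), bias=False)
new_head.weight.data = model.lm_head.weight.data[idx].clone()
model.lm_head = new_head

model.config.vocab_size = len(kept_ids)
model.save_pretrained("cst5-base-pruned")
```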
To use the model with the 🤗 Transformers library:
```python
# !pip install transformers sentencepiece
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the trimmed tokenizer and the reduced model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("azizbarank/cst5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("azizbarank/cst5-base")
```
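As a quick sanity check (the Czech sentence below is just an illustrative example), the trimmed tokenizer should still segment Czech text into reasonable subword pieces:

```python
# Inspect how the 30K-token vocabulary segments a Czech sentence.
ids = tokenizer("Praha je hlavní město České republiky.").input_ids
print(tokenizer.convert_ids_to_tokens(ids))
```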
Notes:
- Since this is a base T5 model for Czech, it needs to be fine-tuned on appropriate datasets before it can be used for any downstream tasks (a minimal fine-tuning sketch follows these notes).
- The link to the model: https://huggingface.co/azizbarank/cst5-base
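The sketch below shows one way such fine-tuning could be set up with 🤗 Transformers; the toy dataset, the `source`/`target` column names, the task prefix, and the hyperparameters are all illustrative assumptions, not the author's setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("azizbarank/cst5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("azizbarank/cst5-base")

def preprocess(batch):
    # "source"/"target" are hypothetical column names; adapt to your data.
    inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128,
                       truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# Replace this one-example toy dataset with a real Czech task dataset.
train_dataset = Dataset.from_dict({
    "source": ["sumarizace: Praha je hlavní město České republiky."],
    "target": ["Praha je hlavní město."],
}).map(preprocess, batched=True, remove_columns=["source", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="cst5-finetuned",
                                  per_device_train_batch_size=8,
                                  learning_rate=5e-5,
                                  num_train_epochs=3),
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```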
Most of the work to create this model is based on the post written by David Dale:
"How to adapt a multilingual T5 model for a single language"