This is a T5 base model for the Czech language, created as a smaller version of the google/mt5-base model. To make it, I retained only the Czech and some of the English embeddings from the original multilingual model (the pruning step is sketched below the list).
- The parameter count was reduced from 582M to 244M.
- By keeping the top 20K Czech and 10K English tokens, the SentencePiece vocabulary was shrunk from 250K to 30K tokens.
- The model size on disk was reduced from 2.2GB to 0.9GB.
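The snippet below is a minimal sketch of that embedding-pruning step, not the exact script used for cst5-base; the toy corpus and `n_keep` are placeholders. Rebuilding the matching 30K SentencePiece vocabulary (remapping piece IDs) is also required and is covered in David Dale's post referenced at the end.

```python
from collections import Counter

import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# Count token frequencies over the target-language corpora (two toy
# sentences here; the real model used large Czech and English corpora).
counts = Counter()
for text in ["Praha je hlavní město České republiky.",
             "This is an English sentence."]:
    counts.update(tok(text).input_ids)

# Keep pad/eos/unk (IDs 0-2 in mT5) plus the most frequent pieces; the
# real model kept ~30K tokens. The sentinel tokens (<extra_id_*>) should
# also be kept if you plan to continue span-corruption pretraining.
n_keep = 30_000
kept_ids = sorted({0, 1, 2} | {i for i, _ in counts.most_common(n_keep)})
idx = torch.tensor(kept_ids)

# Slice the shared input embedding matrix down to the kept rows; T5's
# set_input_embeddings propagates the new table to encoder and decoder.
new_shared = torch.nn.Embedding.from_pretrained(
    model.shared.weight.data[idx].clone(), freeze=False)
model.set_input_embeddings(new_shared)

# mT5 keeps a separate (untied) lm_head, so prune its rows the same way.
new_head = torch.nn.Linear(model.config.d_model, len(kept_ids), bias=False)
new_head.weight.data = model.lm_head.weight.data[idx].clone()
model.lm_head = new_head

model.config.vocab_size = len(kept_ids)
model.save_pretrained("cst5-base-pruned")
```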
To use the model with the 🤗 Transformers library:
```python
# !pip install transformers sentencepiece
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the trimmed tokenizer and the reduced model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("azizbarank/cst5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("azizbarank/cst5-base")
```
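As a quick sanity check (the Czech sentence below is just an illustrative example), the trimmed tokenizer should still segment Czech text into reasonable subword pieces:

```python
# Inspect how the 30K-token vocabulary segments a Czech sentence.
ids = tokenizer("Praha je hlavní město České republiky.").input_ids
print(tokenizer.convert_ids_to_tokens(ids))
```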
Notes:
- Since this is a base T5 model for Czech, it needs to be fine-tuned on appropriate datasets before it can be used for any downstream tasks (a minimal fine-tuning sketch follows these notes).
- The link to the model: https://huggingface.co/azizbarank/cst5-base
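The sketch below shows one way such fine-tuning could be set up with 🤗 Transformers; the toy dataset, the `source`/`target` column names, the task prefix, and the hyperparameters are all illustrative assumptions, not the author's setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("azizbarank/cst5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("azizbarank/cst5-base")

def preprocess(batch):
    # "source"/"target" are hypothetical column names; adapt to your data.
    inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128,
                       truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# Replace this one-example toy dataset with a real Czech task dataset.
train_dataset = Dataset.from_dict({
    "source": ["sumarizace: Praha je hlavní město České republiky."],
    "target": ["Praha je hlavní město."],
}).map(preprocess, batched=True, remove_columns=["source", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="cst5-finetuned",
                                  per_device_train_batch_size=8,
                                  learning_rate=5e-5,
                                  num_train_epochs=3),
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```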
Most of the work to create this model is based on the post written by David Dale:
"How to adapt a multilingual T5 model for a single language"