This repository contains SentencePiece tokenizers trained on Wikipedia snapshots using the WikiLoader package. At the moment, it is maintained for our friends at the PeARS project, who are developing a multilingual, decentralised search engine, but you are of course free to use the models for any other purpose. We will keep adding languages.
The vocabulary size is held constant across languages, at either 8000 or 16000 wordpieces. Each model is trained on the first 5M words of the corresponding snapshot.
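For readers who want to reproduce a comparable setup, here is a minimal sketch of training on the first 5M words of a corpus with a fixed vocabulary size. It is not the WikiLoader code: the helper names `first_n_words` and `train_tokenizer`, the plain-text corpus file, and the definition of a "word" as a whitespace-separated token are all assumptions made for illustration.

```python
def first_n_words(lines, max_words):
    """Yield lines until max_words whitespace-separated tokens have been emitted.

    Assumption: a "word" is a whitespace-separated token. The procedure
    actually used to build the models in this repository may differ.
    """
    remaining = max_words
    for line in lines:
        if remaining <= 0:
            break
        words = line.split()
        if not words:
            continue  # skip blank lines
        kept = words[:remaining]  # cut the last line if it would overshoot
        remaining -= len(kept)
        yield " ".join(kept)


def train_tokenizer(corpus_path, model_prefix, vocab_size=8000, max_words=5_000_000):
    """Train a SentencePiece model on the first max_words words of a text file.

    Writes <model_prefix>.model and <model_prefix>.vocab.
    """
    import sentencepiece as spm  # pip install sentencepiece

    with open(corpus_path, encoding="utf-8") as f:
        spm.SentencePieceTrainer.train(
            sentence_iterator=first_n_words(f, max_words),
            model_prefix=model_prefix,
            vocab_size=vocab_size,
        )
```

Calling `train_tokenizer("corpus.txt", "mymodel", vocab_size=16000)` on a hypothetical `corpus.txt` would produce `mymodel.model` and `mymodel.vocab`.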
For each language, you will need:
- a .vocab file containing the list of the 8000/16000 wordpieces used by the model
- a .model file containing the actual SentencePiece model
These are stored in the vocabs/ and models/ directories respectively, under the relevant language code.
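As a minimal sketch of how the two files can be used from Python, the snippet below reads a .vocab file (one wordpiece and its score per line, separated by a tab) and tokenizes a string with a .model file through the `sentencepiece` package. The function names `load_vocab` and `tokenize` are illustrative, and the file paths in the usage note are placeholders, since the exact file names depend on the language and vocabulary size you pick.

```python
def load_vocab(vocab_path):
    """Read a SentencePiece .vocab file into a {wordpiece: score} dict."""
    vocab = {}
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line is "<piece>\t<score>".
            piece, _, score = line.partition("\t")
            vocab[piece] = float(score) if score else 0.0
    return vocab


def tokenize(model_path, text):
    """Split text into wordpieces using a trained SentencePiece model."""
    import sentencepiece as spm  # pip install sentencepiece

    sp = spm.SentencePieceProcessor(model_file=str(model_path))
    return sp.encode(text, out_type=str)
```

For example, `tokenize("models/<lang>/<name>.model", "some text")` returns a list of wordpiece strings, and `load_vocab("vocabs/<lang>/<name>.vocab")` returns the matching vocabulary, where `<lang>` and `<name>` stand for whichever files you downloaded.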
In addition, English, French, German and Malayalam have nearest-neighbour files corresponding to the 16000-wordpiece models, stored in the nns folder. These were generated by a FastText model trained on 100M wordpieces with the WikiLoader package (40M for Malayalam, due to the smaller size of the corresponding Wikipedia snapshot). NB: these files do not contain nearest neighbours for every wordpiece in the vocabulary, as pieces below a certain frequency threshold are ignored.