Skip to content

zeeshanalipanhwar/qaido

Repository files navigation

Qaido

Qaido is a Large-Scale Font-Diverse Sindhi Ligature Recognition benchmark dataset. It is a collection of synthesized 22,597 Sindhi ligatures in 256 different Sindhi fonts. In total it comprises of 5,784,832 images which are randomly split into train and test sets with ratio 75:25 based on font styles. There are also some mini-versions of the data based on the lengths of ligatures, or their frequencies.

The following image contains five random Sindhi Ligatures in 14 random fonts.

Contents

  1. Sindhi Script
  2. The Data
  3. Tutorials
  4. Pre-trained Models

Sindhi Script

The Persian alphabet is a modification of the Arabic alphabet with four additional letters. It became the basis of Sindhi alphabet with two digraphs and eighteen new letters. The Sindhi alphabet has 52 letters, which is twice the number of letters in English, and covering wide varieties of sounds. The following figure shows the extended Perso-Arabic Sindhi script that is read from right to left.

The Data

A collection of 104,145 unique words has been extracted from four different sources, Shah Jo Risalo, Sachal Jo Sindhi Kalaam, Quran Jo Tarjumo, and Digital South Asia Library. A total of 22,597 Sindhi ligatures are extracted from the collection of the words. These ligatures are written on gray images of size $80\times80$ and labeled with the ligatures as classes. The following table contains all sets of synthetic data.

Name Content Classes Examples Size Link MD5 Checksum
train.tar.xz training set images 22,597 4,338,624 2.65 GBytes Download c1df00bb2c8f3ca8f86981c173e11e60
test.tar.xz test set images 22,597 1,446,208 905 MBytes Download 393076c1f014a28d5bee285d4cb78b90
ligatures_map index to ligature mapping 22,597 22,597 219 KBytes Download d5afa69dea351c1df16da7f785d4b42c
train1.tar.xz training set images 55 10,560 4.1 MBytes Download 7d18159f33de2caa0dfccf3c91b8549b
train2.tar.xz training set images 1,119 214,848 116.5 MBytes Download 86350c846d471b0de55b248f73a5be4f
train3.tar.xz training set images 7,031 1,349,952 823.9 MBytes Download e05f2fa53547a453f026101ba18dfa0c
train4.tar.xz training set images 16,472 1,349,952 1.94 GBytes Download 1e4d0402347c47dfe40e850258278f7e
train_5000.tar.xz training set images 5,000 960,000 581.9 MBytes Download c1db3f5e8c530499aaa9da7e9c731f38
test1.tar.xz test set images 55 3,520 1.4 MBytes Download 6e9c9ee73607da4201504566994b6e09
test2.tar.xz test set images 1,119 7,616 38.7 MBytes Download 79aa49ba9c1638847fdaf9c58990f7c8
test3.tar.xz test set images 7,031 449,984 274.8 MBytes Download 4598e9a30cc989a9a4db7c17a76ac531
test4.tar.xz test set images 16,472 1,054,208 662.5 MBytes Download b0df91021bb4d0c6396bb31a2e0c6517
test_5000.tar.xz test set images 5,000 320,000 194.1 MBytes Download 8bf0d00db37004ca01595f9133241b01
ligatures_map_5000 index to ligature mapping 5,000 5,000 42 KBytes Download 93cf0c3228e42addaca9774a5fe07c30

Data format

The training and test data sets are arranged in the following data structure:

train
|
├── 0               // directory name is class index
│   ├── 0.png
│   ├── 1.png
│   └── ...
|
├── 1               
│   ├── 0.png
│   ├── 1.png
|   └─── ...
|
⋮
└──

Mapping directory/class to ligature

Use the following code to map a class directory to its corresponding ligature.

import codecs
with codecs.open('./data/ligatures_map', encoding='UTF-16LE') as ligature_file:
    ligatures_map = ligature_file.readlines()

class_idx = 22597
ligature = ligatures_map[class_idx]
print(ligature)

>>>ﺟﮭﻨﮕﻠﻴﭙﮣﻮ

Tutorials

For tutorials and code, use those of Qaida from here.

Pre-trained Models

The following table shows the models and their performance on their respective test sets of 64 unseen fonts.

Name Precision Recall Accuracy $\mathbf{F_1-Score}$ Size Link MD5 Checksum
SLRNet-22597 92.55% 91.85% 91.85% 91.95% 173.8 MBytes Download 7df8624c80d9ebf2d04fb250c3be89bb
SLRNet-5000 _ _ 90.00% _ 105 MBytes Download d73e1c98ae23b4c64c638aebb06fbb46

For a live demo of the SLRNet-22597 view qaido on HuggingFace spaces.

License

This project is licensed under the terms of the Creative Commons license.

Acknowledgements

This project structure followed guidlines from Qaida repository.

Citation

@INPROCEEDINGS{10410385,
  author={Ali, Zeeshan and Khan, Safdar Abbas and Khuram Shahzad, M. and Bilal, H. S. M.},
  booktitle={2023 International Conference on Frontiers of Information Technology (FIT)}, 
  title={A Large-Scale Font-Diverse Sindhi Ligature Recognition System}, 
  year={2023},
  volume={},
  number={},
  pages={132-137},
  keywords={Deep learning;Text recognition;Pipelines;Optical character recognition;Text detection;Benchmark testing;Rivers;Arabic Script;Sindhi Language;OCR;Deep Learning;Computer Vision;Digital Image Processing},
  doi={10.1109/FIT60620.2023.00033}}

Author

Maintainer Zeeshan Ali (zapt1860@gmail.com)

About

Large-Scale Font-Diverse Sindhi Ligature Recognition Datasets and Models

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages