Qaido

Qaido is a Large-Scale Font-Diverse Sindhi Ligature Recognition benchmark dataset. It is a collection of 22,597 synthesized Sindhi ligatures, each rendered in 256 different Sindhi fonts. In total it comprises 5,784,832 images, which are randomly split by font style into training and test sets with a 75:25 ratio. There are also some mini-versions of the data, based on the lengths of the ligatures or on their frequencies.
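The totals quoted here follow directly from the class and font counts. The sketch below works through the arithmetic, assuming the 75:25 split is taken over whole fonts so that every ligature appears once in each font of a split; the variable names are illustrative only.

```python
# Dataset arithmetic, using only figures stated in this README.
NUM_LIGATURES = 22_597   # ligature classes
NUM_FONTS = 256          # Sindhi fonts

# Assumption: the 75:25 split is made over fonts, not over individual images.
train_fonts = NUM_FONTS * 75 // 100
test_fonts = NUM_FONTS - train_fonts

total_images = NUM_LIGATURES * NUM_FONTS
train_images = NUM_LIGATURES * train_fonts
test_images = NUM_LIGATURES * test_fonts

print(train_fonts, test_fonts)      # 192 64
print(total_images)                 # 5784832
print(train_images, test_images)    # 4338624 1446208
```

Under that assumption each class has 192 training images and 64 test images, which agrees with the example counts listed for `train.tar.xz` and `test.tar.xz`.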

The following image shows five random Sindhi ligatures in 14 random fonts.

Sindhi Script

The Persian alphabet is a modification of the Arabic alphabet with four additional letters. It in turn became the basis of the Sindhi alphabet, which adds two digraphs and eighteen new letters. The Sindhi alphabet has 52 letters, twice as many as the English alphabet, and covers a wide variety of sounds. The following figure shows the extended Perso-Arabic Sindhi script, which is read from right to left.

The Data

A collection of 104,145 unique words was extracted from four sources: Shah Jo Risalo, Sachal Jo Sindhi Kalaam, Quran Jo Tarjumo, and the Digital South Asia Library. From these words, a total of 22,597 Sindhi ligatures were extracted. Each ligature is rendered on a grayscale image of size $80\times80$ and labeled with the ligature as its class. The following table lists all sets of the synthetic data.

Name	Content	Classes	Examples	Size	Link	MD5 Checksum
`train.tar.xz`	training set images	22,597	4,338,624	2.65 GBytes	Download	`c1df00bb2c8f3ca8f86981c173e11e60`
`test.tar.xz`	test set images	22,597	1,446,208	905 MBytes	Download	`393076c1f014a28d5bee285d4cb78b90`
`ligatures_map`	index to ligature mapping	22,597	22,597	219 KBytes	Download	`d5afa69dea351c1df16da7f785d4b42c`
`train1.tar.xz`	training set images	55	10,560	4.1 MBytes	Download	`7d18159f33de2caa0dfccf3c91b8549b`
`train2.tar.xz`	training set images	1,119	214,848	116.5 MBytes	Download	`86350c846d471b0de55b248f73a5be4f`
`train3.tar.xz`	training set images	7,031	1,349,952	823.9 MBytes	Download	`e05f2fa53547a453f026101ba18dfa0c`
`train4.tar.xz`	training set images	16,472	3,162,624	1.94 GBytes	Download	`1e4d0402347c47dfe40e850258278f7e`
`train_5000.tar.xz`	training set images	5,000	960,000	581.9 MBytes	Download	`c1db3f5e8c530499aaa9da7e9c731f38`
`test1.tar.xz`	test set images	55	3,520	1.4 MBytes	Download	`6e9c9ee73607da4201504566994b6e09`
`test2.tar.xz`	test set images	1,119	71,616	38.7 MBytes	Download	`79aa49ba9c1638847fdaf9c58990f7c8`
`test3.tar.xz`	test set images	7,031	449,984	274.8 MBytes	Download	`4598e9a30cc989a9a4db7c17a76ac531`
`test4.tar.xz`	test set images	16,472	1,054,208	662.5 MBytes	Download	`b0df91021bb4d0c6396bb31a2e0c6517`
`test_5000.tar.xz`	test set images	5,000	320,000	194.1 MBytes	Download	`8bf0d00db37004ca01595f9133241b01`
`ligatures_map_5000`	index to ligature mapping	5,000	5,000	42 KBytes	Download	`93cf0c3228e42addaca9774a5fe07c30`
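The MD5 checksums in the table can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library. The function names and the `./data` download directory are assumptions made for illustration; only three of the checksums are copied in, and the rest can be added the same way.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the table; extend with the other archives as needed.
EXPECTED_MD5 = {
    "train.tar.xz": "c1df00bb2c8f3ca8f86981c173e11e60",
    "test.tar.xz": "393076c1f014a28d5bee285d4cb78b90",
    "ligatures_map": "d5afa69dea351c1df16da7f785d4b42c",
}


def verify_downloads(directory="./data"):
    """Check each expected file in `directory` against its checksum.

    Returns a dict mapping file name to True (match), False (mismatch),
    or None (file not found).
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Reading in chunks keeps memory use flat even for the multi-gigabyte training archive.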

Data format

The training and test data sets are arranged in the following data structure:

train
│
├── 0               // directory name is class index
│   ├── 0.png
│   ├── 1.png
│   └── ...
│
├── 1
│   ├── 0.png
│   ├── 1.png
│   └── ...
│
⋮
└── ...
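Given this layout, the class of every image can be recovered from the name of its parent directory. The sketch below collects `(image_path, class_index)` pairs using only the standard library. The function name `list_samples` is hypothetical, and the code assumes that every class directory name is a plain non-negative integer.

```python
from pathlib import Path


def list_samples(root):
    """Return sorted (image_path, class_index) pairs found under `root`.

    Assumes one sub-directory per class, named by its integer class index,
    each holding that class's PNG images.
    """
    samples = []
    for class_dir in Path(root).iterdir():
        # Skip stray files and any directory whose name is not an integer.
        if not class_dir.is_dir() or not class_dir.name.isdigit():
            continue
        class_idx = int(class_dir.name)
        for image_path in class_dir.glob("*.png"):
            samples.append((image_path, class_idx))
    # Sort by numeric class index, then by file name, for a stable order.
    samples.sort(key=lambda item: (item[1], item[0].name))
    return samples
```

Parsing the directory name as an integer matters here: loaders that sort class directories as strings (for example torchvision's `ImageFolder`) place `10` before `2`, so their internal class indices would not line up with the directory names or with the ligature map.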

Mapping directory/class to ligature

Use the following code to map a class directory to its corresponding ligature.

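import codecs

# The map file is UTF-16LE text with one ligature per line;
# the line index is the class index (the class directory name).
with codecs.open('./data/ligatures_map', encoding='UTF-16LE') as ligature_file:
    ligatures_map = ligature_file.read().splitlines()

class_idx = 22597  # must be a valid line index into ligatures_map
ligature = ligatures_map[class_idx]
print(ligature)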

>>>  ‫ﺟﮭﻨﮕﻠﻴﭙﮣﻮ‬
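When class indices come from a model's predictions rather than from a hand-picked value, an explicit range check gives a clearer error than a bare list lookup. The following is a sketch of a slightly more defensive variant of the lookup. Both function names are hypothetical, and the byte-order-mark handling is an assumption, since this README does not say whether the map file starts with one.

```python
import codecs


def load_ligatures_map(path):
    """Read the index-to-ligature map: one ligature per line, UTF-16LE."""
    with codecs.open(path, encoding="UTF-16LE") as handle:
        text = handle.read()
    # Assumption: the file may begin with a byte-order mark; drop it if so.
    text = text.lstrip("\ufeff")
    return text.splitlines()


def ligature_for_class(ligatures_map, class_idx):
    """Return the ligature for a class index, with an explicit range check."""
    if not 0 <= class_idx < len(ligatures_map):
        raise IndexError(
            f"class index {class_idx} is outside 0..{len(ligatures_map) - 1}"
        )
    return ligatures_map[class_idx]
```

For example, `ligature_for_class(load_ligatures_map('./data/ligatures_map'), 0)` would return the ligature of class directory `0`.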

Tutorials

For tutorials and code, refer to those in the Qaida repository.

Pre-trained Models

The following table shows the models and their performance on their respective test sets of 64 unseen fonts.

Name	Precision	Recall	Accuracy	$F_1$-Score	Size	Link	MD5 Checksum
`SLRNet-22597`	92.55%	91.85%	91.85%	91.95%	173.8 MBytes	Download	`7df8624c80d9ebf2d04fb250c3be89bb`
`SLRNet-5000`	_	_	90.00%	_	105 MBytes	Download	`d73e1c98ae23b4c64c638aebb06fbb46`
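This README does not state how the per-class precision, recall and $F_1$ scores are averaged. One reading consistent with the identical recall and accuracy reported for `SLRNet-22597` is support-weighted averaging, under which recall always equals accuracy. This is an assumption, not a confirmed detail of the evaluation. The sketch below computes the four metrics that way in plain Python; the function name is hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 (as fractions)."""
    total = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for cls, n in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total         # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / total
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

If every class has the same number of test images, as the example counts in The Data suggest, unweighted (macro) averaging yields the same values, so the two conventions cannot be told apart from the table alone.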

For a live demo of SLRNet-22597, see the qaido Space on Hugging Face Spaces.

License

This project is licensed under the terms of the Creative Commons license.

Acknowledgements

The structure of this project follows the guidelines of the Qaida repository.

Citation

@INPROCEEDINGS{10410385,
  author={Ali, Zeeshan and Khan, Safdar Abbas and Khuram Shahzad, M. and Bilal, H. S. M.},
  booktitle={2023 International Conference on Frontiers of Information Technology (FIT)}, 
  title={A Large-Scale Font-Diverse Sindhi Ligature Recognition System}, 
  year={2023},
  volume={},
  number={},
  pages={132-137},
  keywords={Deep learning;Text recognition;Pipelines;Optical character recognition;Text detection;Benchmark testing;Rivers;Arabic Script;Sindhi Language;OCR;Deep Learning;Computer Vision;Digital Image Processing},
  doi={10.1109/FIT60620.2023.00033}}

Author

Maintainer: Zeeshan Ali (zapt1860@gmail.com)
