output classification #1

Open
Eviaiy opened this issue Feb 9, 2024 · 1 comment

Comments

@Eviaiy
Owner

Eviaiy commented Feb 9, 2024

Since we want to manipulate the recognized data, we could develop a classifier that categorizes the recognized text and converts it to JSON. The pipeline would take data originally presented as a table, transform it into a text file, and then convert that text file into a JSON file.

@Eviaiy Eviaiy added the enhancement New feature or request label Feb 9, 2024
@Eviaiy Eviaiy self-assigned this Feb 9, 2024
@Eviaiy
Owner Author

Eviaiy commented Feb 9, 2024

An automated system that converts tabular data from text files into JSON can be built in several ways, depending on how complex and variable the data is. Here are some strategies to consider:

  1. Rule-Based Parsing:
    Regular Expressions: Craft specific regular expressions to match and capture the structure of the data. This works well if the data follows a consistent pattern.

  2. Natural Language Processing (NLP):
    Named Entity Recognition (NER): Use NLP to identify and classify the entities in the text (e.g., "Energy" as a category and "2081 kJ / 497 kcal" as a value).

  3. Machine Learning Models:
    Custom Classifier: Train a classifier to identify parts of the text that correspond to different categories of the table.
    Sequence Labeling: Implement a sequence-labeling (token-classification) model such as a BiLSTM or BERT to tag each token with a label (e.g., B-category, I-value) that marks the beginning or the inside of a category or value.

  4. OCR with Built-in Structuring:
    Advanced OCR Solutions: Some OCR tools provide structured outputs that identify tables and lists (e.g., Google Cloud Vision API, Amazon Textract).

  5. Hybrid Approaches:
    Combine rule-based and ML-based approaches where rules handle standard cases and ML handles edge cases.
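
As a starting point, strategies 1 and 5 could be sketched roughly as below. This is only a minimal sketch under assumptions: it supposes each table row ends up as one text line of the form "label amount unit [/ amount unit]" (like the "Energy 2081 kJ / 497 kcal" example), and the names `parse_line`, `text_to_json`, and the optional `fallback` hook (where an ML model could later handle lines the rules miss) are hypothetical, not existing code in this project.

```python
import json
import re

# Assumed line shape: "<label> <amount> <unit> [/ <amount> <unit> ...]"
LINE_RE = re.compile(
    r"^(?P<label>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?P<values>\d+(?:[.,]\d+)?\s*[A-Za-z%]+"
    r"(?:\s*/\s*\d+(?:[.,]\d+)?\s*[A-Za-z%]+)*)\s*$"
)
# One "<amount> <unit>" pair inside the values part of a line.
VALUE_RE = re.compile(r"(?P<amount>\d+(?:[.,]\d+)?)\s*(?P<unit>[A-Za-z%]+)")


def parse_line(line):
    """Rule-based step: return a dict for a matching line, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    values = [
        {
            # Accept both "25.5" and "25,5" as decimal notation.
            "amount": float(v.group("amount").replace(",", ".")),
            "unit": v.group("unit"),
        }
        for v in VALUE_RE.finditer(match.group("values"))
    ]
    return {"category": match.group("label").strip(), "values": values}


def text_to_json(text, fallback=None):
    """Convert recognized text to a JSON string.

    `fallback` is a hypothetical hook (e.g. an ML classifier) that is
    tried on lines the regular expression cannot parse.
    """
    rows, unparsed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = parse_line(line)
        if row is None and fallback is not None:
            row = fallback(line)
        if row is None:
            unparsed.append(line)  # keep for manual review
        else:
            rows.append(row)
    return json.dumps({"rows": rows, "unparsed": unparsed}, indent=2)


if __name__ == "__main__":
    sample = "Energy 2081 kJ / 497 kcal\nFat 25.5 g\nNutrition information"
    print(text_to_json(sample))
```

Lines that match no rule are collected under "unparsed" rather than dropped, so we can see how often the rules fail before deciding whether an ML fallback is worth the effort.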
