Improve OCR Accuracy for Complex Scientific Tables #6

Closed
taiwanhuachenyu opened this issue Oct 17, 2024 · 2 comments

Comments

@taiwanhuachenyu

image
image
When recognizing complex tables containing scientific data, the current OCR system exhibits several accuracy issues. The main problems identified are:

Inaccurate table structure recognition: The system fails to correctly identify and preserve the original table's column and row structure and their relationships.
Column header recognition failure: Important column headers such as "Antibody", "VH Chain", "VL Chain" are not correctly recognized, resulting in loss of data context.
Data association errors: Values are not correctly associated with their corresponding column headers, leading to confusion between data from different columns.
Compromised data integrity: Some values (such as binding affinity KD values) are incorrectly split or combined, affecting data accuracy.
Special character and abbreviation recognition issues: Labels such as "SEQ ID NO:" and units such as "nM" are not correctly recognized or preserved.

Suggested improvements:

Enhance recognition capabilities for structured scientific data.
Improve algorithms for column header and table header recognition.
Increase accuracy in matching values to their corresponding columns.
Optimize recognition of scientific notations and units.
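
To make the problems listed above easier to report and reproduce, the OCR output can be checked mechanically. The following is a minimal sketch, not part of the project: `check_table` and `KD_PATTERN` are hypothetical names, the sample rows are made up for illustration, and it assumes the recognized table is already available as a list of header strings and a list of rows. It flags rows whose cell count differs from the header count (structure or association errors) and KD cells that are not a number followed by a molar unit (split or truncated values).

```python
import re

# Hypothetical pattern for a binding-affinity cell such as "1.2 nM".
KD_PATTERN = re.compile(r"^\d+(?:\.\d+)?\s*(?:pM|nM|uM|mM)$")


def check_table(headers, rows, kd_column="KD"):
    """Return (row_index, problem) tuples for rows that look wrong."""
    problems = []
    kd_idx = headers.index(kd_column) if kd_column in headers else None
    for i, row in enumerate(rows):
        # A row with too many or too few cells cannot be mapped to headers.
        if len(row) != len(headers):
            problems.append((i, f"expected {len(headers)} cells, got {len(row)}"))
            continue
        # A KD cell should be a number followed by a unit, e.g. "1.2 nM".
        if kd_idx is not None and not KD_PATTERN.match(row[kd_idx].strip()):
            problems.append((i, f"unexpected KD value: {row[kd_idx]!r}"))
    return problems


# Made-up sample data for illustration only.
headers = ["Antibody", "VH Chain", "VL Chain", "KD"]
rows = [
    ["Ab1", "SEQ ID NO: 1", "SEQ ID NO: 2", "1.2 nM"],      # well-formed
    ["Ab2", "SEQ ID NO: 3", "SEQ ID NO: 4", "0.", "8 nM"],  # KD split in two
    ["Ab3", "SEQ ID NO: 5", "SEQ ID NO: 6", "3.4 n"],       # unit truncated
]

for idx, problem in check_table(headers, rows):
    print(f"row {idx}: {problem}")
```

Attaching the flagged rows to a report, together with the source image, would show which of the listed failure modes actually occurs.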

@VikParuchuri
Owner

I am unable to reproduce this using the image you provided:
image

By default, the table text is extracted from the PDF itself, and sometimes PDFs have bad text in them. In that case, use the "detect cell bboxes" option to re-detect the cells and re-OCR the text.

@conjuncts

I don't know if this is applicable at all, but I happened to get similar-looking output when passing a table_bbox that didn't match the highres_image. When I cropped the highres_image to the same size as the table_bbox, it was fixed.
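
A mismatch like the one described above can be caught before running recognition. This is a minimal sketch under stated assumptions: `scale_bbox` and `bbox_fits` are hypothetical helpers, not part of the library's API, and the page sizes are made up. It assumes the bbox is `(x0, y0, x1, y1)` in pixels of a lower-resolution page image and that the high-resolution image shows the same page at a different scale.

```python
def scale_bbox(bbox, from_size, to_size):
    """Map an (x0, y0, x1, y1) bbox from one image's pixel space to another's.

    from_size and to_size are (width, height) of the two images.
    """
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


def bbox_fits(bbox, size):
    """True if the bbox is non-empty and lies entirely inside the image."""
    x0, y0, x1, y1 = bbox
    width, height = size
    return 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height


# Made-up sizes: a low-res page image and a high-res render of the same page.
lowres_size = (800, 1000)
highres_size = (1600, 2000)
table_bbox = (100, 200, 700, 600)  # detected on the low-res image

highres_bbox = scale_bbox(table_bbox, lowres_size, highres_size)
print(highres_bbox)                           # (200.0, 400.0, 1400.0, 1200.0)
print(bbox_fits(highres_bbox, highres_size))  # True

# The scaled box can then be used to crop the high-res image, for example
# with Pillow: highres_image.crop(tuple(round(v) for v in highres_bbox))
```

If `bbox_fits` returns False, or the cropped image and the bbox differ in size, the two inputs are probably in different coordinate spaces.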
