Text Extraction

A modular framework for extracting text from many different sources (websites, PDFs, images).

Text Extractors comparison

PDF

There are two types of PDF:

  • "Image only" PDFs embed only (scanned) images and contain no selectable, and therefore no extractable, text. To get the text, the images first have to be extracted from the PDF and OCR then has to be applied to them. See section Images.
  • Searchable PDFs: If you open them in a PDF viewer you can select their text or search for it. The following libraries help to extract text from these types of PDFs:

Searchable PDFs

Extractor | Permissive License | Runs on Android | Advantages | Disadvantages
pdftotext ✔️
  • Best PDF extraction result so far
iText 2 ✔️ ✔️
  • Also works with PDFs with disordered layouts
  • Best PDF extraction result of any Java library I found
  • Works on older Androids (at least on Android 4.1)
  • Almost the same text extraction quality as the newer (and non-free) iText 7
iText ✔️
  • Also works with PDFs with disordered layouts
  • Best PDF extraction result of any Java library I found
  • Works on older Androids (at least on Android 4.1)
  • Not free / commercial (AGPL / commercial license)
OpenPDF ✔️ (✔️)
  • Free
  • Quite good and fast
  • Does not work on PDFs with disordered layouts
  • Does not run on older Androids (uses Java 8 features (Optional); works on Android 6 but not on Android 4.1, others not tested)
PDFBox (not added yet) ✔️
PdfBox-Android (not added yet) ✔️ ✔️

iText 2 and iText 7

iText 2 is the older, permissively licensed version of iText, which later turned commercial. Because the last free iText version, 2.1.7, has security flaws, I use version 2.1.7.js7 from JasperReports, which fixes these security issues. It is slower than iText 7, but in terms of text extraction quality I cannot see any difference between iText 7 and iText 2.

OpenPdf

OpenPdf took the last commit of iText that had a permissive license and developed it further. In my experience, however, its text extraction capability is worse than that of iText 7 and iText 2.

Do not add OpenPdfPdfTextExtractor and iText2PdfTextExtractor to the class path at the same time: both use the same package and class names but different method and class signatures, so one of them will crash when used.

(Very opinionated) Recommendation

If you can use pdftotext (Poppler), use pdftotext. It yields the best results both in terms of text extraction quality and speed.

Otherwise use the version of iText 2 with the security issues fixed. It's slower than the commercial (and really amazingly good) iText 7, but in terms of text extraction quality I cannot see any difference between iText 2 and iText 7.

I don't know why, but OpenPdf cannot extract any text at all from some PDFs.
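
For the pdftotext route recommended above, one simple way to use it from JVM code is to start the Poppler binary as an external process and read its standard output. The following is only a minimal sketch under stated assumptions, not this framework's implementation: the class name `PdftotextSketch` and its methods are hypothetical, `pdftotext` is assumed to be installed and on the PATH, and Java 9 or newer is assumed. Passing `-` as the output file makes pdftotext write the extracted text to stdout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of this framework.
class PdftotextSketch {

    // Builds the command line: UTF-8 output, written to stdout ("-").
    static List<String> buildCommand(String pdfPath) {
        return List.of("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns everything it printed to stdout.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                // Discard warnings on stderr so they neither mix into the text nor block the process.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.println(extractText(args[0]));
        } else {
            System.out.println("usage: PdftotextSketch <file.pdf>");
        }
    }
}
```

If the reading order of multi-column documents matters, pdftotext's `-layout` option may be worth trying as an additional argument; whether it helps depends on the PDF.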

How to distinguish between Searchable and "Image only" PDFs?

Kurt Pfeifle gave a superb hint (https://stackoverflow.com/a/3108531): Check how many fonts a PDF uses. If it uses fonts, it contains searchable text. If it uses no font at all, it contains only images.

I added IPdfTypeDetector implementations for Poppler / pdffonts and ...
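
The font-count check can be sketched with Poppler's `pdffonts` tool, which prints a header line, a dashed separator line, and then one line per font. The code below is an illustrative sketch only and not the IPdfTypeDetector implementation from this repository: the class name `PdfFontCheckSketch` and its methods are hypothetical, `pdffonts` is assumed to be on the PATH, and Java 9 or newer is assumed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of this framework.
class PdfFontCheckSketch {

    // Counts the font rows in pdffonts output, i.e. the non-blank lines after the "----" separator.
    static int countFonts(String pdffontsOutput) {
        boolean pastSeparator = false;
        int fonts = 0;
        for (String line : pdffontsOutput.split("\\R")) {
            if (!pastSeparator) {
                pastSeparator = line.startsWith("---");
            } else if (!line.trim().isEmpty()) {
                fonts++;
            }
        }
        return fonts;
    }

    // At least one font means the PDF contains text; no font means it is "image only".
    static boolean isSearchable(String pdffontsOutput) {
        return countFonts(pdffontsOutput) > 0;
    }

    // Runs pdffonts on the given file and returns its stdout.
    static String runPdfFonts(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdfPath)
                // Discard stderr so warnings are not counted as font rows.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            boolean searchable = isSearchable(runPdfFonts(args[0]));
            System.out.println(searchable ? "searchable" : "image only");
        } else {
            System.out.println("usage: PdfFontCheckSketch <file.pdf>");
        }
    }
}
```

Keeping the parsing in `countFonts` separate from the process call makes the check easy to test with canned `pdffonts` output.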

Images

(All variants with Tesseract 4 have the same extraction quality, which is quite good but not the best.)

Extractor | Advantages | Disadvantages
tess4j
  • Uses Tesseract 4
  • User has to install Tesseract
  • Extraction result depends a lot on image quality
  • Does not run on Android
Tesseract 4 over JNI (e.g. from Bytedeco)
  • Uses Tesseract 4
  • If there's an exception in native code, the whole application crashes (JNI)
  • User has to install Tesseract
  • Extraction result depends a lot on image quality
  • Does not run on Android
Tesseract4Android
  • Uses Tesseract 4
  • Very slow, took 2 minutes to recognize a single image (0.5 MB)
  • Extraction result depends a lot on image quality
Tess4Android
  • Uses Tesseract 4
  • Couldn't get it to compile
TextFairy (not added yet)
  • Uses Tesseract 3
  • Quite slow
  • Extraction result depends a lot on image quality
Microsoft Cloud Computer Vision API OCR (not implemented yet)
  • Best image extraction result I found so far
  • Requires registration (credit card required; every single user has to do this themselves)
  • Costs $1.50 per 1000 images
  • Data protection insanity, stores all your images and recognized text for years
Google Cloud Vision OCR (neither implemented nor tested yet)
  • Requires registration (credit card required; every single user has to do this themselves)
  • 1000 images per month are free, have to pay for more
  • Data protection insanity, stores all your images and recognized text for years

License

Unless stated otherwise, all code is licensed under the Apache License, Version 2.0.

Notice: Some libraries, like iText, have different, partially commercial licenses.