Hi Reinaldo,
here are my experiences with tesseract so far; perhaps they help in some cases:
[list:2436i3tb]- I want searchable PDF as direct output, not only hOCR; this is possible with version 3.03 and higher
- I want to run tesseract on Windows machines, so I was looking for Windows binaries of version 3.03 or higher
- Found them here [url:2436i3tb]https://github.com/UB-Mannheim/tesseract/wiki[/url:2436i3tb]
- My first and last experience with TIFF was this error: 'Error in pixReadFromTiffStream: spp not in set {1,3,4}'
- So I switched to PNG with good results, as recommended here: [url:2436i3tb]http://stackoverflow.com/questions/5083492/tesseract-and-tiff-format-spp-not-in-set-1-3[/url:2436i3tb]: 'Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.'
- Later on, I switched the image extraction to pdfimages, which generates ppm/pbm and optionally jpg images [b:2436i3tb]on the fly[/b:2436i3tb].
- I got good OCR results, perhaps because of this: 'pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.'
- So there is normally no problem with PDFs coming directly from the scanner, but be careful with PDFs that were scanned and then reworked
- To keep the PDF readable, I negate (color-invert) the pbm pictures [/list:u:2436i3tb]
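In case it helps, here is a rough Python sketch of the workflow from the list above: extract the images with pdfimages, optionally negate the pbm files, then let tesseract write a searchable PDF. It is only a sketch under assumptions, not my exact setup: the function names, the "page" prefix and the "eng" language are made up for illustration, both tools are assumed to be on the PATH, and the pbm helper assumes raw (P4) files whose header contains no comment lines.

```python
import re
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf_path, prefix, jpeg=False):
    """Build the pdfimages call; by default it writes ppm/pbm files."""
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")  # write JPEG-encoded images as .jpg instead of ppm
    return cmd + [str(pdf_path), str(prefix)]


def tesseract_pdf_cmd(image_path, lang="eng"):
    """Build the tesseract call; the trailing 'pdf' config needs 3.03+."""
    out_base = Path(image_path).with_suffix("")  # tesseract appends .pdf itself
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "pdf"]


def invert_raw_pbm(data):
    """Negate a raw (P4) pbm given as bytes (assumes no comments in the header)."""
    header = re.match(rb"P4\s+(\d+)\s+(\d+)\s", data)
    if header is None:
        raise ValueError("not a raw PBM (P4) file")
    body = data[header.end():]
    # Flip every bit of the packed raster; padding bits are ignored by readers.
    return header.group(0) + bytes(b ^ 0xFF for b in body)


def ocr_pdf(pdf_path, workdir, negate_pbm=False):
    """Split a scanned PDF into images and OCR each one to a searchable PDF."""
    workdir = Path(workdir)
    subprocess.run(pdfimages_cmd(pdf_path, workdir / "page"), check=True)
    for image in sorted(workdir.glob("page-*.p[bp]m")):
        if negate_pbm and image.suffix == ".pbm":
            image.write_bytes(invert_raw_pbm(image.read_bytes()))
        subprocess.run(tesseract_pdf_cmd(image), check=True)
```

Calling something like `ocr_pdf("scan.pdf", "work", negate_pbm=True)` would then leave one searchable PDF per extracted image in the work folder; merging them back into a single file is a separate step that is not shown here.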
Assuming the API uses the installed tesseract version, there should be no difference between using the API and the command line, unless there is a bug somewhere in the API.
Frank