Channel: FiveTech Software tech support forums

OCR for scanned documents

Hi Reinaldo,

here are my experiences with tesseract so far; perhaps they help in some cases:

- I want direct output of a searchable PDF, not only hOCR. This is possible with version 3.03 and higher.
- I want to run tesseract on Windows machines, so I was looking for Windows binaries of version 3.03 or higher.
- I found them here: https://github.com/UB-Mannheim/tesseract/wiki
- My first and last experience with TIFF was this: 'Error in pixReadFromTiffStream: spp not in set {1,3,4}'
- So I changed to PNG with good results, as recommended here: http://stackoverflow.com/questions/5083492/tesseract-and-tiff-format-spp-not-in-set-1-3: 'Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.'
- Later on, I changed the picture splitting to pdfimages, which generates ppm/pbm and optionally jpg pictures on the fly.
- I got good OCR results, perhaps due to this fact: 'pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.'
- So normally there is no problem with PDFs coming directly from the scanner, but be careful with PDFs that were scanned and then reworked.
- To keep the PDF readable, I negate (color inversion) the pbm pictures.

Assuming that the API uses the installed tesseract version, there should be no difference between using the API and the command line, unless there is a bug somewhere in the API.

Frank
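For anyone who wants to script the pdfimages-then-tesseract workflow described above, here is a rough sketch in Python. It is only an illustration, not the setup used in this post: the function names, the "page" image root and the "eng" language are assumptions, and it expects pdfimages and tesseract (3.03 or higher) to be on the PATH. The color inversion of the pbm pictures is left out.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf_path, image_root, jpeg=False):
    """Build the pdfimages call; -j saves DCT-encoded images as jpg files."""
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")
    cmd += [str(pdf_path), str(image_root)]
    return cmd


def tesseract_pdf_cmd(image_path, lang="eng"):
    """Build the tesseract call that writes <image name without suffix>.pdf."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_scanned_pdf(pdf_path, work_dir, lang="eng"):
    """Extract the raw page images, then OCR each one into a searchable PDF."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # "page" is a hypothetical image root; pdfimages appends a running number.
    subprocess.run(pdfimages_cmd(pdf_path, work_dir / "page"), check=True)

    results = []
    for image in sorted(work_dir.glob("page-*")):
        if image.suffix.lower() not in {".ppm", ".pbm", ".jpg"}:
            continue  # skip anything that is not an extracted picture
        subprocess.run(tesseract_pdf_cmd(image, lang), check=True)
        results.append(image.with_suffix(".pdf"))
    return results


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for pdf in ocr_scanned_pdf("scan.pdf", "ocr_work"):
        print(pdf)
```

This produces one searchable PDF per extracted picture; merging them back into a single document is a separate step that is not shown here.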
