I have already asked about this on the Tesseract forum.
I'm trying to extract the text of this PDF file using Tesseract (with ImageMagick for the PDF-to-image conversion).
This is the section of the PDF that I'm working on; it is line #7 of the PDF:
In this section, Tesseract is running into problems when trying to identify the string CONSTRUCTORA.
It outputs CO NSTRUCTO RA, with spurious spaces inside the word.
It should output CONSTRUCTORA.
Can anyone suggest any possible fixes for this?
This is the command-line sequence:
convert -density 600 my_pdf.pdf tmp.tif
tesseract -l spa tmp.tif stdout > tmp.txt
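To check the result after each attempt, something like the following shell helper could be used. It is only a sketch: check_word is a name made up for illustration (it is not part of Tesseract or ImageMagick), and it assumes the OCR text arrives on standard input, for example from tmp.txt.

```shell
#!/bin/sh
# Hypothetical helper: reads OCR text on stdin and reports whether the
# expected word appears intact, appears only once spaces are removed
# (i.e. it was split by spurious spaces), or does not appear at all.
check_word() {
    word=$1
    text=$(cat)
    if printf '%s\n' "$text" | grep -qF -- "$word"; then
        echo intact
    elif printf '%s\n' "$text" | tr -d ' ' | grep -qF -- "$word"; then
        # Caveat: removing every space can also join neighbouring words,
        # so "split" is a hint, not proof.
        echo split
    else
        echo missing
    fi
}
```

It would be called as `check_word CONSTRUCTORA < tmp.txt`; with the output described above, it should report the word as split rather than intact.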
These are the software versions:
~% tesseract --version
tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8
~% convert --version
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP