How to keep Tesseract from inserting extra whitesp

Posted 2019-02-28 08:57

Question:

I have already asked about this on the Tesseract forum.

Using Tesseract (and ImageMagick), I'm trying to extract the text of this PDF file.

This is the section of the PDF that I'm working on; it is line #7 of the PDF:

In this section, Tesseract is running into problems when trying to identify the string CONSTRUCTORA.

It sees CO NSTRUCTO RA

It should see CONSTRUCTORA

Can anyone suggest any possible fixes for this?

This is the commandline sequence:

convert -density 600 my_pdf.pdf tmp.tif 
tesseract -l spa tmp.tif stdout > tmp.txt 

These are the software versions:

~% tesseract --version 
tesseract 3.05.01 
leptonica-1.74.4 
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : 
libtiff 4.0.3 : zlib 1.2.8 
~% convert --version 
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org 
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC 
Features: OpenMP 

Answer 1:

To deal with the irregular kerning in the PDF file, Will suggested tweaking the parameters related to tosp_min_sane_kn_sp, which are listed in the docs: https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_parameters.md

Setting tosp_min_sane_kn_sp=2.8 solved the issue that was described in the question.

The new Tesseract invocation is the following:

tesseract -c tosp_min_sane_kn_sp=2.8 -l spa tmp.tif stdout > tmp.txt

The default value for tosp_min_sane_kn_sp seems to be 1.5. So far, I have only tested with values larger than 1.5.
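
Since only a few values have been tried so far, a small sweep can show where the word stops being split. The following is a minimal sketch, not a definitive recipe: the helper names (show_param, run_ocr, classify, sweep) and the list of candidate values are illustrative assumptions, and show_param assumes the installed build supports the --print-parameters option. It reuses the same invocation as above and only checks whether the expected word appears unbroken in the OCR output.

```shell
#!/bin/sh
# Sketch only: helper names and candidate values are illustrative assumptions.

# Show the value that the installed build reports for the parameter
# (assumes this build supports --print-parameters).
show_param() {
    tesseract --print-parameters 2>/dev/null | grep tosp_min_sane_kn_sp
}

# Run OCR with one candidate value.
# $1 = tosp_min_sane_kn_sp value, $2 = input image
run_ocr() {
    tesseract -c "tosp_min_sane_kn_sp=$1" -l spa "$2" stdout 2>/dev/null
}

# Read OCR text on stdin; print "intact" if the expected word ($1)
# appears as one whole word, otherwise print "split".
classify() {
    if grep -qwF -- "$1"; then
        echo intact
    else
        echo split
    fi
}

# Try every candidate value and print one "value<TAB>result" line each.
# $1 = input image, $2 = expected word, remaining args = candidate values
sweep() {
    image=$1
    word=$2
    shift 2
    for v in "$@"; do
        printf '%s\t%s\n' "$v" "$(run_ocr "$v" "$image" | classify "$word")"
    done
}

# Example (not executed here):
#   sweep tmp.tif CONSTRUCTORA 1.5 2.0 2.8 3.5
```

A value reported as "intact" only means that this one word was recognized as a whole; the rest of the page should still be checked, because a larger value may also merge words that are meant to be separate.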