I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants had to segment and various documents (academic paper here). Here's an example from that paper illustrating what I want to create:
I have built the latest version of tesseract using brew, brew install tesseract --HEAD
, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/
to output labelled boxes. The output received using hocr
as the config, i.e.
tesseract infile.tiff outfile_stem -l eng -psm 1 hocr
gives a bounding box for everything and has some labelling in class
tags e.g.
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' ...
but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
The current head version details:
tesseract 3.04.00
leptonica-1.71
libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5
Edit
I'm really looking to achieve this using the command line tool (as in examples above). @nguyenq has pointed me to the API reference, unfortunately I have no c++ experience. If the only solution is to use the API, please can you provide a quick python example?
You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.
With Tesseract 4.0.0, a command like
tesseract source/dir/myimage.tiff target/directory/basefilename hocr
will create abasefilename.hocr
file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without thehocr
config creates a text file with newlines between block-level text, but the hocr format is more explicit.More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
The simplest way to have a HOCR file with the individual character level is use the nickjwhite's fork of Tesseract 3.05 : https://github.com/nickjwhite/tesseract/tree/hocrcharboxes
Compile and download tessdata files following the wiki of Tesseract. Once the installation check, use :
tesseract {image file} -c tessedit_create_hocr=1 -c hocr_char_boxes=1 {output name}
and tadam !
Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.
Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.
Overview
Code
Results
Document with overlays (rollover to see text and type) Overlays alone (use GUI buttons to toggle)
Appendix
You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!
At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:
Unfortunately, I kept getting null pointers.
Shortcut
It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.