A friend and I are interested in training the Tesseract OCR engine for a CV project. We tried using some wrappers such as PyTesser and pyocr, but the results are currently not as accurate as we need them to be. As such, we want to try training Tesseract to perform better for our purposes (i.e. identifying text on food labels), but we are having some trouble installing the training tools.
What we've tried:
On the Google Code website, the 'Compiling' page of the tesseract-ocr wiki says the training tools are only available in version 3.03. However, the Google Code 'Downloads' page for tesseract-ocr only has the materials for 3.02. The bottom of the 'Compiling' page also has some comments about installing version 3.03 on Windows and OSX, but no comments yet for Linux users.
There also appears to be some sort of 3.03 source package for Ubuntu, but we're not sure how to access it on our computers, and the 'Compiling' page says we need to run these commands:
make training
sudo make training-install
We've also found a Google Groups thread about tesseract 3.03, but again it seems these posts do not include advice for Linux users (unless we missed something during the initial read).
Is this actually a really simple command-line install problem? Or is there a way to train tesseract with 3.02 (which we currently have installed)? Have we been looking in the wrong places for information?
Any advice or links to instructions for installing tesseract-ocr 3.03 for Linux distributions would be greatly appreciated! Thanks.
Tesseract can directly be installed in Ubuntu 14.04 using
sudo apt-get install tesseract-ocr
I don't know whether this works on older versions of Ubuntu, because the package may only have been added or updated in the repositories of later releases.
I had an AWS Ubuntu 14.04 instance.
When I tried installing Tesseract with
sudo apt-get install tesseract-ocr
it returned a "package not found" error.
Refreshing the package lists first fixed it:
sudo apt-get update
sudo apt-get install tesseract-ocr
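Once the package is installed, it may be worth checking whether the version you got is at least 3.03, since the question says the training tools need that release. Below is a minimal sketch of such a check. The function names (version_ge, installed_version, check_tesseract) are made up for illustration, and it assumes GNU sort (for the -V option) and that the first line of `tesseract --version` looks like "tesseract 3.03".

```shell
#!/bin/sh
# Hypothetical helpers, not part of Tesseract or Ubuntu.

# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Print the installed Tesseract version, or fail if it is not on PATH.
# Older releases print the version on stderr, hence 2>&1.
# Assumes the first output line has the form "tesseract <version>".
installed_version() {
    command -v tesseract >/dev/null 2>&1 || return 1
    tesseract --version 2>&1 | awk 'NR==1 {print $2}'
}

# Report whether the installed version meets the required one (default 3.03).
check_tesseract() {
    required=${1:-3.03}
    if ! current=$(installed_version); then
        echo "tesseract is not installed"
        return 1
    fi
    if version_ge "$current" "$required"; then
        echo "tesseract $current is at least $required"
    else
        echo "tesseract $current is older than $required"
        return 1
    fi
}
```

After running the two apt-get commands above, source this file and call check_tesseract (or check_tesseract 3.02 to test against a different version).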
Ubuntu is a Debian-based Linux distribution. The tesseract package you find will most likely be a Debian package, which will contain tesseract and the required default language files to allow you to run/train tesseract. You do NOT want the source package unless you just want to compile it yourself, and there is no need for that. You will not have to build tesseract; you just need to install the package. First, it appears you are new to Ubuntu, so please read InstallingSoftware. It can be as easy as opening a terminal and issuing the command sudo apt-get install tesseract-pkgname
(note: tesseract-pkgname stands for whatever the package name actually is).
There is no shortcut: take the time to understand whether you have a .deb package on your box that needs to be installed or whether you are installing from a remote repository. The link above explains how to handle both.
Here is a specific Ubuntu thread dealing with installing tesseract: "Tesseract 3.0 + Ubuntu 10.04 Installation Guide". Hope that helps. Tesseract is very good software.
I don't have any instructions for building Tesseract 3.03 for Linux specifically (I'm on Mac), but here's a link to download the source code for the 3.03 release candidate: https://tesseract-ocr.googlecode.com/archive/3.03-rc1.tar.gz
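For what it's worth, here is an untested sketch of how a build from that tarball might go on Ubuntu, ending with the two training commands from the question. Several things in it are assumptions: the apt package names for the build and training dependencies, the work directory name, and the run/build_tesseract helpers (which are just illustrative). The download URL is the one above and may no longer be reachable. By default the script only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch only: build Tesseract 3.03-rc1 and its training tools from source.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_tesseract() {
    url=https://tesseract-ocr.googlecode.com/archive/3.03-rc1.tar.gz
    tarball=tesseract-3.03-rc1.tar.gz
    src=tesseract-3.03-rc1          # assumed work directory name

    # Assumed Ubuntu package names for the compiler toolchain, Leptonica,
    # and the libraries the training tools depend on (ICU, Pango, Cairo).
    run sudo apt-get update &&
    run sudo apt-get install -y build-essential autoconf automake libtool \
        pkg-config libleptonica-dev libicu-dev libpango1.0-dev libcairo2-dev &&
    run wget -O "$tarball" "$url" &&
    run mkdir -p "$src" &&
    run tar -xzf "$tarball" -C "$src" --strip-components=1 &&
    run cd "$src" &&
    # Standard autotools build, then the training targets from the question.
    run ./autogen.sh &&
    run ./configure &&
    run make &&
    run sudo make install &&
    run sudo ldconfig &&
    run make training &&
    run sudo make training-install
}
```

Source the file and call build_tesseract to see the planned commands first; if they look right for your system, rerun with DRY_RUN=0.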