I want to find an easy-to-use OCR Python module for Linux. I found pytesser (http://code.google.com/p/pytesser/), but it ships with a .exe executable.
I tried changing the code to use Wine, and it does work, but it's too slow and not really a good idea.
Are there any Linux alternatives that are as easy to use?
You can just wrap tesseract in a function:
    import os
    import subprocess
    import tempfile

    def ocr(path):
        # Reserve a unique base name; tesseract appends '.txt' to it.
        temp = tempfile.NamedTemporaryFile(delete=False)
        temp.close()
        process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        process.communicate()
        with open(temp.name + '.txt', 'r') as handle:
            contents = handle.read()
        # Clean up both the OCR output and the placeholder file.
        os.remove(temp.name + '.txt')
        os.remove(temp.name)
        return contents
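A variant of the same idea, as a rough sketch only: reasonably recent Tesseract builds accept `stdout` as the output base name, which avoids the temporary file entirely. The function names `build_command` and `ocr_to_string` are made up for this example, and it assumes Python 3 with a `tesseract` binary on the PATH.

```python
import shutil
import subprocess


def build_command(image_path, lang="eng"):
    """Return the argv list for a Tesseract run that prints text to stdout."""
    # Passing 'stdout' as the output base makes Tesseract print the text
    # instead of writing <base>.txt (assumes a build that supports this).
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_to_string(image_path, lang="eng"):
    """Run Tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Calling `ocr_to_string("scan.png")` should then return the text directly, and a failed run raises an exception instead of silently returning nothing.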
If you want document segmentation and more advanced features, try out OCRopus.
In addition to Blender's answer, which simply runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.
ABBYY command-line OCR utility: http://ocr4linux.com/en:start
It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, if you need more sophisticated layout analysis, or if you need to export to PDF, Word and other formats.
Update: here's a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
Disclaimer: I work for ABBYY
python-tesseract:
http://code.google.com/p/python-tesseract
    import cv2.cv as cv
    import tesseract

    api = tesseract.TessBaseAPI()
    api.Init(".", "eng", tesseract.OEM_DEFAULT)
    api.SetPageSegMode(tesseract.PSM_AUTO)
    image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
    tesseract.SetCvImage(image, api)
    text = api.GetUTF8Text()
    conf = api.MeanTextConf()
You should try the excellent scikits.learn libraries for machine learning. You can find two ready-to-run examples here and here.
You have a bunch of options here.
One way, as others have pointed out, is to use Tesseract. There are quite a few wrappers by now, so the best approach is a quick PyPI search. The most used ones these days are:
- pytesseract
- pytesser
- tesserwrap
- pyocr
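For example, a minimal sketch with pytesseract might look like the following. It assumes the `pytesseract` and `Pillow` packages are installed along with the Tesseract binary itself; the helper name `image_to_text` is invented for this example.

```python
def image_to_text(path, lang="eng"):
    """Return the text pytesseract recognizes in the image at path."""
    # Imported lazily so this module still loads when the packages are missing
    # (both are assumed to be installed before the function is called).
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang=lang)
```

Usage would be something like `print(image_to_text("eurotext.jpg"))`. The other wrappers in the list have their own, different APIs.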
Another useful site for finding similar engines is AlternativeTo (alternativeto.net). A few Linux-based options it lists are:
- ABBYY
- Tesseract
- CuneiForm
- Ocropus
- GOCR
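Before picking one, it can help to check which of these engines are already installed. The sketch below is only an illustration: the executable names in `CANDIDATES` are assumptions (they can differ by distribution and version), and `installed_engines` is a made-up helper, not part of any of these projects.

```python
import shutil

# Assumed executable names; adjust them to match your distribution.
CANDIDATES = {
    "Tesseract": "tesseract",
    "CuneiForm": "cuneiform",
    "GOCR": "gocr",
}


def installed_engines(candidates=CANDIDATES, which=shutil.which):
    """Return the sorted names of engines whose executable is on the PATH."""
    return sorted(
        name for name, exe in candidates.items() if which(exe) is not None
    )
```

Running `print(installed_engines())` lists whichever of the assumed executables were found, so you can decide what to wrap with `subprocess`.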