I used the code below in Python to extract text from an image:
import cv2
import numpy as np
import pytesseract
from PIL import Image
# Path of working folder on Disk
src_path = "<dir path>"
def get_string(img_path):
    # Read image with OpenCV
    img = cv2.imread(img_path)
    # Convert to grayscale
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Write image after noise removal
    cv2.imwrite(src_path + "removed_noise.png", img)
    # Apply threshold to get an image with only black and white
    # img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    # Write the image after the OpenCV preprocessing
    cv2.imwrite(src_path + "thres.png", img)
    # Recognize text with Tesseract for Python
    result = pytesseract.image_to_string(Image.open(img_path))  # or: Image.open(src_path + "thres.png")
    # Remove temporary file
    # os.remove(temp)
    return result
print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))
print("------ Done -------")
But the output is incorrect. The input file is an image containing the text 'D001' and 'B001', yet:
The output received is '0001' instead of 'D001'.
The output received is '3001' instead of 'B001'.
What code changes are required to retrieve the right characters from the image, and how can pytesseract be trained to return the right characters for all font types in the image (including bold characters)?
Try different config parameters in the image_to_string line, as shown in the sketch below.
Try changing the psm value and compare the results.
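For example, something like the following (a minimal sketch: the psm values and the character whitelist are illustrative assumptions rather than the definitive settings, img_path refers to the same variable as in your get_string function, and Tesseract 3.x expects '-psm' with a single dash):

import pytesseract
from PIL import Image

# Treat the image as a single uniform block of text (psm 6); also try 7, 8 or 13 and compare
result = pytesseract.image_to_string(Image.open(img_path), config='--psm 6')

# Optionally restrict recognition to the characters you expect in the labels
# (tessedit_char_whitelist may be ignored by the LSTM engine in some Tesseract 4.x builds)
whitelist = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
result = pytesseract.image_to_string(
    Image.open(img_path),
    config='--psm 6 -c tessedit_char_whitelist=' + whitelist)
print(result)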
-- Good Luck --
@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.
But you can still improve the final output by applying some post-processing steps to the Tesseract output. Here are a few points you can think about and use if they help:
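For instance (my own illustrative assumption based on the outputs above, not something stated in the question): if you know each code starts with a letter followed by digits, a simple character-substitution step can correct digit/letter confusions such as '0' read instead of 'D' and '3' instead of 'B'. A minimal sketch:

# Digits that Tesseract commonly confuses with letters (assumed mapping for this label format)
DIGIT_TO_LETTER = {'0': 'D', '3': 'B', '8': 'B', '1': 'I', '5': 'S'}

def fix_label(text):
    """Post-process an OCR result assumed to be one letter followed by digits, e.g. 'D001'."""
    text = text.strip()
    if not text:
        return text
    first, rest = text[0], text[1:]
    # If the first character came back as a digit, replace it with the most likely letter
    if first.isdigit():
        first = DIGIT_TO_LETTER.get(first, first)
    return first + rest

print(fix_label('0001'))  # -> 'D001'
print(fix_label('3001'))  # -> 'B001'

This only helps when the expected format is known in advance; for free-form text you would need a different strategy, such as a dictionary lookup.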