Python OpenCV: remove noise from captcha

Posted 2020-07-26 11:22

Question:

I need to solve captchas automatically in order to grab public data from sites.

I use Python and OpenCV, and I'm a newbie at image processing. After some searching, I came up with the following method. Since the text in the captcha uses a group of related colours, I apply a mask in HSV space, then convert the image to grayscale and apply an adaptive threshold (ADAPTIVE_THRESH_MEAN_C) to remove noise from the image.

But this is not enough to remove the noise and allow automatic text recognition with OCR (Tesseract). See the images below.

Is there something I can improve in my solution, or is there a better way?

Original images:

Processed images:

Code:

import cv2
import numpy as np

img = cv2.imread("1.jpeg")
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Keep only pixels whose hue falls in the text colour range (here: green)
mask = cv2.inRange(hsv, (36, 0, 0), (70, 255, 255))
# mask = cv2.inRange(hsv, (0, 0, 0), (10, 255, 255))
# mask = cv2.inRange(hsv, (125, 0, 0), (135, 255, 255))

img = cv2.bitwise_and(img, img, mask=mask)
# Paint pure-black pixels (everything the mask rejected) white
img[np.where((img == [0, 0, 0]).all(axis=2))] = [255, 255, 255]

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, 2)

cv2.imwrite("out.png", img)

Answer 1:

I think you can get good results by smoothing the image first and then detecting its edges. Here is the code:

import cv2

img = cv2.imread("input.jpg")
# smoothing the image
img = cv2.medianBlur(img, 5)

# edge detection
edges = cv2.Canny(img, 100, 200)
cv2.imwrite('output.png', edges)



Answer 2:

You can try different approaches to achieve your goal. Your first image can be processed by applying a median filter (r=2), followed by adaptive thresholding.

Binary opening is another option you could try.

Note that the quality is lower than with the first approach (in particular, the last G is visibly degraded).

The second image responds to this treatment differently from the first one:

For the median approach:

For opening:

However, it is possible to extract the text by applying a median blur (r=1), followed by auto-contrast and then thresholding at 50.

As you can see, it is possible to improve the quality of your images enough for them to be recognizable. The first image can be converted to text without problems, but the second one can only be recognized partially.