I need to automatically remove the mildly colored background of a scanned document image for OCR.
ScanTailor is an open source C++ GUI-based app that does background separation among other things, but I cannot figure out how to run only the last step which actually removes the background.
Ideally, I could find the code that does this and either:
- Port that part to C#
- Modify the C++ to respond to command line execution, only performing that step on a given image
Can you help me understand how I can do either?
Or do you know of other libraries that can do this? (Any language/platform is acceptable.)
You are referring to Thresholding, Despeckling and Noise Removal techniques which are necessary in OCR applications.
The quality of the results depends very much on many different factors:
- Print quality of the original
- Scan quality
- Image resolution
- Background colours and patterns used
- Noise and other marks
You may find the IEvolution.NET library at http://www.hi-components.com/nievolution.asp useful. It has many image processing functions to play with.
There are many commercial engines available, but there is no one perfect function that solves every image processing problem. You must adapt the functions and parameters to match your images. See, for example, http://www.recogniform.com/thresholding.htm
A Google search will turn up lots of results.
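To make "thresholding" concrete, here is a minimal sketch of one common global method (Otsu's method) in plain Python. It is not what ScanTailor or the libraries above do internally. It assumes the scan has already been converted to grayscale and loaded as a list of rows of integers from 0 to 255. The function names are made up for illustration.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best splits pixels into dark and light."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # pixels at or below the candidate threshold
    sum_dark = 0    # sum of their gray levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor); larger is better.
        var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def remove_background(pixels):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    t = otsu_threshold(pixels)
    return [[0 if value <= t else 255 for value in row] for row in pixels]


# Tiny example: light tinted background (about 205-212) with dark text (25-40).
page = [
    [210, 205, 30, 208],
    [207, 25, 35, 212],
    [209, 206, 211, 40],
]
cleaned = remove_background(page)
```

A single global threshold like this only works when the background is fairly uniform. For uneven lighting or patterned backgrounds you would likely need a local (adaptive) threshold plus despeckling, which is where the parameter tuning mentioned above comes in.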
Maybe the algorithm is, approximately:
If it's a high-resolution low-color-depth (e.g. black-and-white) image, then you need to apply this algorithm to groups of pixels.
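The steps of the algorithm are not spelled out above, so the following does not reproduce them. It is only one guess at what "working on groups of pixels" could look like for a black-and-white scan, where a tinted background shows up as scattered black dots. The block size, the `max_ink` cutoff and the function name are all assumptions chosen for illustration.

```python
def clear_sparse_blocks(bitmap, block=4, max_ink=0.25):
    """Whiten every block whose fraction of black pixels is at most max_ink.

    bitmap is a list of rows containing 0 (black) or 1 (white).
    Returns a new bitmap; the input is left unchanged.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    out = [row[:] for row in bitmap]

    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            black = sum(1 for y in rows for x in cols if bitmap[y][x] == 0)
            area = len(rows) * len(cols)
            # A mostly white block is treated as dithered background.
            if black / area <= max_ink:
                for y in rows:
                    for x in cols:
                        out[y][x] = 1
    return out


# Left 4x4 block: two stray dots (background). Right 4x4 block: dense ink (text).
bitmap = [
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
result = clear_sparse_blocks(bitmap)
```

Be aware that a fixed grid like this can also erase thin strokes that happen to fall in an otherwise empty block, so the block size and cutoff would need to be tuned against real scans.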