This isn't really "OCR", since it's not recognizing characters, but it's the same idea applied to curves. Does anyone know of an image-processing library or established algorithm for retrieving the values from a (raster) plot image? For instance, in this graph, it's hard for me to read exact values by eye because there are such large gaps between the gridlines:
[Image: http://i35.tinypic.com/316airl.jpg]
I can use a straight edge or whatever, but it's still going to be error-prone. It would be great if there were software that could just take a screenshot of any old graph and automatically convert it into a table of values or a function that could be queried.
Seems to be called "curve recognition"? Could also be used for extracting data from the curves in scientific papers for which the underlying data is not published.
And it's ok to have some human guidance. There's no reason an OCR couldn't read the "100" and match it up with the line, for instance, but it's ok to have a human give the lines numerical values after the machine has extracted the curve's path relative to the gridlines. I'm mostly interested in the function of tracing the curve relative to the grid, even if the grid is tilted, rotated, or warped in a non-affine way.
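To make the human-guidance step concrete, here is a minimal sketch of how a traced curve could be converted from pixel coordinates to data values once a human has labelled three reference points (for example, three gridline intersections). It assumes the distortion is purely affine (shift, scale, rotation, shear), so it would not handle the warped, non-affine case. The function names and the sample coordinates are made up for illustration.

```python
def affine_from_three_points(pixel_pts, data_pts):
    """Solve for the affine map pixel -> data from three labelled points.

    pixel_pts: three (px, py) positions clicked on the image.
    data_pts:  the three (x, y) values a human assigned to them.
    Returns two rows (a, b, c), one per data axis, with
    value = a * px + b * py + c.
    """
    (x1, y1), (x2, y2), (x3, y3) = pixel_pts
    # Determinant of the 3x3 system; zero means the points are collinear.
    det = x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)
    if det == 0:
        raise ValueError("reference points must not be collinear")
    rows = []
    for k in range(2):  # k = 0 for the data x axis, k = 1 for the data y axis
        v1, v2, v3 = (p[k] for p in data_pts)
        # Cramer's rule for the two linear coefficients.
        a = (v1 * (y2 - y3) + v2 * (y3 - y1) + v3 * (y1 - y2)) / det
        b = (x1 * (v2 - v3) + x2 * (v3 - v1) + x3 * (v1 - v2)) / det
        c = v1 - a * x1 - b * y1
        rows.append((a, b, c))
    return rows


def pixel_to_data(rows, px, py):
    """Apply the fitted affine map to one pixel position."""
    return tuple(a * px + b * py + c for a, b, c in rows)


# Hypothetical calibration: origin, a point on the x axis, a point on the y axis.
# Pixel y grows downward, as in most image formats.
pixel_refs = [(100, 400), (500, 400), (100, 100)]
data_refs = [(0, 0), (10, 0), (0, 100)]
mapping = affine_from_three_points(pixel_refs, data_refs)

# Convert a few traced curve pixels (also hypothetical) into data values.
traced_pixels = [(140, 385), (300, 250), (460, 130)]
table = [pixel_to_data(mapping, px, py) for px, py in traced_pixels]
```

Because the three reference points do not have to lie on horizontal or vertical lines, the same calculation covers a tilted or rotated scan. A log-scaled axis would need the labelled values converted to logarithms first.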
Update:
There is now a Wikipedia article called "Converting scanned graphs to data" with a bunch of software in the links. There is also some software listed on alternativeto.net. I guess the theory belongs on http://dsp.stackexchange.com now, while the software solutions belong on http://superuser.com?
This is extremely hard and error-prone. (We do this sort of thing a lot in chemistry where we try to analyze chemistry.) It depends critically on various parameters and conditions.
- Is the image a bitmap (pixels only) or vectors (EMF, WMF, SVG, PS, PDF...)? Vectors are vastly better than pixels. We tackle vectors (including PDF) but don't touch pixels. Some of our collaborators will try to use pixels, but only on fairly recent documents.
- If you are stuck with pixels, are your images all from the same source? If so, you have a small chance of extracting font information. I am afraid your image is so poor that it would require a great deal of work. However, if you can work out the font, you have a chance of extracting text and numbers, provided all the documents are from the same source. You could use heuristics (rules such as where the numbers might be) or machine learning (a list of features on which the methods can be trained).
- Your image appears to have been scanned (as the axes are pixelated). That makes it even worse. What appears a straight line to the eye is horrible for a machine. Is your image skewed on the page? You may have to deskew it.
- If you have a model for the lines and curves then you may have a chance of fitting the model's expected parameters to the image. But it's not trivial.
I'm sorry to be pessimistic. If you really want the info then it can be done with a lot of investment or collaboration with groups which do this sort of thing.
Googling for "curve recognition software" suggests http://www.curveunscan.com/
http://www.digitizeit.de/ is a program for digitizing graphs.
There is also potrace, which is related, and its page in turn mentions other alternatives.
I don't know of any software that does what you're asking, but if you can get just a few points you can use some kind of regression to find the best function that fits those points. This particular graph looks like an exponential function. So you'd want to find an exponential regression calculator.
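As a rough sketch of the regression idea, assuming the curve really is of the form y = a * exp(b * x) and that all the y values are positive, you can fit a straight line to (x, ln y) with ordinary least squares. The function name and the sample points below are invented for illustration; they are not read from the graph in the question.

```python
import math


def fit_exponential(points):
    """Fit y = a * exp(b * x) by least squares on (x, ln y).

    points: iterable of (x, y) pairs with y > 0.
    Returns the pair (a, b).
    """
    points = list(points)
    if len(points) < 2:
        raise ValueError("need at least two points")
    xs = [x for x, _ in points]
    logs = [math.log(y) for _, y in points]  # fails for y <= 0
    n = len(points)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxy / sxx                         # slope of the line in log space
    a = math.exp(mean_log - b * mean_x)   # intercept, mapped back from log space
    return a, b


# Hypothetical points read off a graph by eye.
samples = [(0, 2.0), (1, 3.3), (2, 5.4), (3, 8.9)]
a, b = fit_exponential(samples)


def predict(x):
    """Query the fitted function at any x."""
    return a * math.exp(b * x)
```

Note that fitting in log space minimizes the relative error rather than the absolute error, so large y values are weighted less than they would be in a direct nonlinear fit. If the graph turns out not to be exponential, the same few points can be fed to a different model instead.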
I use im2graph to convert graph images to data, that is, numbers. im2graph is free and available for Linux and Windows. It runs very smoothly and requires very little effort on your part to generate results.
See http://www.im2graph.co.il