I have a PDF file that contains data we need to import into a database. The files seem to be PDF scans of printed alphanumeric text, apparently 10 pt Times New Roman.
Are there any tools or components that will allow me to recognize and parse this text?
A quick Google search turned up this promising result: http://www.pdftron.com/net/index.html
Based on Mark Brackett's answer, I created a NuGet package to wrap pdftotext.
It's open source and targets .NET Standard 1.6 and .NET Framework 4.5.
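For anyone curious what wrapping pdftotext involves, it comes down to running the command-line tool and capturing its output. Here is a rough sketch in Python (not the package's API; the function names `build_pdftotext_command` and `extract_text` are made up for illustration). It assumes the `pdftotext` binary from Xpdf or Poppler is installed and on your PATH. Keep in mind that pdftotext only reads an embedded text layer, so a pure image scan with no text layer would need an OCR step first.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for a pdftotext call that writes to stdout.

    Hypothetical helper for illustration only.
    """
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout where it can,
        # which helps when the data is arranged in columns.
        cmd.append("-layout")
    # Passing "-" as the output file sends the text to standard output.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    # Decode leniently so one odd byte does not abort the whole import.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("scan.pdf")` (with a file name of your own) would give you the text to parse before loading it into the database. If it comes back empty, the PDF most likely has no text layer.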
Usage:
At a company I used to work for, we used ActivePDF toolkit with some success:
http://www.activepdf.com/products/serverproducts/toolkit/index.cfm
I think you'd need at least the Standard or Pro version, but they offer trials so you can check whether it does what you want.
I have posted about parsing PDFs on my blog. See this link:
http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx
Edit: The link no longer works. Quoted below from http://web.archive.org/web/20130507084207/http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx