Programmatically recognize text from scans in a PDF

Posted 2019-03-08 08:00

I have a PDF file containing data that we need to import into a database. The pages appear to be scans of printed alphanumeric text, which looks like 10 pt Times New Roman.

Are there any tools or components that will allow me to recognize and parse this text?

Tags: pdf ocr
10 answers
再贱就再见
#2 · 2019-03-08 08:34

A quick Google search turns up this promising result: http://www.pdftron.com/net/index.html

手持菜刀,她持情操
#3 · 2019-03-08 08:35

Based on Mark Brackett's answer, I created a NuGet package that wraps pdftotext.

It's open source and targets .NET Standard 1.6 and .NET Framework 4.5.

Usage:

using XpdfNet;

var pdfHelper = new XpdfHelper();

string content = pdfHelper.ToText("./pathToFile.pdf");
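
For readers who want to see what wrapping pdftotext involves, here is a minimal sketch that shells out to the command-line tool directly. It is not the package's actual implementation. It assumes a `pdftotext` executable (Xpdf or Poppler) is on the PATH, and the class and method names are hypothetical. Note that pdftotext only reads a PDF's embedded text layer, so for a pure image scan the result will likely be empty and an OCR step would still be needed.

```csharp
using System;
using System.Diagnostics;
using System.Text;

static class PdfToTextRunner // hypothetical helper, not part of XpdfNet
{
    // Runs `pdftotext -enc UTF-8 <pdfPath> -` and returns what it writes to stdout.
    // Passing "-" as the output file tells pdftotext to print the text to stdout.
    public static string ExtractText(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext", // assumption: the executable is on the PATH
            Arguments = "-enc UTF-8 \"" + pdfPath + "\" -",
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false, // required for stream redirection
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read everything before waiting, so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            }
            return text;
        }
    }

    static void Main(string[] args)
    {
        string text = ExtractText(args[0]);

        // Whitespace-only output (including page-break form feeds) suggests
        // the PDF has no text layer, which is typical of a plain scan.
        if (string.IsNullOrWhiteSpace(text))
        {
            Console.Error.WriteLine("No text layer found; this file probably needs OCR.");
        }
        else
        {
            Console.Write(text);
        }
    }
}
```

The empty-output check is a cheap way to tell a searchable PDF from an image-only scan before deciding whether an OCR tool is required.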
爱情/是我丢掉的垃圾
#4 · 2019-03-08 08:42

At a company I used to work for, we used the ActivePDF Toolkit with some success:

http://www.activepdf.com/products/serverproducts/toolkit/index.cfm

I think you'd need at least the Standard or Pro version, but they offer trials, so you can check whether it does what you need.

狗以群分
#5 · 2019-03-08 08:44

I posted about parsing PDFs on my blog:

http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Edit: The link no longer works. The text below is quoted from http://web.archive.org/web/20130507084207/http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Well, the following is based on popular examples available on the web. It "reads" the PDF file and outputs its contents as text in the rich text box control on the form. The PDFBox for .NET library can be downloaded from SourceForge.

You need to add references to IKVM.GNU.Classpath and PDFBox-0.7.3. FontBox-0.1.0-dev.dll and PDFBox-0.7.3.dll also need to be added to the bin folder of your application. For some reason I can't recall (maybe it's from one of the tutorials), I also added IKVM.GNU.Classpath.dll to the bin folder.

On a side note, I just got my copy of "Head First C#" (on Keith's suggestion) from Amazon. The book is cool! It really is written for beginners. This edition covers VS2008 and .NET Framework 3.5.

Here you go...

/* Marlon Ribunal
 * Convert PDF To Text
 * *******************/

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Drawing.Printing;
using System.IO;
using System.Text;
using System.ComponentModel.Design;
using System.ComponentModel;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace MarlonRibunal.iPdfToText
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent(); 
        }

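        // Extracts the text of the PDF and shows it in the rich text box.
        void Button1Click(object sender, EventArgs e)
        {
            PDDocument doc = PDDocument.load("C:\\pdftoText\\myPdfTest.pdf");
            try
            {
                PDFTextStripper stripper = new PDFTextStripper();
                richTextBox1.Text = stripper.getText(doc);
            }
            finally
            {
                // Close the document so the underlying file is released.
                doc.close();
            }
        }

    }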
}