Programmatically recognize text from scans in a PD - Answers, page 2

I have a PDF file containing data that we need to import into a database. The files seem to be PDF scans of printed alphanumeric text, apparently set in 10 pt. Times New Roman.

Are there any tools or components that will allow me to recognize and parse this text?

Tags: pdf ocr

10 answers

再贱就再见

Reply #2 · 2019-03-08 08:34

A quick Google search turns up this promising result: http://www.pdftron.com/net/index.html

手持菜刀，她持情操

Reply #3 · 2019-03-08 08:35

Based on Mark Brackett's answer, I created a NuGet package that wraps pdftotext.

It's open source and targets .NET Standard 1.6 and .NET Framework 4.5.

Usage:

using XpdfNet;

var pdfHelper = new XpdfHelper();

string content = pdfHelper.ToText("./pathToFile.pdf");

爱情/是我丢掉的垃圾

Reply #4 · 2019-03-08 08:42

At a company I used to work for, we used ActivePDF toolkit with some success:

http://www.activepdf.com/products/serverproducts/toolkit/index.cfm

I think you'd need at least the Standard or Pro version, but they offer trials so you can check whether it does what you want.

狗以群分

Reply #5 · 2019-03-08 08:44

I posted about parsing PDFs on my blog. See this link:

http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Edit: The link no longer works. The following is quoted from http://web.archive.org/web/20130507084207/http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Well, the following is based on popular examples available on the web. It "reads" the PDF file and outputs its text in the rich text box control on the form. The PDFBox for .NET library can be downloaded from SourceForge.

You need to add references to IKVM.GNU.Classpath and PDFBox-0.7.3. FontBox-0.1.0-dev.dll and PDFBox-0.7.3.dll also need to be added to the bin folder of your application. For some reason I can't recall (maybe it's from one of the tutorials), I also added IKVM.GNU.Classpath.dll to the bin folder.

On a side note, I just got my copy of "Head First C#" (on Keith's suggestion) from Amazon. The book is cool! It is really written for beginners. This edition covers VS2008 and .NET Framework 3.5.

Here you go...

/* Marlon Ribunal
 * Convert PDF To Text
 * *******************/

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Drawing.Printing;
using System.IO;
using System.Text;
using System.ComponentModel.Design;
using System.ComponentModel;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace MarlonRibunal.iPdfToText
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent(); 
        }

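        // Extracts the text layer of the PDF and shows it in the rich text box.
        void Button1Click(object sender, EventArgs e)
        {
            PDDocument doc = PDDocument.load("C:\\pdftoText\\myPdfTest.pdf");
            try
            {
                PDFTextStripper stripper = new PDFTextStripper();
                richTextBox1.Text = stripper.getText(doc);
            }
            finally
            {
                // Release the document even if extraction throws.
                doc.close();
            }
        }
    }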
}
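
One caveat about the text extractors mentioned in these answers: pdftotext and PDFBox read the text layer embedded in a PDF and do not perform OCR, so a page that is purely a scanned image typically comes back empty or nearly empty. Here is a minimal sketch of a check you could run on the extracted string before importing it. `ExtractionCheck`, `LooksImageOnly`, and the 20-character threshold are hypothetical names and values chosen for illustration; they are not part of XpdfNet or PDFBox.

```csharp
using System;
using System.Linq;

public static class ExtractionCheck
{
    // Hypothetical helper (not from XpdfNet or PDFBox): returns true when the
    // extracted text holds fewer than minChars letters or digits, which
    // suggests the PDF has no usable text layer and would need OCR.
    public static bool LooksImageOnly(string extractedText, int minChars = 20)
    {
        if (string.IsNullOrWhiteSpace(extractedText))
            return true;

        int alphanumeric = extractedText.Count(c => char.IsLetterOrDigit(c));
        return alphanumeric < minChars;
    }

    public static void Main()
    {
        // Output of an extractor on an image-only page: whitespace and a form feed.
        Console.WriteLine(LooksImageOnly(" \n\f"));

        // Output of an extractor on a page with a real text layer.
        Console.WriteLine(LooksImageOnly("Invoice 10442, dated 2008-03-04, total 118.50"));
    }
}
```

If the check returns true, the file probably needs to go through an OCR tool before any of the extractors above will give you text to parse. The threshold is arbitrary and should be tuned against a few of your own files.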
