parse PDF with iTextSharp and then extract specifi

Posted 2019-09-20 14:00

Question:

I am trying to extract certain content from a PDF file. The file is an invoice, and I want to search it for the labels "Invoice Number:" and "First Name", extract them, and print them with Console.WriteLine().

This is what I have at the moment, and I need to figure out how to move further:

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfProperties
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/PDF/invoiceDetail.pdf");
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);

            // The using block flushes and closes the writer, even if an exception is thrown.
            using (StreamWriter sw = new StreamWriter("C:/PDF/result0.txt"))
            {
                // Pages are numbered from 1.
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    SimpleTextExtractionStrategy strategy =
                        parser.ProcessContent(i, new SimpleTextExtractionStrategy());
                    string text = strategy.GetResultantText();

                    sw.WriteLine(text);

                    // Not used yet: this is where the search for the invoice fields should go.
                    string[] splitText = text.Split(new char[] { '.' });

                    Console.WriteLine("Test");

                    Console.WriteLine(text);
                }
            }

            reader.Close();
        }
    }
}

Any help would be greatly appreciated.

Answer 1:

Hi, you could try this:

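string[] splitText = text.Split('.');
for (int i = 0; i < splitText.Length; i++)
{
    // Contains rather than ==, because a piece usually holds more than the label itself
    if (splitText[i].Contains("Invoice Number:"))
    {
        // we have Invoice Number
    }

    // now we search for First Name
    // (a separate check: the same piece can never be equal to both labels)
    if (splitText[i].Contains("First Name"))
    {
        // now we have also First Name
    }
}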
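
To get the value that follows a label rather than just detecting the label, a regular expression over the extracted text may be easier than splitting on '.'. The following is only a sketch: it assumes each label and its value end up on the same line of the extracted text, which depends on how the PDF was produced. The names InvoiceFields and ValueAfter and the sample text are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class InvoiceFields
{
    // Returns the rest of the line after the label, trimmed, or null when the label is absent.
    // Assumption: label and value sit on the same line of the extracted text.
    public static string ValueAfter(string text, string label)
    {
        // Regex.Escape keeps characters in the label literal.
        // An optional colon may follow the label; [^\r\n]* stops at the end of the line.
        Match m = Regex.Match(text, Regex.Escape(label) + @"[ \t]*:?[ \t]*([^\r\n]*)");
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    static void Main()
    {
        // Made-up sample standing in for strategy.GetResultantText()
        string text = "ACME Ltd.\nInvoice Number: INV-1042\nFirst Name: Jane\nTotal: 12.50";

        Console.WriteLine(ValueAfter(text, "Invoice Number:")); // INV-1042
        Console.WriteLine(ValueAfter(text, "First Name"));      // Jane
    }
}
```

In the question's loop, ValueAfter(text, "Invoice Number:") could be called once per page. If the label and its value come out on different lines, the pattern would need to be adjusted.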


Answer 2:

There are two ways of going about this:

  1. You can try to process the invoice yourself. That means handling structure and dealing with edge cases. What if the content isn't always aligned in the same way? What if the template of the invoice changes? What if some text in the invoice is variable, so you can't rely on the precise text being extracted?

    This is, in short, not a trivial problem to solve.

  2. Use pdf2Data. It was specifically designed to handle documents that are rich in structure, like invoices. It uses a concept called "selectors" that allows you to define where you expect certain content to be, either by position (somewhere in the rectangle defined by given coordinates) or by structural blocks (a given row from a given table), etc.

    Even though the add-on is closed source, you can always try it out by using a trial license. After evaluating pdf2Data, you can at least make a more informed decision about which route you're willing to take to tackle this problem.

    Check out itextpdf.com/itext7/pdf2Data for more information.
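
To make the idea of selecting "by position" more concrete, here is a minimal sketch in plain C#. It is not pdf2Data's API. It assumes the text has already been extracted as chunks with page coordinates, and TextChunk, RegionSelector, TextInRegion and all coordinates are hypothetical names and values used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: one piece of text plus the position of its lower-left corner on the page.
class TextChunk
{
    public string Text;
    public float X;
    public float Y;

    public TextChunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

static class RegionSelector
{
    // Joins the chunks whose origin lies inside the rectangle, top to bottom, then left to right.
    public static string TextInRegion(IEnumerable<TextChunk> chunks,
                                      float left, float bottom, float right, float top)
    {
        IEnumerable<string> inside = chunks
            .Where(c => c.X >= left && c.X <= right && c.Y >= bottom && c.Y <= top)
            .OrderByDescending(c => c.Y)   // PDF y coordinates grow upwards
            .ThenBy(c => c.X)
            .Select(c => c.Text);
        return string.Join(" ", inside);
    }

    static void Main()
    {
        // Made-up chunks and coordinates, for illustration only
        var chunks = new List<TextChunk>
        {
            new TextChunk("Invoice Number:", 50f, 700f),
            new TextChunk("INV-1042", 160f, 700f),
            new TextChunk("First Name", 50f, 680f),
            new TextChunk("Jane", 160f, 680f),
        };

        // Everything in the right-hand column on the invoice number row
        Console.WriteLine(TextInRegion(chunks, 150f, 695f, 300f, 705f)); // INV-1042
    }
}
```

A selector like this only works while the layout stays fixed, which is exactly the weakness described in option 1: if the template moves the field, the rectangle has to move with it.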