Text extraction from a PDF using iText7. How to im

2019-06-10 11:12发布

问题:

Currently, I use this code to extract text from a Rectangle (area).

public static class ReaderExtensions
{
    public static string ExtractText(this PdfPage page, Rectangle rect)
    {
        var filter = new IEventFilter[1];
        filter[0] = new TextRegionEventFilter(rect);
        var filteredTextEventListener = new FilteredTextEventListener(new LocationTextExtractionStrategy(), filter);
        var str = PdfTextExtractor.GetTextFromPage(page, filteredTextEventListener);
        return str;
    }
}

It works, but I don't know if it's the best way to do it.

Also, I wonder if the GetTextFromPage could be improved by the iText team to increase its performance, since I'm processing hundreds of pages in big PDFs and it usually takes more than 10 minutes to do it using my current configuration.

EDIT:

From the comments: It seems that iText can extract the text of multiple rectangles on the same page in one pass, something that can improve the performance (batched operations tend to be more efficient), but how?

MORE DETAILS!

My goal is to extract data from a PDF with multiple pages. Each page has the same layout: a table with rows and columns.

Currently, I'm using the method above to extract the text of each rectangle. But, as you see, the extraction isn't batched. It's only a rectangle at a time. How could I extract all the rectangles of a page in a single pass?

回答1:

As already mentioned in a comment, I was surprised to see that the iText 7 LocationTextExtractionStrategy does not anymore contain something akin to the iText 5 LocationTextExtractionStrategy method GetResultantText(TextChunkFilter). This would have allowed you to parse the page once and extract text from text pieces in arbitrary page areas out of the box.

But it is possible to bring back that feature. One option for this would be to add it to a copy of the LocationTextExtractionStrategy. This would be kind of a long answer here, though. So I used another option: I use the existing LocationTextExtractionStrategy, and merely for the GetResultantText call I manipulate the underlying list of text chunks of the strategy. Instead of a generic TextChunkFilter interface I restricted filtering to the criteria at hand, the filtering by rectangular area.

public static class ReaderExtensions
{
    public static string[] ExtractText(this PdfPage page, params Rectangle[] rects)
    {
        var textEventListener = new LocationTextExtractionStrategy();
        PdfTextExtractor.GetTextFromPage(page, textEventListener);
        string[] result = new string[rects.Length];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = textEventListener.GetResultantText(rects[i]);
        }
        return result;
    }

    public static String GetResultantText(this LocationTextExtractionStrategy strategy, Rectangle rect)
    {
        IList<TextChunk> locationalResult = (IList<TextChunk>)locationalResultField.GetValue(strategy);
        List<TextChunk> nonMatching = new List<TextChunk>();
        foreach (TextChunk chunk in locationalResult)
        {
            ITextChunkLocation location = chunk.GetLocation();
            Vector start = location.GetStartLocation();
            Vector end = location.GetEndLocation();
            if (!rect.IntersectsLine(start.Get(Vector.I1), start.Get(Vector.I2), end.Get(Vector.I1), end.Get(Vector.I2)))
            {
                nonMatching.Add(chunk);
            }
        }
        nonMatching.ForEach(c => locationalResult.Remove(c));
        try
        {
            return strategy.GetResultantText();
        }
        finally
        {
            nonMatching.ForEach(c => locationalResult.Add(c));
        }
    }

    static FieldInfo locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);
}

The central extension is the LocationTextExtractionStrategy extension which takes a LocationTextExtractionStrategy which already contains the information from a page, restricts these information to those in a given rectangle, extracts the text, and returns the information to the previous state. This requires some reflection; I hope that is ok for you.



标签: c# .net pdf itext