How to get the text position from the pdf page in

Reply #2 · 2019-04-13 23:43

@Joris' answer explains how to implement a completely new extraction strategy / event listener for the task. Alternatively, one can try to tweak an existing text extraction strategy to do what you require.

This answer demonstrates how to tweak the existing LocationTextExtractionStrategy to return both the text and its characters' respective y coordinates.

Beware, this is but a proof-of-concept which in particular assumes text to be written horizontally, i.e. using an effective transformation matrix (ctm and text matrix combined) with b and c equal to 0. Furthermore the character and coordinate retrieval methods of TextPlusY are not at all optimized and might take long to execute.
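To make the horizontal-text assumption concrete, here is a minimal sketch in plain Java (no iText involved). It assumes the usual PDF matrix notation [a b c d e f]; the class AffineCheck and its methods are hypothetical helpers made up for this illustration. It combines a text matrix with a ctm and tests whether b and c of the product are (close to) 0:

```java
/** Hypothetical helper working on PDF-style matrices [a b c d e f]; not an iText class. */
final class AffineCheck {

    /** Returns first x second, using the row-vector convention of PDF matrices. */
    static double[] multiply(double[] m1, double[] m2) {
        return new double[] {
            m1[0] * m2[0] + m1[1] * m2[2],          // a
            m1[0] * m2[1] + m1[1] * m2[3],          // b
            m1[2] * m2[0] + m1[3] * m2[2],          // c
            m1[2] * m2[1] + m1[3] * m2[3],          // d
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],  // e
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5]   // f
        };
    }

    /** True if b and c are 0 within eps, i.e. the matrix neither rotates nor skews. */
    static boolean isAxisAligned(double[] m, double eps) {
        return Math.abs(m[1]) <= eps && Math.abs(m[2]) <= eps;
    }

    public static void main(String[] args) {
        double[] textMatrix = {12, 0, 0, 12, 72, 700}; // scale by 12, move to (72, 700)
        double[] identityCtm = {1, 0, 0, 1, 0, 0};
        double[] rotatedCtm = {0, 1, -1, 0, 0, 0};     // rotation by 90 degrees

        System.out.println(isAxisAligned(multiply(textMatrix, identityCtm), 1e-6));
        System.out.println(isAxisAligned(multiply(textMatrix, rotatedCtm), 1e-6));
    }
}
```

The first combination passes the check, the rotated one does not; for text like the latter the single y coordinate stored per chunk would not describe all of its characters.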

As the OP did not express a language preference, here is a solution for iText 7 for Java:

TextPlusY

For the task at hand one needs to be able to retrieve characters and y coordinates side by side. To make this easier I use a class representing both the text and its characters' respective y coordinates. It implements CharSequence, a generalization of String, which allows it to be used in many String-related functions:

import java.util.ArrayList;
import java.util.List;

public class TextPlusY implements CharSequence
{
    final List<String> texts = new ArrayList<>();
    final List<Float> yCoords = new ArrayList<>();

    //
    // CharSequence implementation
    //
    @Override
    public int length()
    {
        int length = 0;
        for (String text : texts)
        {
            length += text.length();
        }
        return length;
    }

    @Override
    public char charAt(int index)
    {
        for (String text : texts)
        {
            if (index < text.length())
            {
                return text.charAt(index);
            }
            index -= text.length();
        }
        throw new IndexOutOfBoundsException();
    }

    @Override
    public CharSequence subSequence(int start, int end)
    {
        TextPlusY result = new TextPlusY();
        int length = end - start;
        for (int i = 0; i < yCoords.size(); i++)
        {
            String text = texts.get(i);
            if (start < text.length())
            {
                float yCoord = yCoords.get(i); 
                if (start > 0)
                {
                    text = text.substring(start);
                    start = 0;
                }
                if (length > text.length())
                {
                    result.add(text, yCoord);
                    // the remaining segments only have to supply the rest
                    length -= text.length();
                }
                else
                {
                    result.add(text.substring(0, length), yCoord);
                    break;
                }
            }
            else
            {
                start -= text.length();
            }
        }
        return result;
    }

    //
    // Object overrides
    //
    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder();
        for (String text : texts)
        {
            builder.append(text);
        }
        return builder.toString();
    }

    //
    // y coordinate support
    //
    public TextPlusY add(String text, float y)
    {
        if (text != null)
        {
            texts.add(text);
            yCoords.add(y);
        }
        return this;
    }

    public float yCoordAt(int index)
    {
        for (int i = 0; i < yCoords.size(); i++)
        {
            String text = texts.get(i);
            if (index < text.length())
            {
                return yCoords.get(i);
            }
            index -= text.length();
        }
        throw new IndexOutOfBoundsException();
    }
}

(TextPlusY.java)
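As mentioned, the lookups in TextPlusY walk all segments on every call. If that turns out to be too slow, one option is to keep the whole text in a single StringBuilder and remember where each segment starts, so the segment for a character index can be found by binary search. The following is only a sketch of that idea; IndexedTextPlusY is a hypothetical name and not part of the original code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical variant of TextPlusY with O(1) charAt and O(log n) yCoordAt. */
final class IndexedTextPlusY implements CharSequence {
    private final StringBuilder text = new StringBuilder();
    private final List<Integer> segmentStarts = new ArrayList<>(); // offset of each segment
    private final List<Float> yCoords = new ArrayList<>();         // y of each segment

    public IndexedTextPlusY add(String segment, float y) {
        // empty segments are skipped so that the start offsets stay strictly increasing
        if (segment != null && !segment.isEmpty()) {
            segmentStarts.add(text.length());
            yCoords.add(y);
            text.append(segment);
        }
        return this;
    }

    @Override
    public int length() {
        return text.length();
    }

    @Override
    public char charAt(int index) {
        return text.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IndexOutOfBoundsException("start " + start + ", end " + end);
        }
        IndexedTextPlusY result = new IndexedTextPlusY();
        for (int i = 0; i < segmentStarts.size(); i++) {
            int segStart = segmentStarts.get(i);
            int segEnd = i + 1 < segmentStarts.size() ? segmentStarts.get(i + 1) : text.length();
            int from = Math.max(start, segStart);
            int to = Math.min(end, segEnd);
            if (from < to) {
                result.add(text.substring(from, to), yCoords.get(i));
            }
        }
        return result;
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public float yCoordAt(int index) {
        if (index < 0 || index >= text.length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int pos = Collections.binarySearch(segmentStarts, index);
        // a miss yields -(insertionPoint) - 1; the segment containing index is insertionPoint - 1
        int segment = pos >= 0 ? pos : -pos - 2;
        return yCoords.get(segment);
    }
}
```

Since it still implements CharSequence, it should be usable with Pattern / Matcher in the same way as TextPlusY.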

TextPlusYExtractionStrategy

Now we extend the LocationTextExtractionStrategy to extract a TextPlusY instead of a String. All we need for that is to generalize the method getResultantText.

Unfortunately the LocationTextExtractionStrategy has hidden some methods and members (private or package protected) which need to be accessed here; thus, some reflection magic is required. If your framework does not allow this, you'll have to copy the whole strategy and manipulate it accordingly.

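// iText package names below are assumed from iText 7.1.x; adjust them if your version differs.
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

import com.itextpdf.kernel.geom.Vector;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import com.itextpdf.kernel.pdf.canvas.parser.listener.TextChunk;

public class TextPlusYExtractionStrategy extends LocationTextExtractionStrategy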
{
    static Field locationalResultField;
    static Method sortWithMarksMethod;
    static Method startsWithSpaceMethod;
    static Method endsWithSpaceMethod;

    static Method textChunkSameLineMethod;

    static
    {
        try
        {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            sortWithMarksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("sortWithMarks", List.class);
            sortWithMarksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace", String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);

            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        }
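        catch(NoSuchFieldException | NoSuchMethodException | SecurityException e)
        {
            // Fail fast: with missing handles every later extraction call would end in a NullPointerException
            throw new ExceptionInInitializerError(e);
        }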
    }

    //
    // constructors
    //
    public TextPlusYExtractionStrategy()
    {
        super();
    }

    public TextPlusYExtractionStrategy(ITextChunkLocationStrategy strat)
    {
        super(strat);
    }

    @Override
    public String getResultantText()
    {
        return getResultantTextPlusY().toString();
    }

    public TextPlusY getResultantTextPlusY()
    {
        try
        {
            List<TextChunk> textChunks = new ArrayList<>((List<TextChunk>)locationalResultField.get(this));
            sortWithMarksMethod.invoke(this, textChunks);

            TextPlusY textPlusY = new TextPlusY();
            TextChunk lastChunk = null;
            for (TextChunk chunk : textChunks)
            {
                float chunkY = chunk.getLocation().getStartLocation().get(Vector.I2);
                if (lastChunk == null)
                {
                    textPlusY.add(chunk.getText(), chunkY);
                }
                else if ((Boolean)textChunkSameLineMethod.invoke(chunk, lastChunk))
                {
                    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                    if (isChunkAtWordBoundary(chunk, lastChunk) &&
                            !(Boolean)startsWithSpaceMethod.invoke(this, chunk.getText()) &&
                            !(Boolean)endsWithSpaceMethod.invoke(this, lastChunk.getText()))
                    {
                        textPlusY.add(" ", chunkY);
                    }

                    textPlusY.add(chunk.getText(), chunkY);
                }
                else
                {
                    textPlusY.add("\n", lastChunk.getLocation().getStartLocation().get(Vector.I2));
                    textPlusY.add(chunk.getText(), chunkY);
                }
                lastChunk = chunk;
            }

            return textPlusY;
        }
        catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e)
        {
            throw new RuntimeException("Reflection failed", e);
        }
    }
}

(TextPlusYExtractionStrategy.java)
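If the reflection part looks opaque, the pattern used in the static initializer can be tried in isolation. The sketch below uses a made-up class HiddenState as a stand-in for a library class with private members; both class names are hypothetical and nothing here touches iText:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReflectionAccessSketch {

    /** Reads the private field "chunks" of the given instance. */
    @SuppressWarnings("unchecked")
    static List<String> readChunks(HiddenState target) {
        try {
            Field field = HiddenState.class.getDeclaredField("chunks");
            field.setAccessible(true); // lift the private access check
            return (List<String>) field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflection failed", e);
        }
    }

    /** Calls the private method startsWithSpace(String) on the given instance. */
    static boolean callStartsWithSpace(HiddenState target, String s) {
        try {
            Method method = HiddenState.class.getDeclaredMethod("startsWithSpace", String.class);
            method.setAccessible(true);
            return (Boolean) method.invoke(target, s);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflection failed", e);
        }
    }

    public static void main(String[] args) {
        HiddenState state = new HiddenState();
        for (String chunk : readChunks(state)) {
            System.out.println("[" + chunk + "] starts with space: "
                    + callStartsWithSpace(state, chunk));
        }
    }
}

/** Hypothetical stand-in for a library class that keeps its state and helpers private. */
class HiddenState {
    private final List<String> chunks = new ArrayList<>(Arrays.asList("Lorem", " ipsum"));

    private boolean startsWithSpace(String s) {
        return !s.isEmpty() && s.charAt(0) == ' ';
    }
}
```

Keep in mind that setAccessible can be refused, for example under a security manager or, on newer Java versions, when the target class sits in a module that does not open its package. That is the situation in which copying the whole strategy is the fallback.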

Usage

Using these two classes you can extract text with coordinates and search therein like this:

try (   PdfReader reader = new PdfReader(YOUR_PDF);
        PdfDocument document = new PdfDocument(reader)  )
{
    TextPlusYExtractionStrategy extractionStrategy = new TextPlusYExtractionStrategy();
    PdfPage page = document.getFirstPage();

    PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
    parser.processPageContent(page);
    TextPlusY textPlusY = extractionStrategy.getResultantTextPlusY();

    System.out.printf("\nText from test.pdf\n=====\n%s\n=====\n", textPlusY);

    System.out.print("\nText with y from test.pdf\n=====\n");

    int length = textPlusY.length();
    float lastY = Float.NaN; // NaN never equals a real y, so the first character always prints its coordinate
    for (int i = 0; i < length; i++)
    {
        float y = textPlusY.yCoordAt(i);
        if (y != lastY)
        {
            System.out.printf("\n(%4.1f) ", y);
            lastY = y;
        }
        System.out.print(textPlusY.charAt(i));
    }
    System.out.print("\n=====\n");

    System.out.print("\nMatches of 'est' with y from test.pdf\n=====\n");
    Matcher matcher = Pattern.compile("est").matcher(textPlusY);
    while (matcher.find())
    {
        System.out.printf("from character %s to %s at y position (%4.1f)\n", matcher.start(), matcher.end(), textPlusY.yCoordAt(matcher.start()));
    }
    System.out.print("\n=====\n");
}

(ExtractTextPlusY test method testExtractTextPlusYFromTest)

For my test document, the output of the test code above is:

Text from test.pdf
=====
Ein Dokumen t mit einigen
T estdaten
T esttest T est test test
=====

Text with y from test.pdf
=====

(691,8) Ein Dokumen t mit einigen

(666,9) T estdaten

(642,0) T esttest T est test test
=====

Matches of 'est' with y from test.pdf
=====
from character 28 to 31 at y position (666,9)
from character 39 to 42 at y position (642,0)
from character 43 to 46 at y position (642,0)
from character 49 to 52 at y position (642,0)
from character 54 to 57 at y position (642,0)
from character 59 to 62 at y position (642,0)

=====

My locale uses the comma as decimal separator, so you might see 666.9 instead of 666,9.
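If you want the printed coordinates to look the same on every machine, you can pass an explicit Locale when formatting. A small sketch (the class and method names are made up for this example):

```java
import java.util.Locale;

final class LocaleFormatSketch {

    /** Formats a y coordinate like the test code does, but with an explicit locale. */
    static String formatY(float y, Locale locale) {
        return String.format(locale, "(%4.1f)", y);
    }

    public static void main(String[] args) {
        System.out.println(formatY(666.9f, Locale.GERMANY)); // (666,9)
        System.out.println(formatY(666.9f, Locale.ROOT));    // (666.9)
    }
}
```

System.out.printf has an overload taking a Locale as its first argument, so the test code above could be changed the same way.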

The extra spaces you see can be removed by fine-tuning the base LocationTextExtractionStrategy functionality further. But that is the focus of other questions...


迷人小祖宗

Reply #3 · 2019-04-13 23:53

First, SimpleTextExtractionStrategy is not exactly the 'smartest' strategy (as the name would suggest).

Second, if you want the position you're going to have to do a lot more work. TextExtractionStrategy assumes you are only interested in the text.

Possible implementation:

1. implement IEventListener
2. get notified for all events that render text, and store the corresponding TextRenderInfo object
3. once you're finished with the document, sort these objects based on their position in the page
4. loop over this list of TextRenderInfo objects; they offer both the text being rendered and the coordinates

How to:

1. implement ITextExtractionStrategy (or extend an existing implementation)
2. use PdfTextExtractor.getTextFromPage(doc.getPage(pageNr), strategy), where strategy denotes the strategy you created in step 1
3. your strategy should be set up to keep track of locations for the text it processed

ITextExtractionStrategy has the following method in its interface:

@Override
public void eventOccurred(IEventData data, EventType type) {

    // you can first check the type of the event
    if (!type.equals(EventType.RENDER_TEXT))
        return;

    // now it is safe to cast
    TextRenderInfo renderInfo = (TextRenderInfo) data;
}

Important to keep in mind is that rendering instructions in a PDF do not need to appear in order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

render "Ipsum Do"
render "Lorem "
render "lor Sit Amet"

You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are), and sorting (to get all the TextRenderInfo objects in the proper reading order).

Once that's done, it should be easy.
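To give an idea of the sorting part, here is a rough sketch in plain Java. It does not use iText; Chunk is a hypothetical stand-in for whatever you store per TextRenderInfo (text plus the start of its baseline), and the line tolerance is an arbitrary assumption. Chunks are grouped into lines by their y coordinate (top of the page first) and ordered left to right inside each line. Inserting spaces based on the gap between neighbouring chunks is left out:

```java
import java.util.ArrayList;
import java.util.List;

final class ReadingOrderSketch {

    /** Hypothetical record of one rendered piece of text and where its baseline starts. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks in reading order; chunks within lineTolerance in y share a line. */
    static String inReadingOrder(List<Chunk> chunks, float lineTolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((p, q) -> Float.compare(q.y, p.y)); // larger y first, i.e. top of the page

        StringBuilder out = new StringBuilder();
        List<Chunk> line = new ArrayList<>();
        for (Chunk chunk : sorted) {
            // a chunk too far below the first chunk of the current line starts a new line
            if (!line.isEmpty() && Math.abs(line.get(0).y - chunk.y) > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(chunk);
        }
        appendLine(out, line);
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Chunk> line) {
        if (line.isEmpty()) {
            return;
        }
        line.sort((p, q) -> Float.compare(p.x, q.x)); // left to right
        if (out.length() > 0) {
            out.append('\n');
        }
        for (Chunk chunk : line) {
            out.append(chunk.text);
        }
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Ipsum Do", 108f, 700f));
        chunks.add(new Chunk("Lorem ", 72f, 700f));
        chunks.add(new Chunk("lor Sit Amet", 160f, 700.2f));
        System.out.println(inReadingOrder(chunks, 1f)); // Lorem Ipsum Dolor Sit Amet
    }
}
```

Like the other answer, this assumes horizontally written text; rotated text would need the baseline direction taken into account.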


干净又极端

Reply #4 · 2019-04-13 23:59

I was able to adapt my earlier version for iText 5 to do this. I don't know whether you are looking for C#, but that is what the code below is written in.

using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class TextLocationStrategy : LocationTextExtractionStrategy
{
    private List<textChunk> objectResult = new List<textChunk>();

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;

        TextRenderInfo renderInfo = (TextRenderInfo)data;

        string curFont = renderInfo.GetFont().GetFontProgram().ToString();

        float curFontSize = renderInfo.GetFontSize();

        IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
        foreach (TextRenderInfo t in text)
        {
            string letter = t.GetText();
            Vector letterStart = t.GetBaseline().GetStartPoint();
            Vector letterEnd = t.GetAscentLine().GetEndPoint();
            Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));

            if (letter != " " && !letter.Contains(' '))
            {
                textChunk chunk = new textChunk();
                chunk.text = letter;
                chunk.rect = letterRect;
                chunk.fontFamily = curFont;
                chunk.fontSize = curFontSize;
                chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;

                objectResult.Add(chunk);
            }
        }
    }
}
public class textChunk
{
    public string text { get; set; }
    public Rectangle rect { get; set; }
    public string fontFamily { get; set; }
    public float fontSize { get; set; } // float, matching TextRenderInfo.GetFontSize()
    public float spaceWidth { get; set; }
}

I also get down to each individual character because it works better for my process. You can manipulate the names, and of course the objects, but I created the textchunk to hold what I wanted, rather than have a bunch of renderInfo objects.

You can implement this by adding a few lines to grab the data from your pdf.

PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextLocationStrategy());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));

Once you are this far, you can pull the objectResult from the strat by making it public or creating a method within your class to grab the objectResult and do something with it.
