Extraction of images present inside a paragraph

Posted 2019-01-20 18:12

Question:

I am building an application in which I need to parse a PDF generated by another system and use the parsed information to populate my application's database columns. Unfortunately, the PDF structure I am dealing with has a column called comments that contains both text and images. I have found a way to read the images and the text separately from the PDF, but my ultimate aim is to put a placeholder, something like {2}, in place of each image inside the parsed content. Whenever my parser (the application code) parses such a line, the system will render the appropriate image in that spot; the images are also stored in a separate table inside my application. Please help me resolve this problem.

Thanks in advance.

Answer 1:

As already mentioned in the comments, one solution is to use a customized text extraction strategy that inserts a placeholder text chunk such as "[2]" at the coordinates of the image.

Code

You can e.g. extend the LocationTextExtractionStrategy like this:

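import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.file.Files;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.Vector;

class SimpleMixedExtractionStrategy extends LocationTextExtractionStrategy
{
    SimpleMixedExtractionStrategy(File outputPath, String name)
    {
        this.outputPath = outputPath;
        this.name = name;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void renderImage(final ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) return;

            // Store the image under a numbered file name
            int number = counter++;
            final String filename = String.format("%s-%s.%s", name, number, image.getFileType());
            Files.write(new File(outputPath, filename).toPath(), image.getImageAsBytes());

            // The unit line transformed by the image matrix is the bottom edge of the image
            LineSegment segment = UNIT_LINE.transformBy(renderInfo.getImageCTM());
            TextChunk location = new TextChunk("[" + filename + "]", segment.getStartPoint(), segment.getEndPoint(), 0f);

            // locationalResult is private in the base class, so access it via reflection
            Field field = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            field.setAccessible(true);
            List<TextChunk> locationalResult = (List<TextChunk>) field.get(this);
            locationalResult.add(location);
        }
        catch (IOException | NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
        }
    }

    final File outputPath;
    final String name;
    int counter = 0;

    final static LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1), new Vector(1, 0, 1));
}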

(Unfortunately for this kind of work, some members of LocationTextExtractionStrategy are private. Thus, I used some Java reflection. Alternatively you can copy the whole class and change your copy accordingly.)

Example

Using that strategy you can extract mixed contents like this:

@Test
public void testSimpleMixedExtraction() throws IOException
{
    // try-with-resources closes the stream even if parsing fails
    try (InputStream resourceStream = getClass().getResourceAsStream("book-of-vaadin-page14.pdf"))
    {
        PdfReader reader = new PdfReader(resourceStream);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        SimpleMixedExtractionStrategy listener = new SimpleMixedExtractionStrategy(OUTPUT_PATH, "book-of-vaadin-page14");
        // Process page 1; the listener stores the images and collects the text plus markers
        parser.processContent(1, listener);
        // Use an explicit charset instead of the platform default
        Files.write(new File(OUTPUT_PATH, "book-of-vaadin-page14.txt").toPath(), listener.getResultantText().getBytes("UTF-8"));
        reader.close();
    }
}

For example, for my test file (which contains page 14 of the Book of Vaadin), you get this text:

Getting Started with Vaadin
• A version of Book of Vaadin that you can browse in the Eclipse Help system.
You can install the plugin as follows:
1. Start Eclipse.
2. Select Help   Software Updates....
3. Select the Available Software tab.
4. Add the Vaadin plugin update site by clicking Add Site....
[book-of-vaadin-page14-0.png]
Enter the URL of the Vaadin Update Site: http://vaadin.com/eclipse and click OK. The
Vaadin site should now appear in the Software Updates window.
5. Select all the Vaadin plugins in the tree.
[book-of-vaadin-page14-1.png]
Finally, click Install.
Detailed and up-to-date installation instructions for the Eclipse plugin can be found at http://vaad-
in.com/eclipse.
Updating the Vaadin Plugin
If you have automatic updates enabled in Eclipse (see Window   Preferences   Install/Update
  Automatic Updates), the Vaadin plugin will be updated automatically along with other plugins.
Otherwise, you can update the Vaadin plugin (there are actually multiple plugins) manually as
follows:
1. Select Help   Software Updates..., the Software Updates and Add-ons window will
open.
2. Select the Installed Software tab.
14 Vaadin Plugin for Eclipse

and two images, book-of-vaadin-page14-0.png and book-of-vaadin-page14-1.png, in OUTPUT_PATH.
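
To get from this output back to the {2}-style placeholders asked for in the question, the markers can be post-processed with plain string handling. The following is only a minimal sketch and not part of iText: the class MarkerRewriter and its methods are hypothetical names, and it assumes that the markers have the [name-number.extension] form produced by the strategy above and that no ordinary text in the PDF looks like such a marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns [name-N.ext] markers into numbered placeholders. */
final class MarkerRewriter {
    // Assumption: a marker is "[", a name ending in "-<digits>.<extension>", then "]"
    private static final Pattern MARKER =
            Pattern.compile("\\[([^\\[\\]\\r\\n]+-\\d+\\.\\w+)\\]");

    /** Returns the image file names referenced in the text, in order of appearance. */
    static List<String> imageNames(String extracted) {
        List<String> names = new ArrayList<>();
        Matcher m = MARKER.matcher(extracted);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Replaces the markers by {0}, {1}, ... in order of appearance. */
    static String toPlaceholders(String extracted) {
        Matcher m = MARKER.matcher(extracted);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement("{" + index + "}"));
            index++;
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The position of each file name in the list returned by imageNames matches the number inside the corresponding placeholder, so an application could store the names (or database keys derived from them) alongside the rewritten text.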

Improvements to make

As also already mentioned in the comments, this solution covers only the easy case in which the image has text above and/or below it but none to its left or right.

If there is also text to the left and/or right, a problem arises: the code above computes LineSegment segment as the bottom edge of the image, whereas the text extraction strategy usually works with the baseline of the text, which lies above that bottom edge.

In this case, one first has to decide at which position on which line the marker should appear in the text anyway. Having decided that, one can adapt the source above accordingly.
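
The geometry behind this can be sketched with plain arithmetic, independent of iText. The class MarkerPlacement below is a hypothetical illustration: it takes the six entries a, b, c, d, e, f of the image matrix, shows where the unit line from (0, 0) to (1, 0) ends up, and then moves that segment onto a baseline chosen by the caller. It assumes an unrotated image (b and c equal to zero), and how baselineY is determined is left open, because that is exactly the decision described above.

```java
/** Hypothetical sketch: where the image marker segment lies and how to move it. */
final class MarkerPlacement {
    /**
     * Maps the point (x, y) through the matrix [a b 0; c d 0; e f 1],
     * given as {a, b, c, d, e, f}, using the PDF row-vector convention.
     */
    static double[] transform(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** The transformed unit line, i.e. the bottom edge of the image, as {x0, y0, x1, y1}. */
    static double[] bottomEdge(double[] m) {
        double[] start = transform(m, 0, 0);
        double[] end = transform(m, 1, 0);
        return new double[] { start[0], start[1], end[0], end[1] };
    }

    /**
     * Keeps the horizontal extent of the image but places the segment on the
     * given baseline. Assumption: the image is not rotated or skewed.
     */
    static double[] onBaseline(double[] m, double baselineY) {
        double[] bottom = bottomEdge(m);
        return new double[] { bottom[0], baselineY, bottom[2], baselineY };
    }
}
```

In the strategy above, end points computed this way could be wrapped in Vector objects with a third coordinate of 1 and passed to the TextChunk constructor in place of segment.getStartPoint() and segment.getEndPoint(), so that the marker sorts into the chosen text line.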



Tags: itext