I want to get all objects except text object as an

Posted 2019-06-14 19:32

Question:

I am developing a program to convert PDF to PPTX for specific reasons using iTextSharp. What I've done so far is to get all text objects and image objects along with their locations. But I'm finding it difficult to get table objects without their text. Actually, it would be better if I could get them as images. My plan is to merge all objects except text objects into a background image and put the text objects at their proper locations. I tried to find similar questions here but have had no luck so far. If anyone knows how to do this particular job, please answer. Thanks.

Answer 1:

You say

"What I've done so far is to get all text objects and image objects and locations."

but you don't go into detail about how you do so. I assume you use a matching IRenderListener implementation.

But IRenderListener, as you found out yourself, only extracts images and text.

The main missing objects are paths and their usages.

To extract them, too, you should implement IExtRenderListener which extends IRenderListener but also retrieves information about paths. To understand the callback methods, please first be aware how path related instructions work in PDFs:

  • First there are instructions for building the actual path; these instructions essentially

    • move to some position,
    • add a line to some position from the previous position,
    • add a Bézier curve to some position from the previous position using some control points, or
    • add an upright rectangle at some position using some width and height information.
  • Then there is an optional instruction to intersect the current clip path with the generated path.

  • Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.

This corresponds to the callbacks you receive in your IExtRenderListener implementation:

/**
 * Called when the current path is being modified. E.g. new segment is being added,
 * new subpath is being started etc.
 *
 * @param renderInfo Contains information about the path segment being added to the current path.
 */
void ModifyPath(PathConstructionRenderInfo renderInfo);

is called once or more often to build the actual path, PathConstructionRenderInfo containing the actual instruction type in its Operation property (compare to the PathConstructionRenderInfo constant members MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that currently is set to be applied to all drawing operations.
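To place the path on the page you typically have to apply that transformation to the coordinates from SegmentData yourself. The following is only a sketch of the arithmetic in plain C#, assuming you have already read the six numbers [a b c d e f] of the affine transformation from the Ctm value. CtmMath, Apply and RectangleCorners are hypothetical names, not iTextSharp members, and the operand order x, y, width, height for a rectangle is an assumption based on the PDF rectangle instruction.

```csharp
internal static class CtmMath
{
    // Applies a PDF affine transformation, given by its six numbers
    // [a b c d e f], to the point (x, y):
    //   x' = a*x + c*y + e
    //   y' = b*x + d*y + f
    public static float[] Apply(float a, float b, float c, float d, float e, float f,
                                float x, float y)
    {
        return new[] { a * x + c * y + e, b * x + d * y + f };
    }

    // Transforms the four corners of an upright rectangle (x, y, width, height).
    // After a rotating or skewing transformation the result is no longer
    // upright, which is why all four corners are returned separately.
    public static float[][] RectangleCorners(float a, float b, float c, float d, float e, float f,
                                             float x, float y, float width, float height)
    {
        return new[]
        {
            Apply(a, b, c, d, e, f, x, y),
            Apply(a, b, c, d, e, f, x + width, y),
            Apply(a, b, c, d, e, f, x + width, y + height),
            Apply(a, b, c, d, e, f, x, y + height)
        };
    }
}
```

For example, with a transformation that scales by 2 and translates by (10, 20), CtmMath.Apply(2, 0, 0, 2, 10, 20, 3, 4) maps the point (3, 4) to (16, 28).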

Then

/**
 * Called when the current path should be set as a new clipping path.
 *
 * @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
 */
void ClipPath(int rule); 

is called if the current clip path shall be intersected with the constructed path.

Finally

/**
 * Called when the current path should be rendered.
 *
 * @param renderInfo Contains information about the current path which should be rendered.
 * @return The path which can be used as a new clipping path.
 */
Path RenderPath(PathPaintingRenderInfo renderInfo); 

is called, PathPaintingRenderInfo containing the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.
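Put together, a listener that merely records the paths could look roughly like the sketch below. It is not a finished solution: PathSegment, PaintedPath and PathCollector are hypothetical names invented here, the code assumes that STROKE and FILL can be tested as bit flags in Operation, that SegmentData is a list of floats which may be null for instructions without operands, and that returning null from RenderPath is acceptable when you don't need the clip path. It does not draw anything either; turning the collected paths into a background image is left to the drawing API of your choice.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical container types for this sketch; they are not part of iTextSharp.
internal class PathSegment
{
    public int Operation;  // a PathConstructionRenderInfo constant (MOVETO, LINETO, ...)
    public float[] Data;   // raw operands; Ctm still has to be applied to them
    public Matrix Ctm;     // transformation in effect when the segment was added
}

internal class PaintedPath
{
    public List<PathSegment> Segments;
    public bool Stroke;
    public bool Fill;
    public int Rule;
    public float LineWidth;
    public bool AlsoClips; // true if ClipPath was called for this path
}

internal class PathCollector : IExtRenderListener
{
    public readonly List<PaintedPath> Paths = new List<PaintedPath>();

    private List<PathSegment> _segments = new List<PathSegment>();
    private bool _clipRequested;

    // Called once per path construction instruction.
    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        // Assumption: there may be no operands at all, e.g. when a subpath is closed.
        var data = renderInfo.SegmentData;
        _segments.Add(new PathSegment
        {
            Operation = renderInfo.Operation,
            Data = data == null ? new float[0] : new List<float>(data).ToArray(),
            Ctm = renderInfo.Ctm
        });
    }

    // Called if the current clip path is to be intersected with the path built so far.
    public void ClipPath(int rule)
    {
        _clipRequested = true;
    }

    // Called for the drawing instruction that ends the current path.
    public Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        Paths.Add(new PaintedPath
        {
            Segments = _segments,
            // Assumption: STROKE and FILL are bit flags combined in Operation.
            Stroke = (renderInfo.Operation & PathPaintingRenderInfo.STROKE) != 0,
            Fill = (renderInfo.Operation & PathPaintingRenderInfo.FILL) != 0,
            Rule = renderInfo.Rule,
            LineWidth = renderInfo.LineWidth,
            AlsoClips = _clipRequested
        });

        // The drawing instruction ends the path, so start collecting a new one.
        _segments = new List<PathSegment>();
        _clipRequested = false;
        return null; // this sketch does not build a clip path
    }

    // Text and images are ignored here; handle them as you already do.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

You would run it per page just like any other render listener, e.g. by passing a PathCollector instance to PdfReaderContentParser.ProcessContent together with the page number, and afterwards inspect its Paths list. Paths that are neither stroked nor filled still show up in the list, so filter on Stroke and Fill if you only want visible table lines and backgrounds.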



Answer 2:

Try implementing IRenderListener:

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageExtractor : IRenderListener
{
    private int _currentPage = 1;
    private int _imageCount = 0;
    private readonly string _outputFilePrefix;
    private readonly string _outputFolder;
    private readonly bool _overwriteExistingFiles;

    private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
    }

    /// <summary>
    /// Extract all images from a PDF file
    /// </summary>
    /// <param name="pdfPath">Full path and file name of PDF file</param>
    /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
    /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
    /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
    /// <returns>Count of number of images extracted.</returns>
    public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;

        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);

        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");

            var pdfParser = new PdfReaderContentParser(pdfReader);

            while (instance._currentPage <= pdfReader.NumberOfPages)
            {
                pdfParser.ProcessContent(instance._currentPage, instance);

                instance._currentPage++;
            }
        }

        return instance._imageCount;
    }

    #region Implementation of IRenderListener

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        var imageObject = renderInfo.GetImage();

        // Number the files so that each image gets its own file, and append the
        // file type iTextSharp reports for the image data (e.g. "jpg" or "png").
        var imageFileName = _outputFilePrefix + _imageCount + "." + imageObject.GetFileType();
        var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);

        if (_overwriteExistingFiles || !File.Exists(imagePath))
        {
            var imageRawBytes = imageObject.GetImageAsBytes();
            // Create the image file (or overwrite an existing one)
            File.WriteAllBytes(imagePath, imageRawBytes);
        }

        _imageCount++;
    }

    #endregion // Implementation of IRenderListener

}


Tags: c# pdf itext