Extracting PDF images in the correct order with iTextSharp

Posted 2019-09-04 11:54

Question:

I'm trying to extract images from a PDF file, but I need them in the correct order so that I can pick out the right image.

    static void Main(string[] args)
    {
        string filename = "D:\\910723575_marca_coletiva.pdf";

        PdfReader pdfReader = new PdfReader(filename);

        var imagemList = ExtraiImagens(pdfReader);

        // convert each byte[] to a bitmap
        List<Bitmap> bmpSrcList = new List<Bitmap>();
        IList<byte[]> imagensProcessadas = new List<byte[]>();

        foreach (var imagem in imagemList)
        {

            System.Drawing.ImageConverter converter = new System.Drawing.ImageConverter();
            try
            {
                Image img = (Image)converter.ConvertFrom(imagem);
                ConsoleWriteImage(img);
                imagensProcessadas.Add(imagem);
            }
            catch (Exception)
            {
                continue;
            }

        }

        System.Console.ReadLine();
    }

    public static void ConsoleWriteImage(Image img)
    {
        int sMax = 39;
        decimal percent = Math.Min(decimal.Divide(sMax, img.Width), decimal.Divide(sMax, img.Height));
        Size resSize = new Size((int)(img.Width * percent), (int)(img.Height * percent));
        Func<System.Drawing.Color, int> ToConsoleColor = c =>
        {
            int index = (c.R > 128 | c.G > 128 | c.B > 128) ? 8 : 0;
            index |= (c.R > 64) ? 4 : 0;
            index |= (c.G > 64) ? 2 : 0;
            index |= (c.B > 64) ? 1 : 0;
            return index;
        };
        Bitmap bmpMin = new Bitmap(img, resSize.Width, resSize.Height);
        Bitmap bmpMax = new Bitmap(img, resSize.Width * 2, resSize.Height * 2);
        for (int i = 0; i < resSize.Height; i++)
        {
            for (int j = 0; j < resSize.Width; j++)
            {
                Console.ForegroundColor = (ConsoleColor)ToConsoleColor(bmpMin.GetPixel(j, i));
                Console.Write("██");
            }

            Console.BackgroundColor = ConsoleColor.Black;
            Console.Write("    ");

            for (int j = 0; j < resSize.Width; j++)
            {
                Console.ForegroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2, i * 2));
                Console.BackgroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2, i * 2 + 1));
                Console.Write("▀");

                Console.ForegroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2 + 1, i * 2));
                Console.BackgroundColor = (ConsoleColor)ToConsoleColor(bmpMax.GetPixel(j * 2 + 1, i * 2 + 1));
                Console.Write("▀");
            }
            System.Console.WriteLine();
        }
    }

    public static IList<byte[]> ExtraiImagens(PdfReader pdfReader) 
    {
        var data = new byte[] { };

        IList<byte[]> imagensList = new List<byte[]>();

        for (int numPag = 1; numPag <= 3; numPag++)
        //for (int numPag = 1; numPag <= pdfReader.NumberOfPages; numPag++)
        {
            var pg = pdfReader.GetPageN(numPag);
            var res = PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)) as PdfDictionary;
            var xobj = PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)) as PdfDictionary;
            if (xobj == null) continue;

            var keys = xobj.Keys;
            if (keys == null) continue;

            PdfObject obj = null;
            PdfDictionary tg = null;

            for (int key = 0; key < keys.Count; key++)
            {
                obj = xobj.Get(keys.ElementAt(key));

                // Only indirect references can be resolved to a stream below
                if (!obj.IsIndirect()) continue;

                tg = PdfReader.GetPdfObject(obj) as PdfDictionary;

                var type = PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)) as PdfName;
                if (!PdfName.IMAGE.Equals(type)) continue;

                int XrefIndex = (obj as PRIndirectReference).Number;
                var pdfStream = pdfReader.GetPdfObject(XrefIndex) as PRStream;

                data = PdfReader.GetStreamBytesRaw(pdfStream);

                imagensList.Add(data);
            }
        }

        return imagensList;
    }
}

The ConsoleWriteImage method only prints each image to the console. I used it to study the order in which iTextSharp returns the images with my code.

Any help?

Answer 1:

Unfortunately, the OP has not explained what the correct order is. This is not self-explanatory, because some aspects of a PDF are obvious only to a human viewing the rendered pages, not to a program.

Most likely, though, the OP wants the images on a page-by-page basis. That is clearly not what the current code provides: it scans the whole base of objects inside the PDF for image objects, so it does find image objects, but the order may be completely random. In particular, it may even return images that are contained in the PDF but not used on any of its pages.

To retrieve images in page-by-page order (and only images actually used), one should use the parser framework, e.g.

PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener();
for (int i = 1; i <= reader.NumberOfPages; i++) {
  parser.ProcessContent(i, listener);
} 
// Process images in the List listener.MyImages
// with names in listener.ImageNames

(Excerpt from the ExtractImages.cs iTextSharp example)

where MyImageRenderListener is defined to collect images:

public class MyImageRenderListener : IRenderListener {
    /** the byte array of the extracted images */
    private List<byte[]> _myImages;
    public List<byte[]> MyImages {
      get { return _myImages; }
    }
    /** the file names of the extracted images */
    private List<string> _imageNames;
    public List<string> ImageNames { 
      get { return _imageNames; }
    } 

    public MyImageRenderListener() {
      _myImages = new List<byte[]>();
      _imageNames = new List<string>();
    }

    [...]

    public void RenderImage(ImageRenderInfo renderInfo) {
      try {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null || image.GetImageBytesType() == PdfImageObject.ImageBytesType.JBIG2) 
          return;

        _imageNames.Add(string.Format("Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType() ) );
        _myImages.Add(image.GetImageAsBytes());
      }
      catch
      {
      }
    }

    [...]      
}

(Excerpt from MyImageRenderListener.cs iTextSharp example)
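
Once the listener has collected the images, they still have to be processed in the order they were encountered. The following is only a minimal sketch of that step, not part of the iTextSharp examples: `ImageSaver` and `SaveAll` are hypothetical names, and the numeric file-name prefix is an assumption added so that the files sort in extraction order even when the same image object is drawn more than once.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of iTextSharp.
public static class ImageSaver
{
    // Writes images[i] to outputDir under names[i], prefixed with a
    // running index so the files sort in the order they were collected.
    public static void SaveAll(IList<byte[]> images, IList<string> names, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        for (int i = 0; i < images.Count; i++)
        {
            string fileName = string.Format("{0:D4}_{1}", i + 1, names[i]);
            File.WriteAllBytes(Path.Combine(outputDir, fileName), images[i]);
        }
    }
}
```

With the listener above, this could be called as `ImageSaver.SaveAll(listener.MyImages, listener.ImageNames, "out")`, where `"out"` is just a placeholder directory.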

In addition, the ImageRenderInfo renderInfo carries information about the location and orientation of the image on the page in question, which might help to deduce the correct order the OP is after.
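
If "correct order" turns out to mean visual order on the page, the position information can be used for sorting. The sketch below is deliberately independent of iTextSharp: `PlacedImage` and `ImageOrder` are hypothetical names, and it assumes the listener has stored a page number and an x/y position for each image (for instance taken from the image's transformation matrix inside `RenderImage`). It also assumes unrotated pages with the usual PDF convention that y grows upwards, so larger y means higher on the page.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one image as captured by a render listener.
public class PlacedImage
{
    public int Page;      // 1-based page number
    public float X;       // horizontal position on the page, in PDF units
    public float Y;       // vertical position; assumed to grow upwards
    public byte[] Bytes;  // encoded image data
}

public static class ImageOrder
{
    // Orders images by page, then top to bottom, then left to right.
    public static List<PlacedImage> TopToBottom(IEnumerable<PlacedImage> images)
    {
        return images
            .OrderBy(i => i.Page)
            .ThenByDescending(i => i.Y)
            .ThenBy(i => i.X)
            .ToList();
    }
}
```

Images that sit at almost, but not exactly, the same height would need a tolerance when comparing y values; that refinement is left out here.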



Tags: c# pdf itext