I have a pdf that was generated from scanning software. The pdf has 1 TIFF image per page. I want to extract the TIFF image from each page.
I am using iTextSharp and I have successfully found the images and can get back the raw bytes from the PdfReader.GetStreamBytesRaw
method. The problem is, as many before me have discovered, iTextSharp does not contain a PdfReader.CCITTFaxDecode
method.
What else do I know? Even without iTextSharp I can open the pdf in notepad and find the streams with /Filter /CCITTFaxDecode
and I know from the /DecodeParams
that it is using CCITTFaxDecode group 4.
Does anyone out there know how I can get the CCITTFaxDecode filter images out of my pdf?
Cheers,
Kahu
Actually, vbcrlfuser's answer did help me, but the code was not quite correct for the current version of BitMiracle.LibTiff.NET, as I could download it. In the current version, equivalent code looks like this:
using iTextSharp.text.pdf;
using BitMiracle.LibTiff.Classic;
...
Tiff tiff = Tiff.Open("C:\\test.tif", "w");
tiff.SetField(TiffTag.IMAGEWIDTH, UInt32.Parse(pd.Get(PdfName.WIDTH).ToString()));
tiff.SetField(TiffTag.IMAGELENGTH, UInt32.Parse(pd.Get(PdfName.HEIGHT).ToString()));
tiff.SetField(TiffTag.COMPRESSION, Compression.CCITTFAX4);
tiff.SetField(TiffTag.BITSPERSAMPLE, UInt32.Parse(pd.Get(PdfName.BITSPERCOMPONENT).ToString()));
tiff.SetField(TiffTag.SAMPLESPERPIXEL, 1);
tiff.WriteRawStrip(0, raw, raw.Length);
tiff.Close();
Using the above code, I finally got a valid Tiff file in C:\test.tif. Thank you, vbcrlfuser!
This library... http://www.bitmiracle.com/libtiff/ and this example below should get you 99% of the way there
string filter = pd.Get(PdfName.FILTER).ToString();
string width = pd.Get(PdfName.WIDTH).ToString();
string height = pd.Get(PdfName.HEIGHT).ToString();
string bpp = pd.Get(PdfName.BITSPERCOMPONENT).ToString();
switch (filter)
{
case "/CCITTFaxDecode":
byte[] data = PdfReader.GetStreamBytesRaw((PRStream)pdfStream);
int tiff = TIFFOpen("example.tif", "w");
TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEWIDTH,(uint)Int32.Parse(width));
TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEHEIGHT, (uint)Int32.Parse(height));
TIFFSetField(tiff, (uint)BitMiarcle.LibTiff.Classic.TiffTag.COMPRESSION, (uint)BitMiracle.Libtiff.Classic.Compression.CCITTFAX4);
TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.BITSPERSAMPLE, (uint)Int32.Parse(bpp));
TIFFSetField(tiff, (uint)BitMiarcle.Libtiff.Classic.TiffTag.SAMPLESPERPIXEL,1 );
IntPtr pointer = Marshal.AllocHGlobal(data.length);
Marshal.copy(data, 0, pointer, data.length);
TIFFWriteRawStrip(tiff, 0, pointer, data.length);
TIFFClose(tiff);
break;
break;
}
Here is python implementation:
import PyPDF2
import struct
"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""
def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
return struct.pack(tiff_header_struct,
b'II', # Byte order indication: Little indian
42, # Version number (always 42)
8, # Offset to first IFD
8, # Number of tags in IFD
256, 4, 1, width, # ImageWidth, LONG, 1, width
257, 4, 1, height, # ImageLength, LONG, 1, lenght
258, 3, 1, 1, # BitsPerSample, SHORT, 1, 1
259, 3, 1, CCITT_group, # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
262, 3, 1, 0, # Threshholding, SHORT, 1, 0 = WhiteIsZero
273, 4, 1, struct.calcsize(tiff_header_struct), # StripOffsets, LONG, 1, len of header
278, 4, 1, height, # RowsPerStrip, LONG, 1, lenght
279, 4, 1, img_size, # StripByteCounts, LONG, 1, size of image
0 # last IFD
)
pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
page = cond_scan_reader.getPage(i)
xObject = page['/Resources']['/XObject'].getObject()
for obj in xObject:
if xObject[obj]['/Subtype'] == '/Image':
"""
The CCITTFaxDecode filter decodes image data that has been encoded using
either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
designed to achieve efficient compression of monochrome (1 bit per pixel) image
data at relatively low resolutions, and so is useful only for bitmap image data, not
for color images, grayscale images, or general data.
K < 0 --- Pure two-dimensional encoding (Group 4)
K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
"""
if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
if xObject[obj]['/DecodeParms']['/K'] == -1:
CCITT_group = 4
else:
CCITT_group = 3
width = xObject[obj]['/Width']
height = xObject[obj]['/Height']
data = xObject[obj]._data # sorry, getData() does not work for CCITTFaxDecode
img_size = len(data)
tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
img_name = obj[1:] + '.tiff'
with open(img_name, 'wb') as img_file:
img_file.write(tiff_header + data)
#
# import io
# from PIL import Image
# im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
It has been written extension for this (c#).
PdfDictionary item;
if (item.IsImage()) {
Image image = item.ToImage();
}