Why converting this PDF file fails when using Imag

Posted 2019-08-01 15:46

Question:

I want to convert a PDF file compiled with LaTeX (using the XeLaTeX engine so that I can use an Arabic font) into an image-only PDF, so that I can upload it to the web and prevent its content from being copied and pasted.

Since I am looking for free software to do this, I came across two powerful tools for the job: ImageMagick and Ghostscript. All I need is to convert a text PDF into an image PDF in one step, preferably with batch processing (converting many PDFs in one go) if possible.

I ran this command on the command line, and it works fine for PDFs written in English:

convert someenglish.pdf output.pdf  

Now when I do the same for an Arabic PDF I get this error:

convert.exe: PDFDelegateFailed `[ghostscript library] -q -dQUIET -dSAFER -dBATCH
 -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sD
EVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72"  "-sOutputFile
=C:/Users/doctorate/AppData/Local/Temp/magick-65203BNMxTDhXtkF%d" "-fC:/Users/doctorate/Ap
pData/Local/Temp/magick-65206AK54hOoKA62" "-fC:/Users/doctorate/AppData/Local/Temp/ma
gick-6520hDn-KMyTyxy2"':    **** Error reading a content stream. The page may be
 incomplete.
   **** Incorrect object count in object stream.
Error: /rangecheck in resolveobjectstream
Operand stack:
   78424   10   1   10   --dict:7/15(L)--   26   --nostringval--   35   --nostri
ngval--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--
  --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict
:4/4(L)--   --dict:3/3(L)--   --dict:2/2(L)--   --nostringval--   --dict:7/7(L)-
-   --dict:10/10(L)--   --nostringval--   --nostringval--   Type   Font   Subtyp
e   CIDFontType2   BaseFont   MYCROL+(AH
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval-
-   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   fa
lse   1   %stopped_push   1983   1   3   %oparray_pop   1982   1   3   %oparray_
pop   1966   1   3   %oparray_pop   --nostringval--   --nostringval--   --nostri
ngval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--
  --nostringval--   --nostringval--
Dictionary stack:
   --dict:1193/1684(ro)(G)--   --dict:1/20(G)--   --dict:82/200(L)--   --dict:82
/200(L)--   --dict:116/127(ro)(G)--   --dict:280/300(ro)(G)--   --dict:24/32(L)-
-
Current allocation mode is local
GPL Ghostscript 9.15: Unrecoverable error, exit code 1
 @ error/pdf.c/InvokePDFDelegate/263.
convert.exe: no images defined `test.pdf' @ error/convert.c/ConvertImageCommand/
3210.

Question
What am I missing here? I am not a programmer, so please take that into account in your answer. I would be very grateful if you could also show how to do this as a batch process.

Notes

  • Windows 7 32bit

  • Ghostscript version 9.15

  • Image quality is not an issue for me; even 72 dpi will be fine.

  • I want to strike a balance between output size and text clarity. The text only needs to be readable on the web, not suitable for OCR, so the image doesn't need to be very sharp. Output size matters more: the smaller, the better. Honestly, I have no idea whether converting the PDF to PNG or to JPEG would work better in this case.

  • I don't want to burst a PDF into multiple serially named PNGs or JPEGs. I simply want one PDF converted to another PDF that contains images instead of text that can be copied and pasted.

Update
I tried to make a minimal working example (MWE) PDF that mimics the original, and found that the problem arises when a certain Arabic font called (AH) Manal Black is included. Running pdffonts from the command line on this MWE PDF gives:

Syntax Error (18062): Illegal character ')'
Syntax Error (18076): Dictionary key must be a name object
Syntax Error (18085): Dictionary key must be a name object
Syntax Error (18248): Illegal character ')'
Syntax Error (18248): Dictionary key must be a name object
Syntax Error (18253): Dictionary key must be a name object
Syntax Error (18599): Illegal character ')'
Syntax Error (18599): Dictionary key must be a name object
Syntax Error (18607): Dictionary key must be a name object
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
GAKHDJ+(AH                           CID TrueType      yes yes yes      5  0
HTCSVQ+Amiri-Regular                 CID TrueType      yes yes yes      7  0

When I exclude this Arabic font while compiling the document with the LaTeX/XeTeX engine, the convert command works just as it does for the English PDFs. So the problem is most probably linked to how the fonts are parsed.


Update: A minimal working example is here: https://www.dropbox.com/s/qdeuzips0ivas4q/mwe_ar.pdf?dl=0

Answer 1:

The minimal working example has PDF object no. 10 as an ObjStm (object stream), which contains this part (I edited the whitespace for readability):

<<  /Type               /Font
    /Subtype            /Type0
    /BaseFont           /GAKHDJ+#28AH)#20Manal#20Black
    /Encoding           /Identity-H
    /DescendantFonts    [4 0 R]
    /ToUnicode          12 0 R
>>

So in the font name, (AH) Manal Black, the blanks are correctly hex-escaped as #20 and the opening parenthesis ( as #28, but the closing parenthesis ) is left unescaped, although it should have been written as #29.
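
The escaping rule at issue can be sketched in a few lines of Python. This is only an illustration of the PDF name syntax, in which delimiters, whitespace, `#`, and bytes outside the printable ASCII range are written as `#` plus two hex digits. The function name `escape_pdf_name` is made up for this sketch and is not taken from XeTeX, xdvipdfmx, or any PDF library.

```python
# Characters the PDF syntax treats as delimiters; they cannot appear
# literally inside a name object.
PDF_DELIMITERS = b"()<>[]{}/%"


def escape_pdf_name(name: str) -> str:
    """Return `name` written as a PDF name object (hypothetical helper)."""
    out = []
    for byte in name.encode("utf-8"):
        needs_escape = (
            byte < 0x21            # whitespace and control characters
            or byte > 0x7E         # anything outside printable ASCII
            or byte == 0x23        # the '#' escape character itself
            or byte in PDF_DELIMITERS
        )
        out.append("#%02X" % byte if needs_escape else chr(byte))
    return "/" + "".join(out)


print(escape_pdf_name("GAKHDJ+(AH) Manal Black"))
```

For the font name from the MWE this prints `/GAKHDJ+#28AH#29#20Manal#20Black`, whereas the file contains `#28AH)` with a literal closing parenthesis, which is what the parsers stumble over.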

Without knowing more about the PDF generating process, my guess is that the Creator/Producer combination given in the file's metadata,

Creator:    XeTeX output 2015.05.01:1207
Producer:   xdvipdfmx (20140317)

is to blame. This is a bug in the PDF generating software.


Update

Maybe I should reveal how I dissected and uncompressed the MWE PDF:

  1. Trying it with QPDF didn't work:

    qpdf --qdf --object-streams=disable mwe_ar.pdf qdf.pdf
    
     object stream 10 (file position 585): unexpected )
    
  2. Trying it with pdftk didn't work either:

    pdftk mwe_ar.pdf output pdftk.pdf uncompress
    
     Error: Unable to find file.
     Error: Failed to open PDF file: 
        mwe_ar.pdf
     Errors encountered.  No output created.
     Done.  Input errors, so no output created.
    
  3. Trying it with MuPDF's mutool also failed:

    mutool clean -d mwe_ar.pdf mutool.pdf
    
     warning: lexical error (unexpected ')')
     error: invalid key in dict
     error: cannot parse dict
     error: cannot open object stream (10 0 R)
     error: cannot load object stream containing object (1 0 R)
     warning: cannot load object (1 0 R) into cache
     warning: lexical error (unexpected ')')
     error: invalid key in dict
     error: cannot parse dict
     error: cannot open object stream (10 0 R)
     error: cannot load object stream containing object (4 0 R)
     error: cannot load object (4 0 R) into cache
    
  4. Finally, as a last resort, PeePDF.py to the rescue:

    $ cat peepdf-commands.txt
    
     object 10
    
    $ peepdf.py -s peepdf-commands.txt mwe_ar.pdf
    
      << /Length 1000
      /N 13
      /Type /ObjStm
      /Filter /FlateDecode
      /First 84 >>
      stream
      9 0 3 72 11 133 2 197 1 312 15 343 4 446 14 625 19 876 6 1344 18 1514 5 1758 7 1886 <</Font<</F1 5 0 R/F2 7 0 R>>/ProcSet[/PDF/Text/ImageC/ImageB/ImageI]>>
      <</Resources 9 0 R/Type/Page/Parent 11 0 R/Contents[8 0 R]>>
      <</Type/Pages/Count 1/Kids[3 0 R]/MediaBox[0 0 595.28 841.89]>>
      <</Creator( XeTeX output 2015.05.01:1207)/Producer(xdvipdfmx \(20140317\))/CreationDate(D:20150501120749+01'00')>>
      <</Pages 11 0 R/Type/Catalog>>
      [417[251]421[257]424[368]443[470]445[355]450[380]480[322]498[480 233]505[461]508[256]514[326]520[264]]
      <</Type/Font/Subtype/CIDFontType2/BaseFont/GAKHDJ+#28AH)#20Manal#20Black/FontDescriptor 14 0 R/CIDSystemInfo<</Registry(Adobe)/Ordering(Identity)/Supplement 0>>/DW 199/W 15 0 R>>
      <</Type/FontDescriptor/Ascent 529/Descent -415/StemV 109/CapHeight 529/AvgWidth 392/FontBBox[-112 -321 1006 1137]/ItalicAngle 0/Flags 6/Style<</Panose<000000000000000000000000>>>/FontName/GAKHDJ+#28AH)#20Manal#20Black/FontFile2 16 0 R/CIDSet 17 0 R>>
      [39[693]41[522]51[535]108[415]124[415]388[218 926]402[1213]406[541]446[586]1886[317]1992[229]2016[366]2021[366]2105[244]2108[244]2139[1006]2150[295]2162[378]2227[379 452]2272[589]2294[176]2300[198]2308[389]2339[343]2356[723]2359[1079]2397[552]2413[346]2457[177]2491[299]2912[349]2952[219]2969[209]2973[148]2976[302]2981[341]3027[168]3149[550]3297[259]3325[292]3726[248]3732[319]3853[411]3893[179]4021[55]4323[104]4627[560]5068[238]5106[476]5322[159]5328[222]6366[93]]
      <</Type/Font/Subtype/CIDFontType2/BaseFont/HTCSVQ+Amiri-Regular/FontDescriptor 18 0 R/CIDSystemInfo<</Registry(Adobe)/Ordering(Identity)/Supplement 0>>/DW 190/W 19 0 R>>
      <</Type/FontDescriptor/Ascent 1123/Descent -635/StemV 87/CapHeight 1123/AvgWidth 685/FontBBox[-581 -900 11467 1815]/ItalicAngle 0/Flags 6/Style<</Panose<000000000500000000000000>>>/FontName/HTCSVQ+Amiri-Regular/FontFile2 20 0 R/CIDSet 21 0 R>>
      <</Type/Font/Subtype/Type0/BaseFont/GAKHDJ+#28AH)#20Manal#20Black/Encoding/Identity-H/DescendantFonts[4 0 R]/ToUnicode 12 0 R>>
      <</Type/Font/Subtype/Type0/BaseFont/HTCSVQ+Amiri-Regular/Encoding/Identity-H/DescendantFonts[6 0 R]/ToUnicode 13 0 R>>
    
      endstream
    

The more often I use PeePDF.py, the more I love it. Thanks, Jose Miguel, for this wonderful tool!
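
For readers curious how the dump above is organized: an object stream starts with /N pairs of integers (object number, byte offset), and the offsets count from the position given by /First. The sketch below splits an already decompressed stream on that basis. The function name `split_object_stream` and the tiny sample data are illustrative assumptions; the code assumes the offsets are listed in increasing order, does not parse the objects themselves, and is not meant to describe how peepdf works internally.

```python
import zlib


def split_object_stream(decoded: bytes, n: int, first: int) -> dict:
    """Map object numbers to their raw bytes (hypothetical helper).

    decoded: stream data after FlateDecode has been undone
    n:       value of /N (number of objects in the stream)
    first:   value of /First (offset of the first object)
    """
    numbers = [int(token) for token in decoded[:first].split()[: 2 * n]]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    objects = {}
    for index, (obj_num, offset) in enumerate(pairs):
        start = first + offset
        # An object runs up to the start of the next one, or to the end.
        if index + 1 < len(pairs):
            end = first + pairs[index + 1][1]
        else:
            end = len(decoded)
        objects[obj_num] = decoded[start:end].strip()
    return objects


# Made-up two-object stream, compressed the way /FlateDecode data is stored.
header = b"9 0 3 9 "
body = b"<</A 1>>\n<</B 2>>\n"
raw = zlib.compress(header + body)

objects = split_object_stream(zlib.decompress(raw), n=2, first=len(header))
print(objects[3])
```

In the dump above, the header pair `4 446` is the seventh of the 13 pairs, which matches the seventh object listed: the CIDFontType2 dictionary whose /BaseFont carries the unescaped `)`.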



Answer 2:

I usually use pdftocairo to fix that:

pdftocairo corruptedinfile.pdf -pdf outfile.pdf

After that, Ghostscript can handle it properly.
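
Since the question also asks for batch processing, here is a minimal sketch that chains the pdftocairo step with the ImageMagick `convert` call from the question for every PDF in a folder. It assumes both tools are on the PATH; the helper names, the `_repaired`/`_image` file suffixes, and the explicit `-density 72` setting are choices made for this example, not something prescribed by either tool. On Windows, make sure `convert` resolves to ImageMagick's executable rather than the system utility of the same name.

```python
import subprocess
from pathlib import Path


def build_commands(src: Path, out_dir: Path) -> list:
    """Return the two command lines for one PDF (hypothetical helper)."""
    repaired = out_dir / (src.stem + "_repaired.pdf")
    image_pdf = out_dir / (src.stem + "_image.pdf")
    return [
        # Step 1: rewrite the PDF with pdftocairo, as suggested above.
        ["pdftocairo", "-pdf", str(src), str(repaired)],
        # Step 2: rasterize with ImageMagick at 72 dpi, as in the question.
        ["convert", "-density", "72", str(repaired), str(image_pdf)],
    ]


def convert_folder(in_dir: Path, out_dir: Path) -> None:
    """Run both steps for every PDF in in_dir; stop at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.pdf")):
        for command in build_commands(src, out_dir):
            subprocess.run(command, check=True)
```

Calling `convert_folder(Path("pdfs"), Path("out"))` would process every PDF in a folder named `pdfs` (both folder names are placeholders). Keeping the command construction in its own function makes it easy to print the commands first and inspect them before anything is run.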