Ghostscript PDF to PDF/A conversion font issues

2019-06-02 04:56发布

I am exploring tools to convert PDF documents to PDF/A. Ghostscript seems to give out of the box support for such a conversion. One issue seems to be that some true type fonts that are a part of the original PDF document are not converted correctly. If I copy a text from the converted PDF/A document, and paste it in notepad, the copied text appears to be garbled text.

The original document text can be copied to notepad just fine.

I am using the following script:

gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=FilteredOutput.pdf Filtered1Page.pdf

I have uploaded a sample 1 page source PDF in Google Drive: SampleInput

A sample output PDF/A document generated from the command is in Google drive here: SampleOutput

Running the above query on this PDF in a windows machine will reproduce the issue.

Are there any settings / commands make the PDF/A conversion to be handled properly?

1条回答
成全新的幸福
2楼-- · 2019-06-02 05:49

Copy and paste from a PDF is not guaranteed. Subset fonts will not have a usable Encoding (such as ASCII or UTF-8), in which case they will only be amenable to cut/paste/search if they have an associated ToUnicode CMap, many PDF files do not contain ToUnicode CMaps.

Of course, the PDF/A specification states (oddly in my opinion) that you should not use subset fonts, but its not always possible to tell whether a font is subset (not all creators follow the XXXXX+ convention), and even if the font isn't subset there still isn't any guarantee that its Encoding is one that is usable.

Looking at the file you have posted, it does not contain one of the fonts it uses (Arial,Bold) and so Ghostscript substitutes with DroidSansFallback, and the font it does contain (FreeSansBold) is a subset (FWIW this font doesn't actually seem to be used....). The fallback font is a CIDFont, so there is no real prospect of the text being 'correct'.

I believe that if you make a real font available to Ghostscript to replace Arial,Bold then it will probably work correctly. This would also fix the rather more obvious problem of the spacing of the characters being incorrect (in one place, wildly incorrect), which is caused by the fallback font having different widths to the original.

NB as the warning messages have already told you don't use -dUseCIEColor.

The fact that you cannot copy/paste/search a PDF does not mean that it is not a valid PDF/A-1b file though, so thsi does not mean that the creation (NOT conversion) of the PDF/A-1b is not 'proper'.

查看更多
登录 后发表回答