iTextSharp library does not extract text from my f

Posted 2019-01-29 12:53

Question:

The iTextSharp library (version 5.5.5) does not extract text from my file, even though I can copy and paste text from the PDF into Notepad. I uploaded the file to this link.

The source code is very simple and it works for other PDF files, but for this problematic file all I get is a string of meaningless characters.

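// Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
var text = string.Empty;
// File.OpenRead is a static method, so it is called without "new"
using (var file = File.OpenRead(path))
{
    using (var reader = new PdfReader(file))
    {
        // Page numbers are 1-based
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
    }
}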

Any help is highly appreciated.

Answer 1:

The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map to allow mapping from character codes to Unicode.

Furthermore, their encoding is Identity-H which is kind of a pseudo-encoding as it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so this still doesn't define a fixed encoding usable for text extraction.

Identity-H may actually only be used with CIDFonts using any Registry, Ordering, and Supplement values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.
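
To illustrate why Identity-H alone is not enough, here is a minimal sketch in plain Java (no iText involved). The class name IdentityHSketch, both method names, and the tiny CID-to-Unicode table are made up for the illustration; the real tables are the ROS-specific resource files and are far larger.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of iText.
class IdentityHSketch {

    // Identity-H: every big-endian 2-byte character code is used unchanged as the CID.
    static List<Integer> codesToCids(byte[] stringBytes) {
        if (stringBytes.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings consist of 2-byte codes");
        }
        List<Integer> cids = new ArrayList<>();
        for (int i = 0; i < stringBytes.length; i += 2) {
            int code = ((stringBytes[i] & 0xFF) << 8) | (stringBytes[i + 1] & 0xFF);
            cids.add(code); // code == CID, nothing more is learned here
        }
        return cids;
    }

    // Turning CIDs into text needs a separate CID-to-Unicode table,
    // which in practice is selected by the font's ROS values.
    static String cidsToText(List<Integer> cids, Map<Integer, String> cidToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            // U+FFFD marks a CID the table cannot resolve
            sb.append(cidToUnicode.getOrDefault(cid, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] shownString = {0x00, 0x22, 0x00, 0x23};
        List<Integer> cids = codesToCids(shownString);
        System.out.println(cids); // [34, 35]

        // Made-up table entries, for illustration only.
        Map<Integer, String> assumedTable = new HashMap<>();
        assumedTable.put(34, "A");
        assumedTable.put(35, "B");
        System.out.println(cidsToText(cids, assumedTable)); // AB

        // Without any table, every CID stays unresolved.
        System.out.println(cidsToText(cids, new HashMap<>()));
    }
}
```

The first step succeeds for any input, which is why extraction "works" but yields meaningless output; only the second step, with the right table for the font's ROS, produces readable text.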

To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.

I only tested this using the Java version of iText as I am more proficient with it.

iText 5.x/Java

The Maven coordinates for the 5.x version of this jar artifact:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>

(As nothing has changed in these resources in recent years, there have been no new releases of this artifact since 5.2.0.)

After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct, I cannot say as I cannot read them.

iTextSharp 5.x/.Net

There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof but I am not sure that that works with a 5.x iTextSharp.)

Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files... I don't know, though, which of them work with the current iTextSharp 5.5.12.

As the OP found out, one additionally has to register the DLLs for iTextSharp (in contrast to iText / Java):

Here is how to tell iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:

// PdfDocument is the name of your own text extraction class
static PdfDocument()
{
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}


Answer 2:

I have an addition to the answer given by @mkl. Here is how to tell iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:

// PdfDocument is the name of your own text extraction class
static PdfDocument()
{
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}


Tags: c# itext