The iTextSharp library (version 5.5.5) does not extract the text from my file.
I can, however, copy and paste the text from the PDF into Notepad.
I uploaded the file to this link.
The source code is very simple and it works for other PDF files, but for this problematic file all I get is a string of meaningless characters.
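// Needs: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
var text = string.Empty;
// File.OpenRead is a static method, so it is called without "new"
using (var file = File.OpenRead(path))
{
    using (var reader = new PdfReader(file))
    {
        // iText page numbers are 1-based
        for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
        {
            text += PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
    }
}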
Any help is highly appreciated.
The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map to allow mapping from character codes to Unicode.
Furthermore, their encoding is Identity-H which is kind of a pseudo-encoding as it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so this still doesn't define a fixed encoding usable for text extraction.
Identity-H may actually only be used with CIDFonts using any Registry, Ordering, and Supplement values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.
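To illustrate why Identity-H alone is not enough, here is a minimal sketch in plain Java (no iText involved). The class name, the method names and the sample bytes are hypothetical; it only shows that each 2-byte code becomes a CID of the same value, and that treating those CIDs as Unicode yields unrelated characters.

```java
import java.util.Arrays;

/** Illustrative only: decodes a PDF string under Identity-H into CIDs. */
final class IdentityHDemo {

    /** Splits raw string bytes into big-endian 2-byte codes; under Identity-H each code is the CID. */
    static int[] toCids(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings consist of 2-byte codes");
        }
        int[] cids = new int[raw.length / 2];
        for (int i = 0; i < cids.length; i++) {
            int hi = raw[2 * i] & 0xFF;      // high byte of the code
            int lo = raw[2 * i + 1] & 0xFF;  // low byte of the code
            cids[i] = (hi << 8) | lo;        // code value == CID value
        }
        return cids;
    }

    /** Naive fallback without a ToUnicode map or ROS tables: reinterpret each CID as a UTF-16 code unit. */
    static String naiveText(int[] cids) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            sb.append((char) cid);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical content-stream bytes, not taken from the sample PDF
        byte[] raw = {0x03, (byte) 0xE9, 0x04, 0x1A};
        int[] cids = toCids(raw);
        System.out.println(Arrays.toString(cids)); // prints [1001, 1050]
        System.out.println(naiveText(cids));       // a Coptic and a Cyrillic letter, not the intended glyphs
    }
}
```

Which Unicode characters CIDs 1001 and 1050 really stand for depends on the font's Registry and Ordering, and that is exactly the information the separate resource files provide.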
To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.
I only tested this using the Java version of iText as I am more proficient with it.
iText 5.x/Java
The Maven coordinates for the 5.x version of this jar artifact:
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>
(As nothing has changed in these resources in recent years, there have been no new 5.x releases of this artifact since 5.2.0.)
After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct, I cannot say as I cannot read them.
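If extraction still returns garbage on the Java side, it may be worth checking that the resource jar is really visible at runtime. The following is a small sketch using only the JDK; the class name is hypothetical, and the resource path is my assumption about where itext-asian 5.x keeps its CMap registry, so verify it against the contents of your jar.

```java
/** Illustrative only: checks whether a named resource is visible on the class path. */
final class AsianResourceCheck {

    // ASSUMPTION: location of the CMap registry inside itext-asian 5.x; check your jar's contents.
    static final String ASSUMED_CMAP_REGISTRY =
            "com/itextpdf/text/pdf/fonts/cmaps/cjk_registry.properties";

    /** Returns true if the current class loader can locate the given resource. */
    static boolean isOnClassPath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        System.out.println("CJK CMap resources found: " + isOnClassPath(ASSUMED_CMAP_REGISTRY));
    }
}
```

A result of false would suggest that the jar is missing from the class path (or that the assumed path does not match your version), which would be consistent with the meaningless output described in the question.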
iTextSharp 5.x/.Net
There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof but I am not sure that that works with a 5.x iTextSharp.)
Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files... I don't know, though, which of them work with the current iTextSharp 5.5.12.
As the OP found out, one additionally has to register the DLLs for iTextSharp (in contrast to iText / Java):
Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
static PdfDocument()
{
    // Register both resource DLLs so iTextSharp can find the CJK mappings
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}
I have an addition to the answer given by @mkl. Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
static PdfDocument()
{
    // Register both resource DLLs so iTextSharp can find the CJK mappings
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}