I need to extract text from PDFs in Romanian. The characters ȚțȘșĂăÎîÂâ are not extracted correctly with PDFBox or Snowtide.
Here is a sample file that is not working: ftp://ftp.logos.md/Biblioteca/_Colectie_RO/2nefon.pdf
Any suggestions?
I'm afraid the PDF the OP pointed at (2nefon.pdf) does not provide the information required for text extraction according to the spec.
Copying and pasting from Adobe Reader also exports the special characters incorrectly, and since Adobe Reader has quite good text extraction capabilities, this is already a bad sign.
Inspecting the file shows the problem. For example, let's look at the title:
The corresponding segment of the content stream is:
Let's check the used font F1:
Thus, the font claims to use WinAnsiEncoding without changes (no Differences).
A last look at the font descriptor:
There is no hint here that the aforementioned WinAnsiEncoding might not be the whole truth.
According to the PDF specification ISO 32000-1
So text extraction and copy and paste are following the specification exactly when they report that the document claims those two lines say:
You might want to check, though, whether e.g. Ă (capital A with breve) is always exported as |. This is not unlikely: mapping special characters to the character codes of symbols was quite common for a time in the last century. If that is indeed the case, a global search and replace after text extraction gives you the desired text.
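As a rough illustration of such a post-extraction search and replace, here is a minimal Java sketch. The class name `RomanianGlyphFixer` and its mapping table are hypothetical. Only the `|` to Ă pair is suggested above, and even that pair has to be verified against the actual document. Any further pairs would have to be found by comparing the extracted text with the rendered pages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: repairs text extracted from a PDF whose font maps
// Romanian letters onto the character codes of ordinary symbols.
class RomanianGlyphFixer {

    // Assumed mapping from "wrong" extracted characters to the intended letters.
    // Only '|' -> 'Ă' (U+0102) is hinted at in the answer; add further pairs
    // after checking them against the rendered document.
    private static final Map<Character, Character> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('|', '\u0102'); // Ă, capital A with breve
    }

    // Returns a copy of the extracted text with every mapped character replaced.
    static String fix(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char mapped = REPLACEMENTS.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input would normally be the string returned by the text extractor.
        System.out.println(fix(args.length > 0 ? args[0] : "|"));
    }
}
```

Note that a blanket replacement like this also rewrites any genuine `|` in the document, so it is only safe if the symbol in question never occurs with its normal meaning.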
How about iText: http://itextpdf.com/
"iText® is an open source library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation."