PDFTextStripper parsing with wrong encoding

2019-03-03 12:27发布

问题:

PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();

result contains something like

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
 SeLe EE rev</title> <meta http-equiv="Content-Type"
 content="text/html; charset=utf-8"> </head> <body> <div
 style="page-break-before:always;
 page-break-after:always"><div><p>&#0;&#1;&#2;&#3;&#4;&#5;&#6;&#7;&#...

instead of

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
 SeLe EE rev</title> <meta http-equiv="Content-Type"
 content="text/html; charset=utf-8"> </head> <body> <div
 style="page-break-before:always; page-break-after:always"><div><p>any
 blablabla characters...

When I changed encoding to windows-1252 or utf-8 result not changed. Bad pdf url http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf

How to parse this pdf?

回答1:

How to parse this pdf?

Short of OCR'ing it you don't.

The PDF in question does not contain the information required to extract text without doing at least some OCR (at least OCR'ing each character of the used font to find a mapping from glyph to character) which would require additional libraries and code.

As a requirement for text extraction the PDF specification ISO 32000-1:2008 correctly states in section 9.10.2 that the font used for the text to extract needs to

  • either contain a ToUnicode CMap — the font used in your document doesn't —
  • or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection — the font used in your document isn't —
  • or be a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font — the font used in your document neither uses one of those predefined encodings nor are the character names in its Differences array from those selections mentioned: the names used are /0, /1, ..., /155.

Generally a good first test is to try and copy&paste text using Adobe Reader as much text extraction experience is in the Reader's code. When trying to do so, you'll see that you only get garbage.