Copy+pasting text from PDF results in garbage

Posted 2019-04-06 01:18

I am writing a Master's thesis on an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. A few PDF files cannot be extracted correctly. For those, the extractor (which uses the PDFBox library) returns a string like this:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

or

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text cannot be copied and pasted from a PDF reader either (I tried Adobe Reader and Foxit Reader). The files display correctly in these readers, but when I select the content and copy it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.

Could anybody help me?

Tags: pdf pdfbox
7 answers
做个烂人
Reply #2 · 2019-04-06 01:45

When the PDF is opened as a Gmail attachment in Chrome (using its built-in PDF viewer), copying yields normal, readable characters!

This worked for me when I had this problem, and for others as well. My guess is that the Chrome PDF viewer automatically applies Google Drive's OCR. It's like magic!
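If the extractor has to decide on its own which files need an OCR-style fallback like this, a rough heuristic can flag the two kinds of output shown in the question. This is only a sketch: the function name `looks_garbled`, the `min_plain_ratio` parameter, and the 0.85 threshold are my own assumptions, and it assumes the documents are mostly plain English (ASCII) text.

```python
import re

# Second failure mode from the question: short numbers joined by single
# letters, e.g. "10a61a91a22a25". Requiring six or more numbers is an
# arbitrary choice.
_CODE_RUN = re.compile(r"(?:\d{1,3}[A-Za-z]){5,}\d{1,3}")

# Punctuation counted as "plain" (assumption: English prose).
_PLAIN_PUNCT = set(".,;:!?'\"()-")


def looks_garbled(text, min_plain_ratio=0.85):
    """Heuristically decide whether extracted text is unusable.

    Returns True for empty text, for digit/letter code runs, or when too
    few characters are ASCII letters, digits, whitespace, or common
    punctuation.
    """
    stripped = text.strip()
    if not stripped:
        return True
    if _CODE_RUN.search(stripped):
        return True
    plain = sum(
        1
        for ch in stripped
        if ch.isascii()
        and (ch.isalnum() or ch.isspace() or ch in _PLAIN_PUNCT)
    )
    return plain / len(stripped) < min_plain_ratio
```

Pages for which `looks_garbled(page_text)` returns True could then be rendered to images and sent through OCR instead of trusting the embedded text. The threshold would need tuning on real files, and non-English text would need a different notion of "plain" characters.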
