Copy+pasting text from PDF results in garbage - Answers, page 2

Posted 2019-04-06 01:18

I am writing a Master's thesis on an NLP system. One of its components is a text extractor.

It extracts plain text from PDF files. A few PDF files cannot be extracted correctly: for them the extractor (the PDFBox library) returns strings like these:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.
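To find the affected files automatically instead of checking each one by hand, one option is a simple word-likeness heuristic on the extracted string. The sketch below is only an illustration and is not part of PDFBox: `GarbledTextCheck`, `wordLikeRatio`, `looksGarbled` and the 0.6 threshold are all hypothetical names and values, and the idea that clean text consists mostly of pure-letter or pure-digit tokens is an assumption that should be checked against real documents.

```java
import java.util.regex.Pattern;

final class GarbledTextCheck {
    // Letters only, allowing inner apostrophes or hyphens ("Master's", "copy-pasted").
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");
    // Digits only, allowing an inner '.' or ',' ("2019", "0.5").
    private static final Pattern NUMBER = Pattern.compile("\\p{Nd}+(?:[.,]\\p{Nd}+)*");
    // ASCII punctuation at either end of a token (\p{Punct} is ASCII-only by default).
    private static final Pattern EDGE_PUNCT = Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    /** Fraction of whitespace-separated tokens that look like words or numbers. */
    static double wordLikeRatio(String text) {
        int total = 0;
        int wordLike = 0;
        for (String raw : text.trim().split("\\s+")) {
            String token = EDGE_PUNCT.matcher(raw).replaceAll("");
            if (token.isEmpty()) {
                continue; // pure punctuation such as "-" carries no signal
            }
            total++;
            if (WORD.matcher(token).matches() || NUMBER.matcher(token).matches()) {
                wordLike++;
            }
        }
        // Assumption: text with no judgeable tokens is treated as suspicious.
        return total == 0 ? 0.0 : (double) wordLike / total;
    }

    /** Hypothetical threshold; tune it on a sample of known-good documents. */
    static boolean looksGarbled(String text) {
        return wordLikeRatio(text) < 0.6;
    }

    public static void main(String[] args) {
        String digits = "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17";
        String clean = "It is extracting a plain text from PDF files.";
        System.out.println(looksGarbled(digits)); // true
        System.out.println(looksGarbled(clean));  // false
    }
}
```

Both sample strings quoted above fall below the threshold: the first has only one word-like token (`d`) out of six, and the second is a single token that mixes digits and letters. Files flagged this way could then be routed to a different extraction path, such as OCR.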

Could anybody help me?

Tags: pdf pdfbox

7 answers

做个烂人

#2 · 2019-04-06 01:45

When the PDF is opened as a Gmail attachment in Chrome (using its built-in PDF viewer), copying yields normal, readable characters!

It worked for me when I had this problem, and for others as well. I think the Chrome PDF viewer runs Google Drive OCR automatically... It's like magic!
