How to extract text from the PDF document? [closed]

Posted 2019-01-03 13:03

How to extract text from the PDF document using PHP?

(I can't use other tools, I don't have root access)

I've found some functions that work for plain text, but they don't handle Unicode characters well:

http://www.hashbangcode.com/blog/zend-lucene-and-pdf-documents-part-2-pdf-data-extraction-437.html

Tags: php pdf text unicode

2 answers

傲

#2 · 2019-01-03 13:27

Download class.pdf2text.php from https://pastebin.com/dvwySU1a (updated 5 April 2014) or http://www.phpclasses.org/browse/file/31030.html (registration required).

Code:

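<?php
require_once 'class.pdf2text.php';   // the downloaded class file, in the same directory

$pdf = new PDF2Text();
$pdf->setFilename('filename.pdf');   // path to the PDF to read
$pdf->decodePDF();                   // parse the file and collect its text
echo $pdf->output();                 // print the extracted text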

The class doesn't work with every PDF I've tested, but give it a try; you may get lucky :)

If the above doesn't work, try http://pdfparser.org/

Python version


虎瘦雄心在

#3 · 2019-01-03 13:35

I know this topic is quite old, but the need is still alive. I read many documents, forums and scripts, and built a new, more advanced one that supports both compressed and uncompressed PDFs:

https://gist.github.com/smalot/6183152
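
To give an idea of what "compressed and uncompressed" means in practice, here is a rough sketch in plain PHP (it only needs the zlib extension, no root access). It is not the code from the gist above. The function names extractPdfStreams, extractTextOperators and sketchPdfToText are made up for this illustration. It assumes the page content sits in streams that are either raw or Flate-compressed, and it only reads literal strings like (Hello) inside BT ... ET text blocks. Hex strings, custom font encodings and ToUnicode maps are not handled, so Unicode text will usually not come out right with this alone.

```php
<?php
// Sketch only: hypothetical helper names, simplified PDF handling.

// Return the body of every "stream ... endstream" section, inflated when possible.
function extractPdfStreams(string $pdf): array
{
    $streams = [];
    if (preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $m)) {
        foreach ($m[1] as $raw) {
            // Flate-compressed streams inflate cleanly; on failure keep the raw bytes.
            $inflated = @gzuncompress($raw);
            $streams[] = ($inflated !== false) ? $inflated : $raw;
        }
    }
    return $streams;
}

// Collect literal strings found inside BT ... ET text objects, one line per object.
function extractTextOperators(string $content): string
{
    $lines = [];
    if (preg_match_all('/BT\b(.*?)\bET\b/s', $content, $blocks)) {
        foreach ($blocks[1] as $block) {
            // Matches ( ... ) including backslash escapes such as \( and \).
            if (preg_match_all('/\((?:\\\\.|[^\\\\()])*\)/s', $block, $strings)) {
                $line = '';
                foreach ($strings[0] as $s) {
                    // Drop the surrounding parentheses and resolve escapes.
                    $line .= stripcslashes(substr($s, 1, -1));
                }
                $lines[] = $line;
            }
        }
    }
    return implode("\n", $lines);
}

// Run both steps over a whole PDF given as a binary string.
function sketchPdfToText(string $pdf): string
{
    $parts = [];
    foreach (extractPdfStreams($pdf) as $stream) {
        $text = extractTextOperators($stream);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode("\n", $parts);
}
```

Usage would be something like echo sketchPdfToText(file_get_contents('filename.pdf')); where filename.pdf is a placeholder. A real parser reads the /Length and /Filter entries of each stream dictionary instead of guessing with a regex, which is one reason the full scripts are much longer.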

Hope it helps everyone.

