What is the best perl module to extract text from

Posted 2020-07-11 05:32

What is the best way to extract text from a PDF?

1 answer
SAY GOODBYE
#2 · 2020-07-11 06:21

The CAM::PDF module is useful for extracting text while keeping some information about where in the document it came from. It installs /usr/local/bin/getpdftext.pl, which demonstrates simple extraction. However, CAM::PDF can only read PDFs that are completely valid.
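As a rough sketch of what page-by-page extraction with CAM::PDF can look like: the helper names pages_text and open_pdf below are made up for illustration, while new, numPages and getPageText are the CAM::PDF methods they rely on. The module is loaded lazily so the script still compiles on a machine where it is not installed.

```perl
use strict;
use warnings;

# Collect text page by page, remembering which page each piece came from.
# Works on any object that offers numPages() and getPageText($n),
# which is what a CAM::PDF object provides.
sub pages_text {
    my ($pdf) = @_;
    my @pages;
    for my $n (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($n);
        push @pages, { page => $n, text => defined $text ? $text : '' };
    }
    return \@pages;
}

# Hypothetical helper: open a file with CAM::PDF, dying if it cannot be parsed.
sub open_pdf {
    my ($path) = @_;
    require CAM::PDF;    # loaded at run time, only when actually needed
    my $pdf = CAM::PDF->new($path)
        or die "CAM::PDF could not parse $path\n";
    return $pdf;
}

# Usage: perl script.pl foo.pdf
if (@ARGV) {
    my $pages = pages_text(open_pdf($ARGV[0]));
    for my $p (@$pages) {
        print "--- page $p->{page} ---\n$p->{text}\n";
    }
}
```

Keeping the page number next to each chunk of text is the simplest form of "where it came from"; anything finer-grained would need more of CAM::PDF's API than is shown here.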

If you are dealing with malformed PDFs, you may need a more lenient parser, such as the pdftotext command-line tool. Running it on foo.pdf writes the text to foo.txt, which you can then read into Perl.
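One way to skip the intermediate foo.txt is to ask pdftotext to write to standard output (a "-" as the output file name) and read that through a pipe. This is only a sketch: it assumes pdftotext is on the PATH and a Unix-like system, and the function name pdf_to_text and its optional command argument are invented here, not part of any module.

```perl
use strict;
use warnings;

# Run an external extractor on a PDF and return its output as one string.
# $cmd is an optional array reference naming the command and its options;
# it defaults to pdftotext with UTF-8 output. The PDF path and "-"
# (meaning "write to stdout") are appended as the last two arguments.
sub pdf_to_text {
    my ($pdf_path, $cmd) = @_;
    my @cmd = $cmd ? @$cmd : ('pdftotext', '-enc', 'UTF-8');

    # List-form pipe open: no shell is involved, so odd file names are safe.
    open(my $fh, '-|', @cmd, $pdf_path, '-')
        or die "Cannot run $cmd[0]: $!\n";
    binmode($fh, ':encoding(UTF-8)');

    my $text = do { local $/; <$fh> };    # slurp everything at once
    $text = '' unless defined $text;

    # close() on a pipe returns false if the command exited non-zero.
    close($fh)
        or die "$cmd[0] failed on $pdf_path (exit status " . ($? >> 8) . ")\n";
    return $text;
}

# Usage: perl script.pl foo.pdf
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print pdf_to_text($ARGV[0]);
}
```

Passing the command as a parameter makes it easy to add options such as -layout, or to swap in a different extractor that follows the same "input file, then output file" argument order.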
