Is it possible to extract text from a PDF file according to a specific font, font size, font color, etc.? I prefer Perl, Python, or *nix command-line utilities. My goal is to extract all headlines from a PDF file so that I will have a nice index of the articles contained in a single PDF.
Answer 1:
You can get the text together with its font, font size, and position (but not color, as far as I checked) from Ghostscript's txtwrite device (try the -dTextFormat=0 or -dTextFormat=1 option), as well as from MuPDF's mudraw with the -tt option. Then parse the XML-like output with, e.g., Perl.
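Since Python was also mentioned in the question, here is a minimal sketch of the parsing step in Python rather than Perl. It assumes the dump consists of `<span font="..." size="...">` elements wrapping `<char c="..."/>` elements, which is roughly what txtwrite emits with -dTextFormat=0; check the output of your own Ghostscript version, since the exact layout may differ. The names `extract_spans`, `headlines`, and `SAMPLE` are made up for illustration, and regular expressions are used instead of an XML parser because the output is only XML-like.

```python
import re
from html import unescape

# Assumed input: a dump produced by something like
#   gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -dTextFormat=0 -sOutputFile=- in.pdf
# The span/char layout below is an assumption; verify it against your own dump.
SPAN_RE = re.compile(r'<span\b([^>]*)>(.*?)</span>', re.S)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
CHAR_RE = re.compile(r'<char\b[^>]*?\bc="([^"]*)"')


def extract_spans(dump):
    """Return a list of (font, size, text) tuples, one per <span>."""
    spans = []
    for attrs, body in SPAN_RE.findall(dump):
        attr = dict(ATTR_RE.findall(attrs))
        # Join the per-character c="..." values back into a string.
        text = unescape("".join(CHAR_RE.findall(body)))
        spans.append((attr.get("font", ""), float(attr.get("size", 0)), text))
    return spans


def headlines(dump, min_size):
    """Keep the text of spans whose font size is at least min_size."""
    return [text for _font, size, text in extract_spans(dump)
            if size >= min_size and text.strip()]


# Hand-written sample in the assumed format (not real Ghostscript output).
SAMPLE = '''<page>
<span bbox="72 84 118 84" font="Helvetica-Bold" size="18.0000">
<char bbox="72 84 85 84" c="N"/>
<char bbox="85 84 95 84" c="e"/>
<char bbox="95 84 108 84" c="w"/>
<char bbox="108 84 118 84" c="s"/>
</span>
<span bbox="72 110 84 110" font="Times-Roman" size="10.0000">
<char bbox="72 110 78 110" c="a"/>
<char bbox="78 110 84 110" c="b"/>
</span>
</page>'''

if __name__ == "__main__":
    for line in headlines(SAMPLE, 14.0):
        print(line)
```

Filtering on the font name instead of the size (for example, keeping only spans whose font contains "Bold") would be a one-line change in `headlines`. A headline split across several spans would still need to be merged, for instance by comparing the bbox coordinates.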