How can I search the content of a PDF file in Linux?

Posted 2019-06-14 17:16

Suppose I am given some journal papers in PDF format. I want to find the title and author list of each paper. How can I do that in a shell script?

Tags: linux shell
2 answers
贼婆χ
#2 · 2019-06-14 17:42

I do not know whether this works for your journal, but it works on some PDF files (those whose metadata is stored uncompressed):

strings "myjournal.pdf" | grep -E "/Author|/Title" | tr '/' '\n' | grep -E "Author|Title"
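
For reuse in a script, the same idea can be wrapped in a small function. This is only a sketch: pdf_meta is a hypothetical name, and it assumes the metadata is stored as uncompressed literal strings such as /Title (Some Paper). Values that contain nested parentheses, or that are hex-encoded or compressed, will be truncated or missed.

```shell
#!/bin/sh
# Sketch: print the /Title and /Author entries of a PDF as "Key: value" lines.
# pdf_meta is a hypothetical helper name, not part of any tool.
# Assumption: the entries are stored uncompressed, e.g. /Title (Some Paper).
pdf_meta() {
    # strings pulls the printable text out of the binary file,
    # grep -o keeps only the matching /Title (...) and /Author (...) entries,
    # sed rewrites each entry as "Title: ..." or "Author: ...".
    strings "$1" |
        grep -E -o '/(Title|Author) ?\([^)]*\)' |
        sed -E 's#^/(Title|Author) ?\((.*)\)$#\1: \2#'
}
```

Usage: pdf_meta "myjournal.pdf"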
Anthone
#3 · 2019-06-14 18:03

I worked on a project where we had to search the content of PDF files. The process we decided to use was the following:

First, we converted the PDF file to an image with the following command:

convert -density 500 "pdf_path.pdf" -depth 8 "image_output.png"

After the image had been created, we used the command below to create a text file with the PDF's content (tesseract appends .txt to the output name):

tesseract "image_output.png" "out_put_txt_file_name" -l por

You will probably have to change the -l por argument, because we were doing this for texts in Portuguese. For English text, for example, use -l eng.
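
The two commands above can be chained and followed by a grep to do the actual search. The sketch below is one possible way to do that, not the exact script we used: pdf_ocr_grep is a hypothetical name, the default language eng and the restriction to the first page (the [0] suffix, where a paper's title and authors normally appear) are assumptions, and it requires ImageMagick's convert and tesseract to be installed.

```shell
#!/bin/sh
# Sketch: OCR the first page of a PDF and search the recognized text.
# pdf_ocr_grep is a hypothetical helper name.
# Usage: pdf_ocr_grep FILE.pdf PATTERN [LANG]   (LANG defaults to eng, an assumption)
pdf_ocr_grep() {
    pdf=$1
    pattern=$2
    lang=${3:-eng}
    workdir=$(mktemp -d) || return 1

    # 1. Render page 0 of the PDF to a PNG (the [0] suffix selects the first page).
    # 2. Run OCR; tesseract writes its result to "$workdir/page.txt".
    # 3. Search the recognized text, ignoring case.
    convert -density 500 "${pdf}[0]" -depth 8 "$workdir/page.png" &&
        tesseract "$workdir/page.png" "$workdir/page" -l "$lang" >/dev/null 2>&1 &&
        grep -i -- "$pattern" "$workdir/page.txt"
    status=$?

    # Remove the temporary image and text file, then report grep's result.
    rm -rf "$workdir"
    return $status
}
```

Usage: pdf_ocr_grep "pdf_path.pdf" "some phrase" por

The function returns a non-zero status when the pattern is not found or when one of the tools fails, so it can be used directly in an if statement.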
