Find string inside pdf with shell

I'd like to know whether there is a way to check, from a shell script, if a PDF file contains a given string. I was looking for something like:

if [search(string,pdf_file)] > 0 then  
   echo "exist"
fi

Tags: linux bash shell unix pdf

3 answers

小情绪 Triste *

#2 · 2019-05-29 09:51

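Each letter within a PDF document is typically positioned individually, so the file does not store words or sentences as contiguous text. You therefore have to convert the .pdf to text first, which reduces the content to a simple stream that can be searched.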

I would try this:

grep -q 'a \+string' <(pdftotext some.pdf - | tr '\n' ' ') && echo exists

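The tr replaces line breaks with spaces, so a phrase that wraps across lines still matches. The \+ (a GNU grep extension in basic regular expressions) allows for one or more space characters between words. Finally, grep -q prints nothing; it only reports through its exit status (0 for a match, 1 for none) whether the pattern was found.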
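
To tie this back to the if-style check in the question, the same pipeline can be wrapped in a small function. This is only a sketch under a few assumptions: the name pdf_contains is made up for illustration, pdftotext (from poppler-utils) is assumed to be installed, and the search string is treated as a grep basic regular expression.

```shell
# pdf_contains is a hypothetical helper name, not part of any tool.
# Usage: pdf_contains PATTERN FILE.pdf
# Exit status is 0 if the extracted text matches PATTERN, non-zero otherwise.
pdf_contains() {
    local pattern="$1" file="$2"
    # Join lines so a phrase broken across a line break still matches.
    pdftotext -q "$file" - | tr '\n' ' ' | grep -q -- "$pattern"
}

# Example (assumes some.pdf exists):
# if pdf_contains 'a \+string' some.pdf; then
#     echo "exist"
# fi
```

Because the function only sets an exit status, it can be used directly in if, while, or && chains without comparing against a number.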


孤傲高冷的网名

#3 · 2019-05-29 09:57

As Simon nicely pointed out, you can simply convert the PDF to plain text using pdftotext and then search the result for whatever you're looking for.

After conversion, you may use grep, bash regex, or any variation you want:

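# Read the extracted text line by line; -r keeps backslashes literal.
while IFS= read -r line; do
    # Match an ISO-style date such as 2019-05-29.
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)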
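
The same date check can also be written with grep alone, which makes it easy to print the matched dates instead of only reporting that one was found. A minimal sketch, assuming pdftotext is installed; the function name pdf_dates is invented here for illustration.

```shell
# pdf_dates is a hypothetical helper: print every YYYY-MM-DD-looking
# string found in the text of a PDF, one match per line.
pdf_dates() {
    # -E: extended regular expressions, -o: print only the matching part
    pdftotext -q "$1" - | grep -Eo '[0-9]{4}(-[0-9]{2}){2}'
}

# Example (assumes infile.pdf exists):
# pdf_dates infile.pdf | sort -u
```

Note that the pattern only checks the shape of the string, so something like 2019-99-99 would also be reported.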


叛逆

#4 · 2019-05-29 10:07

This approach converts the .pdf files page by page, so each occurrence of the search string $query can be reported together with the page it appears on.

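# Search all PDF files in the current directory for $query, page by page.
# $query must already hold the search pattern.
for i in *.pdf; do
    # Anchor on "Pages:" so other pdfinfo fields cannot be picked up by mistake.
    pagenr=$(pdfinfo "$i" | grep "^Pages:" | grep -o "[0-9][0-9]*")
    fileid="\n$i\n"   # printed once, before the first match in this file
    for (( p=1; p<=pagenr; p++ )); do
        matches=$(pdftotext -q -f "$p" -l "$p" "$i" - | grep --color=always -in -- "$query")
        if [ -n "$matches" ]; then
            echo -e "${fileid}PAGE: $p"
            echo "$matches"
            fileid=""
        fi
    done
done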

pdftotext -f $p -l $p restricts the conversion to the single page with number $p. grep --color=always forces colored match highlighting even though the output is captured in a variable rather than written to a terminal, so the highlights survive the subsequent echo. Resetting fileid to "" makes sure the file name of the .pdf document is printed only once, however many of its pages match.
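
As a variation, pdftotext separates pages in its output with a form feed character, so a single conversion per file can be split into pages afterwards instead of running pdftotext once per page. The sketch below is an alternative, not the method of the answer above: the function name pdf_grep_pages is made up, it does a case-insensitive fixed-string match rather than a regular expression match, and it assumes an awk that accepts a single-character RS.

```shell
# pdf_grep_pages is a hypothetical helper name.
# Usage: pdf_grep_pages QUERY FILE.pdf
# Prints "FILE: page N: matching line" for every line containing QUERY.
# Caveat: awk -v interprets backslash escapes in QUERY and FILE.
pdf_grep_pages() {
    local query="$1" file="$2"
    pdftotext -q "$file" - | awk -v q="$query" -v f="$file" '
        BEGIN { RS = "\f"; q = tolower(q) }   # one awk record per PDF page
        {
            n = split($0, line, "\n")
            for (i = 1; i <= n; i++)
                if (index(tolower(line[i]), q) > 0)
                    printf "%s: page %d: %s\n", f, NR, line[i]
        }'
}

# Example (assumes PDFs in the current directory and $query set):
# for i in *.pdf; do pdf_grep_pages "$query" "$i"; done
```

The trade-off is one pdftotext process per file instead of one per page, at the cost of losing grep's regular expressions and colored output.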
