Is there any way to check whether a PDF file contains a given string
using a shell script? I was looking for something like:
if [search(string,pdf_file)] > 0 then
echo "exist"
fi
As Simon nicely pointed out, you can simply convert the PDF to plain text using pdftotext and then search for what you're looking for.
After conversion, you can use grep, a bash regex, or any variation you want:
# read the extracted text line by line; -r keeps backslashes literal
while IFS= read -r line; do
    if [[ ${line} =~ [0-9]{4}(-[0-9]{2}){2} ]]; then
        echo ">>> Found date;"
    fi
done < <(pdftotext infile.pdf -)
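For the plain yes/no check asked about in the question, grep's exit status can be tested directly in an if statement. The following is a minimal sketch, assuming pdftotext (from poppler-utils) is installed; the function name pdf_contains is made up for this example.

```shell
#!/usr/bin/env bash

# pdf_contains STRING FILE
# Succeeds (exit status 0) if STRING occurs in the text extracted from FILE.
# NOTE: the function name is hypothetical, chosen for this example.
pdf_contains() {
    local needle=$1 file=$2
    # pdftotext -q suppresses messages; "-" sends the text to stdout.
    # grep -q prints nothing, -F treats the needle as a fixed string,
    # and "--" protects needles that start with a dash.
    pdftotext -q "$file" - | grep -qF -- "$needle"
}

# Usage:
#   if pdf_contains "some string" infile.pdf; then
#       echo "exist"
#   fi
```

The pipeline's exit status is that of grep, so the function can be used anywhere a command is expected. Drop -F if the search term should be treated as a regular expression.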
This approach converts the .pdf files page by page, so the occurrences of the search string $query
can be located more precisely.
# search for query string in available pdf files pagewise
# (expects the search string in $query)
for i in *.pdf; do
    # anchor on "Pages:" so other pdfinfo lines that mention "Pages" are ignored
    pagenr=$(pdfinfo "$i" | grep "^Pages:" | grep -o "[0-9][0-9]*")
    fileid="\n$i\n"
    for (( p=1; p<=pagenr; p++ )); do
        matches=$(pdftotext -q -f "$p" -l "$p" "$i" - | grep --color=always -in "$query")
        if [ -n "$matches" ]; then
            echo -e "${fileid}PAGE: $p"
            echo "$matches"
            fileid=""
        fi
    done
done
pdftotext -f $p -l $p limits the range to be converted to the single page with the number $p. grep --color=always keeps the match highlighting even though grep's output is captured in a variable instead of going straight to the terminal, so the highlights survive the subsequent echo. fileid="" just makes sure the file name of the .pdf document is printed only once when a file has matches on several pages.
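As a possible variation, the per-page loop can be replaced by a single conversion: by default pdftotext ends each page with a form feed character in its output (the -nopgbrk option turns this off), so the text can be split into pages afterwards. The sketch below assumes that default behaviour; the function name pages_with_match is made up for this example, and it does a case-insensitive fixed-string match rather than a regular expression search.

```shell
#!/usr/bin/env bash

# pages_with_match QUERY FILE
# Prints the number of every page of FILE whose text contains QUERY.
# NOTE: the function name is hypothetical, chosen for this example.
pages_with_match() {
    local query=$1 file=$2
    # Convert the whole document once. awk then treats each
    # form-feed-terminated chunk as one record, so NR is the page number.
    # Caveat: awk -v interprets backslash escapes in the query.
    pdftotext -q "$file" - |
        awk -v q="$query" '
            BEGIN { RS = "\f" }
            index(tolower($0), tolower(q)) > 0 { print NR }
        '
}

# Usage:
#   for i in *.pdf; do
#       pages=$(pages_with_match "$query" "$i")
#       [ -n "$pages" ] && printf '%s: pages %s\n' "$i" "$(echo $pages)"
#   done
```

This starts pdftotext once per file instead of once per page, at the cost of reporting only page numbers and not the matching lines.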
Each letter within a PDF document is typically positioned individually, so a phrase may not be stored as one contiguous string in the file. Therefore, you have to convert the .pdf to text, which reduces the content to a simple text stream.
I would try this:
grep -q 'a \+string' <(pdftotext some.pdf - | tr '\n' ' ') && echo exists
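The tr replaces line breaks with spaces, so a phrase that wraps across lines can still match. The \+ allows one or more space characters between words. Finally, grep -q does not print matching lines; it only reports through its exit status whether there was a match (0) or not (1).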
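The same idea can be wrapped in a function that normalizes whitespace on both sides, so the phrase does not need to be written as a regular expression. This is only a sketch, assuming pdftotext is installed; the function name pdf_contains_phrase is made up for this example.

```shell
#!/usr/bin/env bash

# pdf_contains_phrase PHRASE FILE
# Succeeds if PHRASE occurs in FILE, treating any run of whitespace
# (spaces, tabs, line breaks, page breaks) as a single space.
# NOTE: the function name is hypothetical, chosen for this example.
pdf_contains_phrase() {
    local file=$2 phrase
    # Collapse whitespace runs in the phrase itself.
    phrase=$(printf '%s' "$1" | tr -s '[:space:]' ' ')
    # Collapse whitespace runs in the extracted text, then look for
    # the phrase as a fixed string; grep -q only sets the exit status.
    pdftotext -q "$file" - | tr -s '[:space:]' ' ' | grep -qF -- "$phrase"
}

# Usage:
#   pdf_contains_phrase "a string" some.pdf && echo exists
```

Because the text ends up on a single line, this variant only answers whether the phrase exists; it cannot report where in the document it was found.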