How do I make a Bash shell script that can identify all the .jpg, .gif, and .png files, and then identify which of these files are not linked via url(), href, or src in any text file in a folder?
Here's what I started with, but I end up getting the inverse of what I want. I don't want the referenced images, but the unreferenced (aka "orphaned") ones:
# Change MYPATH to the path where you have the project
find MYPATH -name '*.jpg' -exec basename {} \; > /tmp/patterns
find MYPATH -name '*.png' -exec basename {} \; >> /tmp/patterns
find MYPATH -name '*.gif' -exec basename {} \; >> /tmp/patterns
# Print a list of lines that reference these files
# The cat command simply removes coloring
grep -Rf /tmp/patterns MYPATH | cat
# great -- but how do I print the lines of /tmp/patterns *NOT* listed in any given
# *.php, *.css, or *.html?
With drysdam's help, I created this Bash script, which I named orphancheck.sh and run with "./orphancheck.sh myfolder".
#!/bin/bash
MYPATH=$1
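# Quote the globs so the shell passes them to find unexpanded
find "$MYPATH" -name '*.jpg' -exec basename {} \; > /tmp/patterns
find "$MYPATH" -name '*.png' -exec basename {} \; >> /tmp/patterns
find "$MYPATH" -name '*.gif' -exec basename {} \; >> /tmp/patterns
# Read one name per line so names containing spaces survive
while IFS= read -r p; do
# -q: print nothing, -F: match the name literally instead of as a regex
grep -RqF -- "$p" "$MYPATH" || echo "$p"
done < /tmp/patterns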
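The loop above searches every file under the folder, including the images themselves. Here is a minimal sketch of narrowing the search to the text files the question mentions. It assumes GNU grep (for `--include`), assumes that `.php`, `.css`, and `.html` are the only files that can reference an image, and assumes file names contain no newlines. The function name `list_orphans` is made up for illustration.

```shell
#!/bin/bash
# list_orphans DIR: print the path of every image under DIR whose basename
# does not appear in any .php, .css, or .html file under DIR.
# Assumption: GNU grep is installed (--include is a GNU extension).
list_orphans() {
    find "$1" -type f \( -name '*.jpg' -o -name '*.gif' -o -name '*.png' \) |
    while IFS= read -r path; do
        name=$(basename "$path")
        # -F: treat the name as a fixed string, so the dot is not a wildcard
        # -q: print nothing and stop at the first match
        grep -RqF --include='*.php' --include='*.css' --include='*.html' \
            -e "$name" "$1" || printf '%s\n' "$path"
    done
}

# Usage (hypothetical folder name):
#   list_orphans myfolder
```

This prints full paths rather than bare names, which makes it easier to see which copy of a file is unreferenced when the same name exists in more than one directory.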
I'm a little late to the party (I found this page while looking for the answer myself), but in case it's useful to someone, here is a slightly modified version that returns the path with the filename (and searches for a few more file types):
#!/bin/bash
if [ $# -eq 0 ]
then
echo "Please supply path to search under"
exit 1
fi
MYPATH=$1
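# Quote the globs so the shell passes them to find unexpanded
find "$MYPATH" -name '*.jpg' > /tmp/patterns
find "$MYPATH" -name '*.png' >> /tmp/patterns
find "$MYPATH" -name '*.gif' >> /tmp/patterns
find "$MYPATH" -name '*.js' >> /tmp/patterns
find "$MYPATH" -name '*.php' >> /tmp/patterns
# Read one path per line so paths containing spaces survive
while IFS= read -r p; do
f=$(basename "$p")
# -q: print nothing, -F: match the file name literally instead of as a regex
grep -RqF -- "$f" "$MYPATH" || echo "$p"
done < /tmp/patterns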
It's important to note, though, that you can get false positives just looking at the code statically like this, because code might dynamically create a filename that is then referenced (and expected to exist). So if you blindly delete all files whose paths are returned by this script, without some knowledge of your project, you might regret it.
find . -type f \( -name '*.jpg' -o -name '*.gif' -o -name '*.png' \) -exec basename {} \; > /tmp/patterns
grep -f /tmp/patterns *html
The first line recursively finds all images (your problem is ill-specified, so I thought I'd be a little general) and strips off the directory portion using basename. The result is saved as a list of patterns. Then grep is run with that list against all the HTML files.
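Those two commands print the lines that do reference an image, which is the opposite of what the question asks for. One possible way to flip the result, sketched under the assumption that GNU grep is available: collect the names that were actually matched with `-o`, then print the patterns that are not in that set. The function name `unreferenced` and its arguments are hypothetical.

```shell
#!/bin/bash
# unreferenced PATTERNS FILE...: print each line of PATTERNS (one image name
# per line) that does not occur in any of the given text files.
# Assumption: GNU grep, where an empty pattern file matches nothing.
unreferenced() {
    local patterns=$1
    shift
    local seen
    seen=$(mktemp)
    # -o prints only the matched text, -h drops file name prefixes,
    # -F treats each pattern as a fixed string rather than a regex
    grep -ohFf "$patterns" "$@" | sort -u > "$seen"
    # -v inverts the match, -x requires the whole line to match
    grep -vxFf "$seen" "$patterns"
    rm -f "$seen"
}

# Usage, with the pattern list built by the first command:
#   unreferenced /tmp/patterns *html
```

Because the comparison is done on whole lines with `-x`, a name such as `logo.png` is not counted as referenced just because `big-logo.png` appears in a page.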