I have a bash script to check the HTTP status code of a list of URLs, but I realize that some of them, while reporting "200", actually display a page containing "error 404". How could I check for that?
Here's my current script:
#!/bin/bash
# Print the HTTP status code for each URL in url-list.txt
while read -r line; do
    curl -o /dev/null --silent --head --write-out '%{http_code}\n' "$line"
done < url-list.txt
(I got it from a previous question: script to get the HTTP status code of a list of urls ?)
EDIT: There seems to be a bug in the script: it returns "200", but if I wget -o log that same address I get "404 Not Found".
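A likely cause of that mismatch: --head makes curl send a HEAD request, while wget issues a GET, and some servers answer the two methods differently. As a sketch, here is the same loop issuing a GET instead (the body is downloaded but thrown away):
#!/bin/bash
# Same loop, but with a GET request (no --head), matching what wget sends.
while read -r line; do
    curl -o /dev/null --silent --write-out '%{http_code}\n' "$line"
done < url-list.txt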
Just for fun, here is a pure-Bash solution:
# Print a human-readable line for one URL, based on its status code.
dosomething() {
    code="$1"; url="$2"
    case "$code" in
        200) echo "OK for $url";;
        302) echo "redir for $url";;
        404) echo "notfound for $url";;
        *)   echo "other $code for $url";;
    esac
}
# MAIN program
while read -r url
do
    # split "http://host/path" into host and path
    uri=($(echo "$url" | sed 's~http://\([^/][^/]*\)\(.*\)~\1 \2~'))
    HOST=${uri[0]:=localhost}
    FILE=${uri[1]:=/}
    # open a TCP connection to port 80 via bash's /dev/tcp pseudo-device
    exec {SOCKET}<>"/dev/tcp/$HOST/80"
    # HTTP wants CRLF line endings; "Connection: close" makes the server
    # close the connection, so reading from the socket cannot hang
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$FILE" "$HOST" >&${SOCKET}
    # keep only the headers and pick out the status line, e.g. "HTTP/1.1 200 OK"
    res=($(<&${SOCKET} sed '/^.$/,$d' | grep '^HTTP'))
    exec {SOCKET}>&-
    dosomething "${res[1]}" "$url"
done << EOF
http://stackoverflow.com
http://stackoverflow.com/some/bad/url
EOF
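Note that this relies on two bashisms: the automatic file-descriptor allocation with {SOCKET} needs bash 4.1 or newer, and the /dev/tcp/HOST/PORT pseudo-device only works if bash was compiled with network redirections enabled (the default on most distributions). To run it against your url-list.txt instead of the inline list, replace the here-document with done < url-list.txt.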
Well, you could search the response body for "404", "Error 404", "Not Found", "404 Not Found" and so on printed in plain text, but that is likely to give both false negatives and false positives. Then again, if the server sends 200 for what is supposed to be a 404, somebody didn't do their job right.
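If you do want to scan the body anyway, here is a rough sketch; the pattern list is a guess and will need tuning for the sites you check, and soft-404 pages that phrase their error differently will slip through:
#!/bin/bash
# Fetch each URL once, keeping both the status code and the body,
# and flag pages that claim 200 but look like an error page.
while read -r url; do
    tmp=$(mktemp)
    code=$(curl --silent --location -o "$tmp" --write-out '%{http_code}' "$url")
    if [ "$code" = "200" ] && grep -qiE 'error 404|404 not found|page not found' "$tmp"; then
        echo "soft 404: $url"
    else
        echo "$code for $url"
    fi
    rm -f "$tmp"
done < url-list.txt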