Check if a URL goes to a page containing the text

Posted 2019-04-12 06:17

Question:

I have a bash script that checks the HTTP status code of a list of URLs, but I realized that some of them, while returning "200", actually display a page containing "error 404". How can I check for that?

Here's my current script:

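#!/bin/bash
# Print the HTTP status code for each URL in url-list.txt.
# IFS= and -r keep whitespace and backslashes in each line intact.
while IFS= read -r LINE; do
  curl -o /dev/null --silent --head --write-out '%{http_code}\n' "$LINE"
done < url-list.txt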

(I got it from a previous question: script to get the HTTP status code of a list of urls ?)

EDIT: There seems to be a bug in the script: it returns "200", but if I run wget -o log on that same address I get "404 not found".

Answer 1:

Just for fun, here is a pure Bash solution:

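# Report what a status code means for a given URL.
dosomething() {
        code="$1"; url="$2"
        case "$code" in
                200) echo "OK for $url";;
                302) echo "redir for $url";;
                404) echo "notfound for $url";;
                *) echo "other $code for $url";;
        esac
}

#MAIN program
while read -r url
do
        # Split the URL into host and path (plain http:// URLs only).
        uri=($(echo "$url" | sed 's~http://\([^/][^/]*\)\(.*\)~\1 \2~'))
        HOST=${uri[0]:=localhost}
        FILE=${uri[1]:=/}
        # Open a read/write TCP connection to port 80 (needs bash 4.1+ for {SOCKET}).
        exec {SOCKET}<>/dev/tcp/$HOST/80
        # HTTP requires CRLF line endings; "Connection: close" makes the server
        # end the stream so the read below terminates.
        printf 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$FILE" "$HOST" >&${SOCKET}
        # Keep only the headers, then pick out the status line.
        res=($(<&${SOCKET} sed '/^.$/,$d' | grep '^HTTP'))
        exec {SOCKET}>&-
        dosomething "${res[1]}" "$url"
done << EOF
http://stackoverflow.com
http://stackoverflow.com/some/bad/url
EOF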


Answer 2:

Well, you could scan the response body for strings such as "404", "Error 404", "Not Found", or "404 Not Found" printed in plain text, but that is likely to give both false positives (a valid page that merely mentions "404") and false negatives (an error page worded differently). That said, if the server sends 200 for what is supposed to be a 404, somebody didn't do their job right.
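
A minimal sketch of that body-scanning idea, for illustration only. The function names classify_response and check_url, the phrase list, and the 10-second timeout are assumptions, not part of the original thread. It issues a GET rather than using --head so that the body is available; a server answering HEAD and GET differently might also explain the mismatch with wget noted in the question, though that is a guess. The caveat about false positives and false negatives still applies.

```shell
#!/bin/bash

# Hypothetical helper: classify a response from its status code and body.
# The phrase list is an assumption; tune it to the sites being checked.
classify_response() {
    local code="$1" body="$2"
    if [ "$code" != "200" ]; then
        echo "http-$code"
    elif grep -Eqi 'error 404|404 not found|page not found' <<< "$body"; then
        echo "soft-404"
    else
        echo "ok"
    fi
}

# Hypothetical helper: fetch a URL with GET and print its classification.
check_url() {
    local url="$1" out code body
    # --write-out appends the status code on its own final line after the body.
    out=$(curl --silent --location --max-time 10 --write-out '\n%{http_code}' "$url")
    code=${out##*$'\n'}   # everything after the last newline
    body=${out%$'\n'*}    # everything before the last newline
    echo "$(classify_response "$code" "$body") $url"
}

# When a file name is given as the first argument, check every URL in it.
if [ -n "${1:-}" ]; then
    while IFS= read -r url; do
        check_url "$url"
    done < "$1"
fi
```

Saved under a name of your choosing and run with url-list.txt as its argument, this prints one line per URL: "ok", "soft-404", or "http-" followed by the status code, then the URL itself.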