I've been trying so many things today and I am just not winning. One file on my site was created by accident with a special character in its name. As a result, Googlebot has not crawled the site for 3 weeks now, and Webmaster Tools / Search Console keeps notifying me and asking me to retest the URL.
All I want is to configure Nginx to match the following requests and redirect them to the correct location, but the regex has me stumped on this one.
The unencoded URL string is:
/historical-rainfall-trends-south-africa-1921–2015.pdf
The encoded URL string is:
/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf
How can I get a location match for these?
UPDATE:
Still losing my mind, nothing I have tried is working. I get a match with this regex here - https://regex101.com/r/3Lk2zr/3
but then using this
location ~ /.*[^\x00-\x7F]+.* {
    return 444;
}
still gives me a 404 and not a 444
Likewise I get a match with this - https://regex101.com/r/80KWJ8/1 But then
location ~ /.*([^?]*)\%(.*)$ {
    return 444;
}
Gives 404 and not 444
Your solution is terrible, let me tell you why.
Every single request that matches this location block now has to be evaluated against two if conditions before being served.
Any request that matches is then redirected to the correct URL, which also matches this location block, so your server now evaluates those two if conditions a second time.
Just for fun, you are also making Nginx evaluate requests for image, CSS and JS files against your if conditions. None of them will match, because the only file you care about is a PDF, but you are still adding two extra condition checks to every one of those requests.
A much more Nginx friendly solution is actually very simple.
Nginx tests regex locations in the order they are listed in your config and uses the first one that matches, so if this file's URL would match any of your other regex locations, you need to place the block for this file above those locations.
Just tested it on one of my servers running Nginx 1.15.1, works a charm.
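The location block this answer refers to is not shown above. As a minimal sketch of the idea, assuming the file is requested under the same path as the redirect target used elsewhere in this thread, a dedicated regex location might look like the following. The pattern and its placement are assumptions, not the answerer's exact block.

```nginx
# Hypothetical sketch: put this inside the server { } block, above any other
# regex locations (location ~ or ~*) that could also match the broken URL.
# Nginx tests location regexes against the decoded URI, so ".+" stands in for
# whatever bytes sit between "1921" and "2015" in the bad file name.
location ~ /historical-rainfall-trends-south-africa-1921.+2015\.pdf$ {
    # Redirect target copied from the asker's config. It contains
    # "_1921_2015", so it does not match the pattern above.
    return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
}
```

Because the corrected URL does not match the pattern, the redirect cannot loop, and no if conditions are needed.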
Temporary Solution
Thanks to @funilrys and also to this question: "How do I redirect all requests that contains a certain string to 404 in nginx?"
This works now 100%
location /resources {
    expires 3h;
    add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';

    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 3h;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';
    }

    location ~* \.(pdf)$ {
        expires 30d;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=2592000';
        if ($request_uri ~ .*%.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }
        if ($request_uri ~ .*[^\x00-\x7F]+.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }
    }
}
I don't know Nginx or how it handles regex, but:
You could try to match the percent sign in the encoded URL with:
%+
You could try to match for the special chars in the encoded URL with:
(%([A-Z][0-9]|[0-9][A-Z]|[0-9]+|[A-Z]+))+
You could try to match for non-ASCII chars in the unencoded URL with:
[^\x00-\x7F]+
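As a rough sketch of how patterns like these might be wired into Nginx (the placement here is an assumption, not something the answer above covers): location regexes are matched against the decoded URI, while $request_uri holds the original request URI with its percent-encoding intact. That suggests testing the non-ASCII pattern in a location and the percent pattern against $request_uri. The sketch uses a stricter hex-pair pattern in place of the broader alternation above, and the 444 responses simply mirror the tests in the question.

```nginx
# Sketch only; both blocks are assumed to sit inside the existing server { }.

# Decoded URI: the bad file name shows up here as raw non-ASCII bytes,
# so a literal "%" in a location regex would not see the encoded form.
location ~ [^\x00-\x7F] {
    return 444;
}

# Raw request: percent-escapes are still visible in $request_uri.
# The pattern is quoted because it contains braces.
# Diagnostic only: this rejects every request that contains a percent-escape.
if ($request_uri ~ "%[0-9A-Fa-f]{2}") {
    return 444;
}
```

If the location version still returns 404, one thing to check is whether another regex location listed earlier in the config is matching the request first.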
Proofs: