Nginx location match regex for special characters

Posted 2019-08-07 15:12

I've tried so many things today and nothing has worked. One file on my site was created by accident with a special character in its name. As a result, Googlebot has stopped crawling for 3 weeks now, and Webmaster Tools / Search Console keeps notifying me and wanting to retest the URL.

All I want is to configure Nginx to match the following requests and redirect them to the correct location, but the regex has me stumped on this one.

The unencoded URL string is:

/historical-rainfall-trends-south-africa-1921–2015.pdf

The encoded URL string is:

/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf

How can I get a location match for these?

UPDATE:

Still losing my mind; nothing I have tried is working. I get a match with this regex: https://regex101.com/r/3Lk2zr/3

but then using this

location ~ /.*[^\x00-\x7F]+.* { return 444; }

still gives me a 404 and not a 444

Likewise, I get a match with this: https://regex101.com/r/80KWJ8/1, but then

location ~ /.*([^?]*)\%(.*)$ { return 444; }

gives a 404 and not a 444.

3 answers
Luminary・发光体
#2 · 2019-08-07 16:08

Your solution (the if-based "Temporary Solution" config) is terrible; let me tell you why.

Every single request which matches this location block now has to be evaluated against two if conditions before being served.

Any request that matches then gets redirected to the correct URL, which also matches this location block, so your server now evaluates those two if conditions a second time.

Unless those if conditions are confined to a PDF-only location, you are also making Nginx evaluate requests for image, CSS and JS files against them. None of those will match, as you only care about one PDF, but every such request still pays for two extra regex tests.

A much more Nginx-friendly solution is actually very simple.

Nginx tests regex locations in the order they are listed in your config and uses the first one that matches, so if this file's URL would match any of your other regex locations, you need to place this block above them:

location ~* /historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$ {
    return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
}

Just tested it on one of my servers running Nginx 1.15.1, works a charm.
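
To make the ordering point concrete, here is a minimal sketch of a server block. The listen port, the server_name and the generic PDF location are assumptions for illustration, loosely modelled on the config in the question; only the first location comes from the answer above.

```nginx
# Sketch only; listen/server_name and the generic PDF rule are assumptions.
server {
    listen 80;
    server_name example.com;

    # Listed first, so it is tested before any other regex location.
    location ~* /historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$ {
        return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
    }

    # Generic PDF rule (assumed). It would also match the broken URL,
    # which is why it has to come after the block above.
    location ~* \.pdf$ {
        expires 30d;
    }
}
```

The redirect target contains "africa_1921" rather than "africa-1921", so it does not match the first block again and falls through to the generic PDF rule instead of looping.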

Lonely孤独者°
#3 · 2019-08-07 16:09

Temporary Solution

Thanks to @funilrys and also to this question: "How do I redirect all requests that contains a certain string to 404 in nginx?"

This now works 100%:

location /resources {
    expires 3h;
    add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';

    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 3h;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';
    }

    location ~* \.(pdf)$ {
        expires 30d;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=2592000';

        # Redirect any PDF request whose raw URI contains a percent sign
        if ($request_uri ~ .*%.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }
        # Redirect any PDF request whose raw URI contains a non-ASCII byte
        if ($request_uri ~ .*[^\x00-\x7F]+.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }
    }
}
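
A likely reason the $request_uri tests work where the earlier location regex with % did not: Nginx matches locations against the normalized URI, after decoding %XX sequences, while $request_uri holds the original, still-encoded request. A minimal sketch of that difference follows; the port and the two status codes are arbitrary assumptions, chosen only so the two branches can be told apart.

```nginx
# Sketch only; port 8080 and the status codes are assumptions for illustration.
server {
    listen 8080;

    # Location regexes see the decoded path. A request for /a%C3%A2.pdf is
    # matched here as /aâ.pdf, so this block is chosen only when the decoded
    # path still contains a literal "%" (for example a request for /a%25.pdf).
    location ~ "%" {
        return 444;
    }

    location / {
        # $request_uri is the request as sent, so the "%" of any
        # percent-encoded sequence is still visible here.
        if ($request_uri ~ "%") {
            return 410;
        }
    }
}
```

Because $request_uri also includes the query string, a % in the arguments would trigger the second test as well.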

对你真心纯属浪费
#4 · 2019-08-07 16:12

I don't know Nginx or the way it handles regex, but:

  • You could try to match for percent in the encoded URL with:

    %+

  • You could try to match for the special chars in the encoded URL with:

    (%([A-Z][0-9]|[0-9][A-Z]|[0-9]+|[A-Z]+))+

  • You could try to match for non-ASCII chars in the unencoded URL with:

    [^\x00-\x7F]+

