Facebook externalhit_uatext robot lowercasing URLs

Posted: 2019-05-07 08:36

Question:

I'm working on a site that has mixed-case URLs, similar to YouTube. We generate IDs on the server, and I chose base 62 (digits plus lowercase and uppercase letters) so they would be shorter. A URL might look like example.com/user/123AbCaBc. The Facebook robot seems to be hitting my site regularly with an all-lowercase version, example.com/user/123abcabc. This causes a 404 error because the all-lowercase ID isn't in the database.

According to the logs, there aren't other user agents creating 404s, so this is for sure a robot and not a human. Here's the user agent I'm seeing:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

This happens about once every 4 minutes. I'm not currently logging non-404 hits, so I don't know whether the robot also requests the correctly cased URLs.
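One way to close that gap is to log every request from this user agent, not just the 404s, together with the status code and any Range header it sent. The following is only a sketch: it assumes Express-style middleware (the question does not say which framework is in use), and the names isFacebookCrawler, describeHit and logFacebookHits are made up for illustration.

```javascript
// Matches the crawler token from the user agent quoted in the question.
const FACEBOOK_UA = /facebookexternalhit/i;

function isFacebookCrawler(userAgent) {
  return typeof userAgent === 'string' && FACEBOOK_UA.test(userAgent);
}

// Hypothetical helper: builds one log record for a finished response.
function describeHit(req, statusCode) {
  return {
    time: new Date().toISOString(),
    status: statusCode,
    // originalUrl is set by Express; plain http requests only have url.
    url: req.originalUrl || req.url,
    range: req.headers['range'] || null,
    userAgent: req.headers['user-agent'] || null,
  };
}

// Express-style middleware (assumption): logs every crawler hit,
// whatever the status code, once the response has been sent.
function logFacebookHits(req, res, next) {
  if (isFacebookCrawler(req.headers['user-agent'])) {
    res.on('finish', () => {
      console.log(JSON.stringify(describeHit(req, res.statusCode)));
    });
  }
  next();
}
```

With Express this would be registered before the routes, for example app.use(logFacebookHits). Comparing the logged url and range fields for the 200 and 404 entries should show whether the lowercase requests follow a request for the correctly cased URL.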

The server tech here is Node.js / MongoDB, but I don't see how that is relevant to the issue at hand.

Is there something I can do to fix Facebook's behavior? Is there a problem here, or should I squelch these log errors? Has anyone else had a similar problem?

Answer 1:

It's possible that your Node "web server application" (are you using Express?) currently doesn't support byte ranges. The Facebook crawler apparently falls back to lowercasing the URL in that case, as described here:

  • https://mail.habari.co.tz/pipermail/linux/2013-June/000180.html

Have a look at

  • http://derickbailey.com/2014/04/28/check-http-byte-range-request-header-with-nodejs-and-expressjs/
  • http://www.codeproject.com/Articles/813480/HTTP-Partial-Content-In-Node-js

on how to fix this.
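As a rough illustration of what byte-range support involves, here is a minimal sketch that answers a single-range request for a response body already held in memory as a Buffer. It assumes plain Node http-style req and res objects. The names parseRange and sendWithRange are hypothetical and not taken from the linked articles, and multi-range requests are simply served in full.

```javascript
// Accepts only the single-range forms "bytes=0-499", "bytes=500-" and "bytes=-200".
const SINGLE_RANGE = /^bytes=(\d*)-(\d*)$/;

// Returns { kind: 'ignore' } when there is no usable Range header,
// { kind: 'unsatisfiable' } when the range lies outside the body,
// or { kind: 'partial', start, end } with inclusive byte offsets.
function parseRange(header, size) {
  const match = SINGLE_RANGE.exec(header || '');
  if (!match || (match[1] === '' && match[2] === '')) return { kind: 'ignore' };

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form: the last N bytes of the body.
    const suffix = parseInt(match[2], 10);
    if (suffix === 0) return { kind: 'unsatisfiable' };
    start = Math.max(size - suffix, 0);
    end = size - 1;
  } else {
    start = parseInt(match[1], 10);
    end = match[2] === '' ? size - 1 : Math.min(parseInt(match[2], 10), size - 1);
  }
  if (start >= size || start > end) return { kind: 'unsatisfiable' };
  return { kind: 'partial', start, end };
}

// Sends `body` (a Buffer) as 200, 206 or 416 depending on the Range header.
function sendWithRange(req, res, body) {
  res.setHeader('Accept-Ranges', 'bytes');
  const range = parseRange(req.headers['range'], body.length);

  if (range.kind === 'unsatisfiable') {
    res.statusCode = 416;
    res.setHeader('Content-Range', `bytes */${body.length}`);
    res.end();
  } else if (range.kind === 'partial') {
    res.statusCode = 206;
    res.setHeader('Content-Range', `bytes ${range.start}-${range.end}/${body.length}`);
    res.setHeader('Content-Length', range.end - range.start + 1);
    res.end(body.subarray(range.start, range.end + 1));
  } else {
    res.statusCode = 200;
    res.setHeader('Content-Length', body.length);
    res.end(body);
  }
}
```

Inside a route handler this would be called as sendWithRange(req, res, Buffer.from(html)) once the page has been rendered. If the site runs on Express, files served through express.static or res.sendFile should already honor Range headers, so a helper like this would only matter for dynamically generated responses.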



Tags: facebook, url