Escaped # in URLs, sitemap and handling by Google

Posted 2019-05-17 19:45

Question:

We have a large set of URLs, some of which contain a hash character. The hash does not indicate a fragment but is part of the URL path, so we escape it as %23, e.g.

http://example.com/example%231
http://example.com/another-example%232
…

Our sitemap.xml lists these URLs as follows:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/example%231</loc>
  </url>
  <url>
    <loc>http://example.com/another-example%232</loc>
  </url>
  <!-- and so on … -->
</urlset>

Now, the Google Search Console reports 404 errors for the following URLs:

http://example.com/example
http://example.com/another-example

Note that everything from the %23 onward got stripped away. I would understand this behavior if the sitemap contained e.g. http://example.com/example#1, but we’re intentionally encoding the hash (http://example.com/example%231).

Is there anything I might be misunderstanding, or are there any special rules for escaping within sitemap.xml?

Answer 1:

Google doesn't want you to use hashes in that way. It does, however, still treat them as actual fragment identifiers, e.g. when a search result links directly to individual subheadings of a Wikipedia article.

So Google probably interprets your hashes as fragment IDs, and therefore strips them from your URLs, thereby getting 404s.
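
To illustrate how that stripping could come about, here is a minimal Python sketch. It assumes, purely for illustration, a processing step that percent-decodes the URL before splitting it; nothing in this thread confirms that Google's crawler actually works this way. Parsed as written, %23 stays in the path and there is no fragment.

```python
from urllib.parse import urlsplit, unquote

url = "http://example.com/example%231"

# Parsed as written: %23 is ordinary path data, so there is no fragment.
as_written = urlsplit(url)

# Hypothetical faulty pipeline (an assumption, not documented Google behavior):
# decoding first turns %23 back into a literal '#', which then starts a fragment.
decoded_first = urlsplit(unquote(url))

print(as_written.path, repr(as_written.fragment))
print(decoded_first.path, repr(decoded_first.fragment))
```

In the second case the path collapses to /example, which matches the URLs reported as 404 in Search Console.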

XML Sitemaps follow the usual escaping rules set out in RFC 3986. There's some history around Google's deprecated use of #! URLs for Ajax crawling that may be useful background.
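
As a rough sketch of the two escaping layers involved when writing a sitemap entry: the path segment is percent-encoded first, and the finished URL is then entity-escaped for XML. The helper name sitemap_loc and its parameters are made up for this example, and the optional query string is assumed to be already URL-encoded.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

def sitemap_loc(base, segment, query=""):
    """Build a <loc> element (hypothetical helper, for illustration only)."""
    # Layer 1: percent-encode the path segment; safe="" also encodes '/' and '#'.
    url = base.rstrip("/") + "/" + quote(segment, safe="")
    if query:
        # The query is assumed to be URL-encoded by the caller.
        url += "?" + query
    # Layer 2: XML entity escaping (&, <, >) of the complete URL.
    return "<loc>" + escape(url) + "</loc>"

print(sitemap_loc("http://example.com", "example#1"))
print(sitemap_loc("http://example.com", "example#1", "a=1&b=2"))
```

The first call yields the same <loc> value as in the question, so the sitemap in the question is already escaped in the way this sketch would produce.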