I have lots of UTF-8 content that I want to insert into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The solution seems to be to:
- Convert the character string into a sequence of bytes using the UTF-8 encoding
- Convert each byte that is not an unreserved ASCII character (letter, digit, hyphen, period, underscore, or tilde) to %HH, where HH is the hexadecimal value of the byte
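To make the two steps concrete, here is a minimal Python sketch of how I understand them. The function name `percent_encode` is just something I made up for illustration, and I'm assuming the tag ends up as a single path segment, so unreserved characters pass through and everything else (including `/` and spaces) gets escaped.

```python
import string

# Unreserved characters: letters, digits, hyphen, period, underscore, tilde.
UNRESERVED = set((string.ascii_letters + string.digits + "-._~").encode("ascii"))


def percent_encode(text: str) -> str:
    """Percent-encode text for use as a single URL path segment (illustrative helper)."""
    encoded = []
    for byte in text.encode("utf-8"):        # step 1: characters -> UTF-8 bytes
        if byte in UNRESERVED:
            encoded.append(chr(byte))        # unreserved ASCII passes through unchanged
        else:
            encoded.append(f"%{byte:02X}")   # step 2: every other byte -> %HH
    return "".join(encoded)


print(percent_encode("Ź"))     # %C5%B9
print(percent_encode("café"))  # caf%C3%A9
```

In practice you probably don't need to write this yourself: `urllib.parse.quote(text, safe="")` from the standard library should give the same result for inputs like these.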
However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering if Google is still smart enough to handle searches in URLs that contain encoded data, or if I should attempt to convert those non-English characters into their closest ASCII counterparts (which might help with Latin-based languages)?
Do you know what language everything will be in? Is it all Latin-based?
If so, then I would suggest building a sort of lookup table that converts UTF-8 characters to ASCII when possible (and non-colliding). Something like that would convert Ź into Z and so on, and when there is a collision or the character doesn't exist in your lookup table, it just uses %HH.
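Here is a rough sketch of that idea in Python. Instead of building the whole table by hand, it leans on Unicode decomposition (`unicodedata.normalize`) to strip accents, plus a small hand-written table for letters that don't decompose. The names `EXTRA` and `to_url_segment` and the table contents are my own assumptions, and it does not try to detect collisions (two different tags ending up with the same ASCII form), which you would need to check against your own data.

```python
import unicodedata
from urllib.parse import quote

# Hand-picked replacements for letters that have no Unicode decomposition.
# This table is only an illustration; extend it for your own languages.
EXTRA = {"ß": "ss", "ø": "o", "Ø": "O", "ł": "l", "Ł": "L", "æ": "ae", "Æ": "AE"}


def to_url_segment(tag: str) -> str:
    """Transliterate to ASCII where possible, otherwise fall back to %HH."""
    out = []
    for ch in tag:
        if ch in EXTRA:
            out.append(EXTRA[ch])
            continue
        # Split the character into base letter + combining marks, drop the marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and base.isascii():
            out.append(quote(base, safe=""))  # ASCII result, still escaped if unsafe
        else:
            out.append(quote(ch, safe=""))    # no ASCII counterpart: %HH fallback
    return "".join(out)


print(to_url_segment("Źródło"))  # Zrodlo
print(to_url_segment("Գ"))       # %D4%B3
```

Whether the transliterated form is acceptable to your readers is a separate question; for non-Latin scripts everything falls through to the %HH branch anyway.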
Firstly, search engines really don't care about the URLs. URLs help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam. No major search engine wants that. The allinurl: operator is merely a feature of Google to help advanced users, not something that gets factored into organic rankings. Any benefits you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site, and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central
I personally don't believe it matters all that much, short of getting a little more click-through and helping users out. As far as Unicode goes, you misunderstand how this works: the request goes to the percent-encoded (hex-encoded UTF-8) destination, but the rendering engine must know how to handle this if it wishes to decode it back to something visually appealing. Google will render (i.e. decode) Unicode (encoded) URLs properly.
Some browsers make this slightly more complex by always showing the hostname portion in its encoded (Punycode) form, because of phishing attacks that use look-alike characters from other scripts.
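For what it's worth, the hostname is encoded differently from the path: it uses the IDNA/Punycode scheme rather than %HH. A small sketch with Python's built-in `idna` codec, where `hostname_to_ascii` is a helper name I made up and `bücher.example` is just a placeholder domain:

```python
def hostname_to_ascii(hostname: str) -> str:
    """Convert an internationalized hostname to its ASCII (Punycode) form."""
    # Python's built-in "idna" codec encodes each label separately.
    return hostname.encode("idna").decode("ascii")


print(hostname_to_ascii("bücher.example"))  # xn--bcher-kva.example
```

That `xn--` form is what a cautious browser may show in the address bar instead of the decoded name.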
I wanted to show you an example of this; here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
As you can see, wget, like every browser, will just URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.
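If you want to reproduce the same transformation yourself, here is a minimal Python sketch of roughly what a client does before sending the request. `encode_url_path` is a name I made up, and I'm assuming only the path needs escaping (the hostname here is already ASCII).

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode the path of a URL, leaving scheme and host untouched."""
    parts = urlsplit(url)
    # quote() encodes as UTF-8 and keeps "/" by default, so the path structure survives.
    return urlunsplit(parts._replace(path=quote(parts.path)))


readable = "http://hy.wikipedia.org/wiki/Գլխավոր_Էջ"
on_the_wire = encode_url_path(readable)

print(on_the_wire)                       # pure ASCII, this is what gets requested
print(unquote(on_the_wire) == readable)  # True: decoding restores the readable form
```

The encoded string is what actually goes over the wire; the readable one is what a browser or search result page may choose to display.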