I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in the URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
The solution seems to be to:
- Convert the character string into a sequence of bytes using the UTF-8 encoding
- Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte
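The two steps above can be sketched in Python roughly as follows. This is only an illustration: `percent_encode` and `UNRESERVED` are names made up for the example, and it leaves all of the unreserved characters from the quoted definition untouched (not just letters and digits).

```python
from urllib.parse import quote

# Unreserved characters: ASCII letters, digits, hyphen, period, underscore, tilde.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def percent_encode(text):
    """Percent-encode text for use as a single URI path segment."""
    out = []
    for byte in text.encode("utf-8"):           # step 1: characters -> UTF-8 bytes
        if byte in UNRESERVED:
            out.append(chr(byte))               # unreserved bytes pass through
        else:
            out.append("%{:02X}".format(byte))  # step 2: everything else -> %HH
    return "".join(out)

tag = "Գլխավոր_Էջ"
print(percent_encode(tag))
# The standard library gives the same result when no extra characters are marked safe.
assert percent_encode(tag) == quote(tag, safe="")
```

In practice `urllib.parse.quote` already does this, so the hand-written loop is only there to make the two steps visible.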
However, that converts the legible (and SEO-valuable) words into mumbo-jumbo. So I'm wondering whether Google is smart enough to handle searches in URLs that contain encoded data, or whether I should attempt to convert those non-English characters into their semi-ASCII counterparts (which might help with Latin-based languages)?
Firstly, search engines really don't care about URLs. URLs help visitors: visitors link to sites, and search engines care about those links. URLs are easy to spam; if search engines cared about them, there would be an incentive to spam, and no major search engine wants that. The allinurl: operator is merely a Google feature to help advanced users, not something that gets factored into organic rankings. Any benefit you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site, and there is some evidence this can be negative too, with the advent of negative PR.
From Google Webmaster Central:
Does that mean I should avoid rewriting dynamic URLs at all?
That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases. If you want to serve a static equivalent of your site, you might want to consider transforming the underlying content by serving a replacement which is truly static. One example would be to generate files for all the paths and make them accessible somewhere on your site. However, if you're using URL rewriting (rather than making a copy of the content) to produce static-looking URLs from a dynamic site, you could be doing harm rather than good. Feel free to serve us your standard dynamic URL and we will automatically find the parameters which are unnecessary.
I personally don't believe it matters all that much, short of getting a little more click-through and helping users out. As for Unicode, you are misunderstanding how this works: the request always goes to the percent-encoded (hex) destination, and whatever displays the URL must know how to decode it back into something visually appealing. Google will render (that is, decode) percent-encoded Unicode URLs properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
To show you an example of this, here is a request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
[Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget, like any browser, will just URL-encode the destination for you and then continue the request to the URL-encoded destination. The URL-decoded form only exists as a visual convenience.
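The same round trip can be reproduced with Python's standard library. This is a minimal sketch: the path string is copied from the wget request, and the variable names are just for illustration.

```python
from urllib.parse import quote, unquote

# Path exactly as wget sent it on the wire (copied from the request line).
wire_path = "/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB"

# Decoding is purely for display: percent-decode, then read the bytes as UTF-8.
display_path = unquote(wire_path)
print(display_path)

# Re-encoding the readable form yields the same string that goes over the wire.
assert quote(display_path) == wire_path
```

Either form identifies the same resource; only the encoded one is ever sent in the request line.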
Do you know what language everything will be in? Is it all Latin-based?
If so, then I would suggest building a lookup table that converts UTF-8 characters to ASCII when possible (and non-colliding). It would convert Ź into Z and so on; when there is a collision, or the character doesn't exist in your lookup table, it just uses %HH.
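A rough sketch of that idea in Python, under a few assumptions: instead of a fully hand-written table it leans on Unicode NFKD decomposition to strip accents, `EXTRA` is a small hypothetical table for letters that do not decompose, and `slugify_tag` is a made-up name. Collision detection (two different tags folding to the same ASCII string) is not handled here and would have to be checked against your own tag set.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical hand-maintained table for letters with no Unicode decomposition.
EXTRA = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "ß": "ss", "æ": "ae", "Æ": "AE"}

def ascii_fold_char(ch):
    """Return an ASCII replacement for ch, or None if none is known."""
    if ch in EXTRA:
        return EXTRA[ch]
    # NFKD splits e.g. "Ź" into "Z" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped and stripped.isascii():
        return stripped
    return None

def slugify_tag(tag):
    """Fold characters to ASCII where possible, percent-encode the rest."""
    out = []
    for ch in tag:
        folded = ch if ch.isascii() else ascii_fold_char(ch)
        # quote() leaves unreserved ASCII alone and turns anything else into %HH.
        out.append(quote(folded if folded is not None else ch, safe=""))
    return "".join(out)

print(slugify_tag("Źródło"))
print(slugify_tag("Էջ"))
```

Latin-based tags come out readable, while scripts with no ASCII counterpart fall back to the %HH form.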