How should I format URLs with special/international characters?
Currently I try to make URLs "look good", so that:
www.myhost.com/this is a test, do you know how?
is converted to:
www.myhost.com/this_is_a_test_do_you_know_how
I know some international letters could be converted (ü = ue, æ = ae, å = aa), and some characters could be removed. In general I try to make the URL look "good", but is that stupid?
But what do I do with Chinese, Japanese, or Arabic letters that have nothing to do with our western ASCII format?
I really don't like the idea of rewriting the URL with hex codes, so right now I just use my internal unique ID if the URL contains too many "non-convertible" characters.
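For illustration, the approach described above could be sketched roughly as follows. This is only a minimal sketch: the function name make_slug, the TRANSLIT table, and the 30% threshold for "too many" unconvertible letters are assumptions made up for this example, not part of any existing code.

```python
import re

# Hypothetical replacement table, taken from the examples in the question.
TRANSLIT = {"ü": "ue", "æ": "ae", "å": "aa"}

def make_slug(title, internal_id, max_lost_ratio=0.3):
    """Return an ASCII slug, or the internal ID if too many letters are lost."""
    kept = []
    letters = 0  # number of alphabetic characters in the title
    lost = 0     # alphabetic characters we could not convert to ASCII
    for ch in title:
        if ch.isalpha():
            letters += 1
        if ch.lower() in TRANSLIT:
            repl = TRANSLIT[ch.lower()]
            kept.append(repl.capitalize() if ch.isupper() else repl)
        elif ch.isascii():
            kept.append(ch)
        elif ch.isalpha():
            lost += 1
    # Assumed rule: fall back to the ID when too many letters were dropped.
    if letters and lost / letters > max_lost_ratio:
        return str(internal_id)
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", "".join(kept)).strip("_")
    return slug or str(internal_id)

print(make_slug("this is a test, do you know how?", 42))
print(make_slug("Zürich på tur", 43))
print(make_slug("日本語のページ", 1234))
```

With these assumptions, the first title becomes this_is_a_test_do_you_know_how, the second becomes Zuerich_paa_tur, and the Japanese title falls back to the ID 1234.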
if you're using .NET with not
But if you want to use Scandinavian characters, or any other characters, you just need to set up the rule in your URL rewriting component. The DynamicWeb CMS software, for example, uses all available characters and only replaces spaces with underscores ('_'),
like this url:
You can see the æ in the domain as well as the ø in the page name.
But doesn't Google take advantage of the URL? If some of the text from a given article is in the URL, won't Google's search engine use it? And if there really is no good way of handling non-ASCII letters, are those languages then given lower priority on the "Google internet"?
Have a look at, say, http://ja.wikipedia.org/. If you mouse over the links, they show up in the status bar as Japanese characters. They don't look so Japanese in the location bar when you follow the link, but that possibly can't be helped. I haven't checked, but I assume it's all hex-encoded UTF-8.
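To illustrate what the status bar is presumably doing, here is a small Python sketch that turns a percent-encoded path back into readable text. The URL is a made-up example in the style of a Japanese Wikipedia link, and readable_path is a hypothetical helper name.

```python
from urllib.parse import urlsplit, unquote

def readable_path(url):
    """Decode the percent-encoded path of a URL, assuming UTF-8 bytes."""
    return unquote(urlsplit(url).path, encoding="utf-8")

# Example URL made up for illustration; the path is the UTF-8 encoding of 日本語.
url = "http://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E"
print(readable_path(url))
```

The bytes on the wire stay plain ASCII; only the display is decoded.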
What language are you using? PHP includes a function filter_var() that seems to do most of what you want. See http://us.php.net/manual/en/function.filter-var.php.
In general, the cost of making human-readable ASCII strings from arbitrary string input is probably too great to be worth it. If the user gives you a Chinese hanzi, what are you going to do? Look it up in a dictionary and output the result in pinyin?
The best, most general solution is simply to take the input, format it as UTF-8, then url-encode the result. This will make non-Latin text unreadable, but there is no good, general solution for those languages anyway. The language you're using almost certainly has library functions that can make this easy.
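As a rough sketch of that suggestion in Python (chosen only for illustration; most languages have an equivalent), urllib.parse.quote encodes the text as UTF-8 and percent-encodes the resulting bytes. The helper name encode_path_segment is an assumption for this example.

```python
from urllib.parse import quote

def encode_path_segment(text):
    """UTF-8 encode the text, then percent-encode it for use in a URL path."""
    # safe="" also encodes "/", so the result is always a single path segment.
    return quote(text, safe="", encoding="utf-8")

print(encode_path_segment("this is a test"))
print(encode_path_segment("日本語"))
```

The first call gives this%20is%20a%20test and the second gives %E6%97%A5%E6%9C%AC%E8%AA%9E, which is unreadable but unambiguous and reversible.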