So I have lots of users posting articles with names in different languages. I need a library to transliterate those article names into English letters, for example turning Russian 'р' into English 'r', and so on for all European languages, Russian, and Asian languages. Where can I get such a library?
45 seconds of Google gave me this: "This extension allows you to transliterate text in non-latin characters (such as Chinese, Cyrillic, Greek etc) to latin characters." It seems to be what I really need. Has anyone tried this in real life?
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.
Will iconv do?
From PHP manual:
If that won't do, check out these
As an alternative, define the character map in an array and use str_replace or mb_substitute_character to do the conversion.
I am not a linguist, far from it, but I submit to you the possibility that what you are trying to do is impossible, or extremely complex to implement.
After all, transliterating names is more than just "converting alphabets." It is comparatively easy in Russian because every Cyrillic character actually has a Latin counterpart (they are sister alphabets).
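For a script like Cyrillic, the character-map idea can be sketched roughly as follows. This is only an illustration: the map is deliberately partial, the function name `transliterateWithMap` is made up, and it uses `strtr()` rather than `str_replace()` because `strtr()` with an array does not rescan text it has already replaced.

```php
<?php
// Hypothetical, deliberately partial map: just enough letters for the demo.
// A real map would have to cover the whole alphabet, in both cases.
$cyrillicToLatin = [
    'М' => 'M', 'о' => 'o', 'с' => 's', 'к' => 'k', 'в' => 'v', 'а' => 'a',
    'р' => 'r', 'ж' => 'zh', 'ч' => 'ch', 'ш' => 'sh',
];

// strtr() with an array tries the longest keys first and never
// re-replaces its own output, so 'zh' is not touched again later.
function transliterateWithMap(string $text, array $map): string
{
    return strtr($text, $map);
}

echo transliterateWithMap('Москва', $cyrillicToLatin), "\n"; // Moskva
```

Characters missing from the map pass through unchanged, so gaps in the map show up as leftover non-Latin letters in the output.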
I don't know about Arabic, but for Chinese you will need a romanization system like Pinyin to get anywhere. It's more complex than simply replacing characters.
Here's a full list of ISO romanizations. If I understand correctly, a solution that works for you would have to implement those rules.
So the task would be:
Analyze a text containing numerous different character ranges
Identify which character range every word belongs to (อักษรไทย is Thai; Москва is Cyrillic; and so on)
Apply the correct method of romanization to every word.
Now I'm very interested to hear about any libraries that can do this in PHP, but it is quite possible that there are none.
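The first two steps of that task (splitting the text and identifying the script of each word) can be sketched with PCRE's Unicode script properties. This is a minimal sketch under assumptions: `detectScript` and `tagWords` are made-up names, the script list is far from complete, and splitting on whitespace will not separate words in languages such as Chinese or Thai that are written without spaces.

```php
<?php
// Hypothetical helper: returns the first script from a short list
// that the word contains at least one character of, or 'Unknown'.
function detectScript(string $word): string
{
    $scripts = ['Cyrillic', 'Greek', 'Thai', 'Han', 'Arabic', 'Hebrew', 'Latin'];
    foreach ($scripts as $script) {
        // \p{Name} with the /u flag matches any character of that script.
        if (preg_match('/\p{' . $script . '}/u', $word) === 1) {
            return $script;
        }
    }
    return 'Unknown';
}

// Split on whitespace and pair every word with its detected script.
function tagWords(string $text): array
{
    $tagged = [];
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $tagged[] = [$word, detectScript($word)];
    }
    return $tagged;
}
```

The third step, picking a romanization method per script, would then dispatch on the second element of each pair.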
In PHP 5.4, the Intl extension introduces a Transliterator class, which is a wrapper around ICU. The following library has the full ISO rule set:
http://www.php.net/manual/en/transliterator.transliterate.php
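A minimal sketch of how that could look, assuming the intl extension is installed. The wrapper name `toAsciiLetters` is made up, and the rule string `Any-Latin; Latin-ASCII` is my choice of ICU transform (romanize whatever script is recognized, then strip accents), not something the manual page prescribes.

```php
<?php
// Hypothetical wrapper around the intl transliterator.
function toAsciiLetters(string $text): string
{
    if (!function_exists('transliterator_transliterate')) {
        throw new RuntimeException('The intl extension is not available');
    }
    // "Any-Latin" converts recognized scripts to Latin letters;
    // "Latin-ASCII" then reduces accented Latin letters to plain ASCII.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed');
    }
    return $result;
}
```

Calling `toAsciiLetters('Москва')` should give a plain Latin spelling, though the exact romanization depends on the ICU version that PHP was built against.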
Google has an AJAX transliteration API which does a good job on many major scripts.
Edit: Damn, it appears on further inspection that this only allows conversions from the Latin alphabet. It's kind of silly that Google hasn't made the reverse functionality available, since they're already using it in Google Translate to provide romanisations for Cyrillic, Chinese, Thai, Hindi, and others, though notably not abjads such as Hebrew and Arabic.
Further Edit: I thought of a possible workaround: detect the language and use an AJAX query to run it through Google Translate using the same source language as destination language, e.g. Chinese-to-Chinese. Firebug reveals that the transliteration is output in a div whose ID is translit. Transliterations are typically heavily accented, so you'll need to convert them. This is by no means something to rely on (though Google typically doesn't make frequent structural changes to their HTML), but it is certainly an interesting possibility.
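For the "convert them" part, one possible way to strip the accents from a romanization is Unicode decomposition. This is a sketch only: `stripAccents` is a made-up name and it assumes the intl extension's Normalizer class is available.

```php
<?php
// Hypothetical helper: removes diacritics from already-romanized text.
function stripAccents(string $text): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is not available');
    }
    // NFD splits each accented letter into a base letter plus combining marks.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        throw new RuntimeException('Input is not valid UTF-8');
    }
    // Drop the combining marks (Unicode category Mn), keep the base letters.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $text;
}
```

For example, `stripAccents('Běijīng')` returns `Beijing`. Letters that have no decomposition, such as 'ø' or 'ß', are left as they are and would need a separate map.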