I am using PHP and I was wondering if there is a predefined way to convert accented (foreign) characters to their unaccented alternatives.
Characters such as ê, ë, é should all result in 'e'.
I'm looking for a function that would take a string and return it without the special characters.
Any ideas would be greatly appreciated!
Try iconv() (http://www.php.net/manual/en/function.iconv.php) with the //TRANSLIT option, or recode_string() (http://www.php.net/manual/en/function.recode-string.php), or mb_convert_encoding() (http://www.php.net/manual/en/function.mb-convert-encoding.php).
After failing to find suitable converters, I created my own collection that suits my needs, including my favorite Cyrillic conversion, which by default has numerous variations.
My first recommendation is the iconv function, mainly because it's built into PHP and so doesn't require any external or third-party libraries. In addition, it's designed to do precisely what you are trying to accomplish: accept one character set as input and output another, specifically going from UTF-8 to ASCII. Below is an example of how to call this function:
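The original snippet did not survive here, so what follows is only a minimal sketch of such a call, not the answerer's code. The helper name `to_ascii_sketch` and the locale names are assumptions. The result of //TRANSLIT depends on the iconv implementation and the active locale (some builds emit `?` or fail instead of producing `e`), so no expected output is shown.

```php
<?php
// Hypothetical helper (name is illustrative, not from the original answer).
function to_ascii_sketch(string $text): ?string
{
    // glibc's iconv chooses transliterations based on LC_CTYPE, so a UTF-8
    // locale is requested first. The locale names are assumptions and may
    // not be installed on every system; setlocale() tries them in order.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

    // //TRANSLIT asks iconv to approximate characters that ASCII lacks.
    // The @ suppresses the notice iconv raises when it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // iconv() returns false on failure; report that as null.
    return $ascii === false ? null : $ascii;
}

var_dump(to_ascii_sketch('ê ë é'));
```

If the output contains question marks, the locale is the usual suspect; checking the return value of setlocale() is a reasonable first debugging step.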
More information about the specifics of this PHP function can be found here: http://php.net/manual/en/function.iconv.php
Note: iconv() accepts a single string, so if your data is an array or another collection, iterate over it and pass each string to the function individually.
I coded this function, which uses the HTML entities translation table built into PHP to romanize characters:
It works by applying htmlentities() and then removing common entity suffixes. A simple example: é is encoded as &eacute;, and stripping the "acute" suffix (along with the & and ;) leaves e.
Beware that for this to work properly your files need to be encoded in UTF-8 (without a BOM).
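The function itself is missing from this copy of the answer. The htmlentities() idea described above could be sketched roughly as follows; the function name `unaccent_sketch` and the exact list of suffixes are assumptions, not the answerer's code.

```php
<?php
// Hypothetical sketch of the htmlentities() technique; not the original function.
function unaccent_sketch(string $text): string
{
    // Step 1: turn accented characters into named entities (é -> &eacute;).
    $entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

    // Step 2: keep the base letter(s) and drop the accent suffix.
    // The suffix list is illustrative and only covers common HTML 4 entities.
    $stripped = preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml|caron);~i',
        '$1',
        $entities
    );

    // Step 3: decode whatever entities remain (&amp;, &quot;, ...) back to text.
    return html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');
}

echo unaccent_sketch('Crème brûlée'), "\n"; // prints "Creme brulee"
```

Characters without a matching named entity (most of Latin Extended-A, Cyrillic, and so on) pass through unchanged, which is the main limit of this approach.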
See also my other answer for another example.
I hope this will be useful for anybody: https://github.com/infralabs/DiacriticsRemovePHP
This class removes diacritics from strings containing Latin-1 Supplement, Latin Extended-A and Latin Extended-B special characters.
usage:
source:
result:
The most generic way to solve this is to use Unicode normalization, as it works automatically on all accents, so you don't have to prepare a list up front. I don't know if it's easily available in PHP; I have used it in C and Java. Essentially, you first transform the string so that every accented character is represented by a regular character plus a so-called combining diacritical mark (a built-in or external library should provide this function), and then remove the combining marks (using a specialized library, the character properties the language provides, or regular expression extensions).
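In PHP, the two steps above can be sketched with the Normalizer class from the intl extension plus a Unicode-aware regular expression. This assumes intl is installed; the function name `strip_diacritics_sketch` is hypothetical.

```php
<?php
// Hypothetical sketch; requires the intl extension for the Normalizer class.
function strip_diacritics_sketch(string $text): ?string
{
    if (!class_exists('Normalizer')) {
        return null; // intl is not available, so nothing can be done here
    }

    // Step 1: canonical decomposition (NFD): "é" becomes "e" + U+0301.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return null; // input was not valid UTF-8
    }

    // Step 2: remove nonspacing marks (\p{Mn}), which covers the combining
    // accents; the /u flag makes PCRE treat the subject as UTF-8.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

var_dump(strip_diacritics_sketch('Crème brûlée'));
```

Letters that have no canonical decomposition, such as ø or ß, are left untouched by this method and would still need an explicit mapping.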