There are heaps of questions about this on this forum and on the web in general, but I just don't get it.
Here is my code:
function updateGuideKeywords($dal)
{
$pattern = "/[^a-zA-Z-êàé]/";
$keywords = preg_replace($pattern, '', $_POST['keywords']);
echo json_encode($keywords);
}
Now, the input is Prêt-à-porter, and the output is "Pr\u00eat-\u00e0-porter".
Why do I get the '\u00e'?
And how can I alter my pattern to include the characters ê, à and é?
EDIT
Hmm... since it looks like a Unicode / character encoding issue, I might go for the solution I found on this page.
Here they suggest doing something like this:
$chain="prêt-à-porter";
$pattern = array("'é'", "'è'", "'ë'", "'ê'", "'É'", "'È'", "'Ë'", "'Ê'", "'á'", "'à'", "'ä'", "'â'", "'å'", "'Á'", "'À'", "'Ä'", "'Â'", "'Å'", "'ó'", "'ò'", "'ö'", "'ô'", "'Ó'", "'Ò'", "'Ö'", "'Ô'", "'í'", "'ì'", "'ï'", "'î'", "'Í'", "'Ì'", "'Ï'", "'Î'", "'ú'", "'ù'", "'ü'", "'û'", "'Ú'", "'Ù'", "'Ü'", "'Û'", "'ý'", "'ÿ'", "'Ý'", "'ø'", "'Ø'", "'œ'", "'Œ'", "'Æ'", "'ç'", "'Ç'");
$replace = array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'y', 'y', 'Y', 'o', 'O', 'oe', 'OE', 'AE', 'c', 'C');
$chain = preg_replace($pattern, $replace, $chain);
EDIT 2
This is my solution so far:
function updateGuideKeywords()
{
//First we replace characters with accents
$pattern = array("'é'", "'è'", "'ë'", "'ê'", "'É'", "'È'", "'Ë'", "'Ê'", "'á'", "'à'", "'ä'", "'â'", "'å'", "'Á'", "'À'", "'Ä'", "'Â'", "'Å'", "'ó'", "'ò'", "'ö'", "'ô'", "'Ó'", "'Ò'", "'Ö'", "'Ô'", "'í'", "'ì'", "'ï'", "'î'", "'Í'", "'Ì'", "'Ï'", "'Î'", "'ú'", "'ù'", "'ü'", "'û'", "'Ú'", "'Ù'", "'Ü'", "'Û'", "'ý'", "'ÿ'", "'Ý'", "'ø'", "'Ø'", "'œ'", "'Œ'", "'Æ'", "'ç'", "'Ç'");
$replace = array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'y', 'y', 'Y', 'o', 'O', 'oe', 'OE', 'AE', 'c', 'C');
$shguideID = $_POST['shguideID'];
$keywords = preg_replace($pattern, $replace, $_POST['keywords']);
//Then we remove unwanted characters by only allowing a-z, A-Z, comma, 'minus' and white space
$keywords = preg_replace("/[^a-zA-Z-,\s]/", "", $keywords);
echo json_encode($keywords);
}
Your code, with the latest edits so far, works this way:
The expression /[^a-zA-Z-êàé]/ means "match anything that's not an English letter, a minus sign, ê, à or é".
preg_replace($pattern, '', 'Prêt-à-porter') returns 'Prêt-à-porter' since nothing matches.
json_encode() returns the JSON representation of 'Prêt-à-porter', which is "Pr\u00eat-\u00e0-porter".
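To make the json_encode() step concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8 and, for the second call, PHP 5.4 or later (where the JSON_UNESCAPED_UNICODE flag exists). The variable names are only for illustration.

```php
<?php
// Assumes this source file is saved as UTF-8.
$keywords = 'Prêt-à-porter';

// Default behaviour: non-ASCII characters are escaped as \uXXXX.
$escaped = json_encode($keywords);
echo $escaped, "\n";   // "Pr\u00eat-\u00e0-porter"

// JSON_UNESCAPED_UNICODE (PHP 5.4+) keeps the characters as they are.
$unescaped = json_encode($keywords, JSON_UNESCAPED_UNICODE);
echo $unescaped, "\n"; // "Prêt-à-porter"

// Both forms decode back to the same string.
var_dump(json_decode($escaped) === $keywords);   // bool(true)
var_dump(json_decode($unescaped) === $keywords); // bool(true)
```

Either output is valid JSON, so a JavaScript client that parses the response sees the same string in both cases.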
It's not clear to me what's your exact goal. If you want to remove anything that's not a minus or letter you can try this pattern:
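One pattern along those lines, as a minimal sketch rather than a definitive answer: it assumes the input is UTF-8 and uses the u modifier together with \p{L} (any Unicode letter), which is broader than just ê, à and é. The helper name stripNonLetters is hypothetical.

```php
<?php
// Hypothetical helper: keep only Unicode letters and the minus sign.
// Assumes $input is valid UTF-8. With the u modifier, preg_replace()
// returns null on invalid UTF-8; that case is mapped to ''.
function stripNonLetters(string $input): string
{
    $result = preg_replace('/[^\p{L}-]/u', '', $input);
    return $result === null ? '' : $result;
}

echo stripNonLetters('Prêt-à-porter, 2024!'), "\n"; // Prêt-à-porter
```

If only a fixed set of accented letters should survive, the class can list them explicitly instead of \p{L}, still with the u modifier so each one is treated as a single character.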
This may not be 100% accurate, but looking at the regex you're using, I don't think preg_replace() is the issue. I think the reason you are getting '\u00e' is due to PHP's poor support of character encodings.
You could also use mb_ereg_replace() to work with multibyte characters in your string.
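A rough sketch of what that could look like, assuming the mbstring extension (with its regex support) is installed and the input is UTF-8. The sample input string is made up for illustration.

```php
<?php
// Requires the mbstring extension; assumes UTF-8 input and source file.
mb_regex_encoding('UTF-8');

$input = 'Prêt-à-porter & co.';

// mb_ereg_replace() takes the pattern without delimiters.
$clean = mb_ereg_replace('[^a-zA-Zêàé-]', '', $input);

echo $clean, "\n"; // Prêt-à-porterco
```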
"Pr\u00eat-\u00e0-porter"
is a correct JavaScript string literal representation of Prêt-à-porter. I assume you're doing a json_encode at some point along the line?
Note also that PHP's regular expressions are not Unicode-aware by default (that is, without the u modifier), so if you are using UTF-8 (which generally you want to be), the character ê is not a single character, but byte C3 followed by byte AA. That's fine for simple literal matches, but in situations like a character class you're now matching the two bytes separately instead of one after the other, which can easily mess up your expression.
If you want to replace 'é' with 'e', etc., use iconv() with the //TRANSLIT modifier
e.g.,
A more complete example:
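A minimal sketch of the iconv() approach, assuming the input is UTF-8 and the iconv extension is available. The helper name toAscii is hypothetical, and the locale name is an assumption that may not be installed on your system. The exact transliteration depends on the platform's iconv implementation and the current locale, so some systems produce '?' or other substitutes rather than plain letters.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII where possible.
function toAscii(string $input): string
{
    // With glibc, transliteration depends on LC_CTYPE. The locale name
    // below is an assumption; note that this changes it process-wide.
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    $result = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

    // iconv() returns false on failure; fall back to the original input.
    return $result === false ? $input : $result;
}

echo toAscii('Prêt-à-porter'), "\n";
```

After this step, a plain ASCII pattern such as /[^a-zA-Z-]/ is enough to strip whatever is left.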
From what I see of your output, your characters are not removed (so they are covered by your pattern); the only thing is that the output is written as Unicode escape sequences. Try changing your document to UTF-8 or encoding HTML entities and it should work, but beware: if you encode entities before replacing, the pattern won't detect the characters, as they will already be converted.