We have this code:
$value = preg_replace("/[^\w]/", '', $value);
where $value is in UTF-8. After this transformation, the first byte of each multibyte character is stripped. How can I make \w cover UTF-8 characters completely?
Sorry, I am not very good with PHP.
Try this function instead: http://php.net/manual/en/function.mb-ereg-replace.php
There is this nasty u modifier for PCRE patterns in PHP. The manual says it means the regex is encoded in UTF-8, but I found that it treats the input as UTF-8, too.
You could try the /u modifier, i.e. the pattern "/[^\w]/u".
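As a minimal sketch of the /u suggestion, the question's one-liner could be wrapped like this. The function name strip_non_word_chars is made up for illustration, and the sketch assumes a reasonably recent PHP whose PCRE build has Unicode support, so that \w also matches non-ASCII letters under /u.

```php
<?php
// Hypothetical helper (not from the thread): strips everything that is not
// a word character, treating pattern and subject as UTF-8.
function strip_non_word_chars(string $value): string
{
    // The trailing "u" switches PCRE into UTF-8 mode, so multibyte
    // characters are matched as whole characters instead of single bytes.
    $result = preg_replace('/[^\w]/u', '', $value);

    // preg_replace() returns null on error, e.g. when $value is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed, error code ' . preg_last_error()
        );
    }

    return $result;
}

echo strip_non_word_chars('héllo wörld!'), "\n";
```

Throwing on a null result is a design choice for the sketch: silently returning an empty string would hide malformed input.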
If that won't do, try mb_ereg_replace (replace regular expression with multibyte support) instead.
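A rough sketch of the mb_ereg_replace route might look like the following. It assumes the mbstring extension is loaded and that the regex encoding is set to UTF-8. The helper name mb_strip_non_word is hypothetical. Note that mb_ereg_replace takes the pattern without delimiters.

```php
<?php
// Assumption: the mbstring extension is available.
// Tell the mb_ereg_* functions to interpret pattern and subject as UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical helper (not from the thread).
function mb_strip_non_word(string $value): string
{
    // No "/" delimiters here, unlike the preg_* functions.
    $result = mb_ereg_replace('[^\w]', '', $value);

    // mb_ereg_replace() returns false or null on failure.
    if (!is_string($result)) {
        throw new InvalidArgumentException('mb_ereg_replace failed');
    }

    return $result;
}

echo mb_strip_non_word('héllo wörld!'), "\n";
```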
Use [^\w]+ instead of [^\w].
You can also use \W in place of [^\w].
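To illustrate that the two spellings are interchangeable, here is a small comparison. The sample string is invented. Combining the patterns with the u modifier is an assumption carried over from the other answers, since \W on its own does not address the multibyte problem.

```php
<?php
// Made-up sample input containing multibyte UTF-8 characters.
$value = 'naïve café, déjà vu!';

// Original character-class form, one match per removed character.
$a = preg_replace('/[^\w]/u', '', $value);

// Shorthand \W with "+", so each run of non-word characters is one match.
$b = preg_replace('/\W+/u', '', $value);

var_dump($a === $b);
echo $b, "\n";
```

The + quantifier does not change the result here. It only lets the engine remove each run of unwanted characters in a single replacement.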
Append u to the regex, i.e. "/[^\w]/u", to turn on the multibyte Unicode mode of PCRE.
Corollary
In Unicode mode, PCRE expects both the pattern and the subject to be valid UTF-8, and if they are not, there will be problems meeting deadlines (preg_replace returns NULL on malformed input). Therefore, to convert anything to UTF-8 (and drop any unconvertible junk), we first use:
to clean and prep the input.
Because any byte sequence can be interpreted as ISO-8859-1 (even if some obscure characters then appear incorrectly), and since most web browsers default to 8859 (unless told to use UTF-8), we've found this function to be a general, safe, effective method to 'take anything, drop any junk, and convert into UTF-8'.
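One possible way to do such a clean-up step is sketched below. It is a stand-in, not necessarily the call the answer used: the helper name to_utf8, the choice of mb_check_encoding and mb_convert_encoding, and the decision to leave already valid UTF-8 untouched are all assumptions, and the mbstring extension is required.

```php
<?php
// Hypothetical helper (not from the thread): returns $input as valid UTF-8.
function to_utf8(string $input): string
{
    // Already valid UTF-8: leave it alone to avoid double encoding.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Otherwise treat the bytes as ISO-8859-1, where every byte value
    // maps to some character, and convert them to UTF-8.
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

// "caf\xE9!" is "café!" in ISO-8859-1 and is not valid UTF-8.
$clean = preg_replace('/[^\w]/u', '', to_utf8("caf\xE9!"));
var_dump($clean === "caf\xC3\xA9");
```

Checking validity first is a design choice: converting a string that is already UTF-8 from ISO-8859-1 would turn each multibyte character into several wrong ones.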
Note: it is the POSIX ereg_* functions that are deprecated as of PHP 5.3.0, not mb_ereg_*. Even so, preg_replace with the u modifier is usually the simpler way to go.