UTF-8 in PHP regular expressions [duplicate]

Posted 2019-01-06 20:10

Question:

This question already has an answer here:

  • Matching Unicode letter characters in PCRE/PHP 3 answers

I need help with regular expressions. My string contains Unicode characters, and the code below doesn't work.

The first four characters must be digits, followed by a comma and then any alphabetic characters or whitespace. I have read that adding /u at the end of the regular expression should help, but it didn't work for me.

My code works with non-Unicode characters:

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+', $post);

Thanks for your answers!

Answer 1:

Updated answer:
This is now tested and working

$post = '9999, škofja loka';
echo preg_match('/^\\d{4},[\\s\\p{L}]+$/u', $post);

\w is not a good fit here: depending on the mode it may not cover all Unicode letters, and it also matches digits and the underscore ([0-9_]) in addition to letters.

The u modifier is also important: it activates Unicode mode.
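One practical consequence of Unicode mode, sketched here under the assumption that the subject might not be valid UTF-8: with the u modifier, preg_match returns false (not 0) for such input, so a strict comparison helps tell "no match" apart from "error". The broken sample string is made up for illustration.

```php
<?php
$pattern = '/^\d{4},[\s\p{L}]+$/u';

// Hypothetical input: "\xC5" is a UTF-8 lead byte with no continuation byte.
$broken = "9999, \xC5";

$result = preg_match($pattern, $broken);

var_dump($result);                                   // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Checking `=== 1` rather than relying on truthiness keeps an encoding problem from being silently treated as a plain mismatch.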

If letters and whitespace can appear in any order after the comma, put both into the same character class. Your regex allows zero or more whitespace characters after the comma, followed by letters only.

See http://www.regular-expressions.info/php.html for php regex details

The \p{L} (Unicode letter) is explained here

The end-of-string anchor $ is also important: it ensures that the complete string is checked. Without it, the match succeeds as soon as a single character after the comma fits (for example, just the first whitespace), whatever follows.
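A minimal sketch of how the anchored pattern from this answer behaves on a few inputs. The helper name isValidPost and every sample string except the original one are made up for illustration, and the file is assumed to be saved as UTF-8. Single backslashes are used here; in a single-quoted PHP string '\d' and '\\d' produce the same pattern.

```php
<?php
// Hypothetical helper: true only if the whole string is four digits,
// a comma, then one or more Unicode letters or whitespace characters.
function isValidPost(string $post): bool
{
    // u     : treat pattern and subject as UTF-8
    // ^...$ : anchor the match to the whole string
    return preg_match('/^\d{4},[\s\p{L}]+$/u', $post) === 1;
}

var_dump(isValidPost('9999, škofja loka')); // bool(true)
var_dump(isValidPost('9999,škofja loka'));  // bool(true)
var_dump(isValidPost('9999, škofja 123'));  // bool(false): digits after the letters
var_dump(isValidPost('999, loka'));         // bool(false): only three leading digits
```

Without the $ anchor, the third input would be accepted, because the pattern would be satisfied by the text before "123".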



Answer 2:

[a-zA-Z] will match only letters in the range of a-z and A-Z. You have non-US-ASCII letters, and therefore your regex won't match, regardless of the /u modifier. You need to use the word character escape sequence (\w).

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[\w]+/u', $post);
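One thing to keep in mind with \w is that it is broader than "letters". The sketch below compares it with \p{L}; the end anchor $ and the sample strings are assumptions added for the comparison and are not part of the pattern above.

```php
<?php
$word   = '/^[0-9]{4},\s*\w+$/u';     // \w: letters, digits and underscore
$letter = '/^[0-9]{4},\s*\p{L}+$/u';  // \p{L}: letters only

var_dump(preg_match($word,   '9999,loka_123')); // int(1)
var_dump(preg_match($letter, '9999,loka_123')); // int(0)
var_dump(preg_match($letter, '9999,škofja'));   // int(1)
```

If digits and underscores after the comma are acceptable, \w is fine; if only letters are allowed, \p{L} is the stricter choice.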


Answer 3:

The problem is your regular expression. You are explicitly saying that you will only accept a b c ... z A B C ... Z. š is not in the a-z set. Remember, š is as different to s as any other character.

So if you really just want a sequence of letters, you need to test for the Unicode properties, e.g.

echo preg_match('/^[0-9]{4},[\s]*\p{L}+/u', $post);

That should work because \p{L} matches any Unicode character that is considered a letter, not just A through Z.
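To make the "not just A through Z" point concrete, here is a small sketch that runs \p{L} against words from several scripts. The sample words are chosen purely for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// \p{L} covers letters from any script, not only Latin ones.
$pattern = '/^\p{L}+$/u';

foreach (['loka', 'škofja', 'Москва', '東京', '1234'] as $word) {
    echo $word, ' => ', preg_match($pattern, $word), "\n";
}
// loka => 1, škofja => 1, Москва => 1, 東京 => 1, 1234 => 0
```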



Answer 4:

Add a u, and remember the trailing slash:

echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+/u', $post);

Edited:

echo preg_match('/^\d{4},(?:\s|\w)+/u', $post);


Tags: php regex utf-8