I'm not good with regex, but I want to use it to extract words from a string.
The words I need should have a minimum of 4 characters, and the provided string can be UTF-8.
Example string:
Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).
Desired output:
Array(
[0] => azahares
[1] => presentan
[2] => gruesos
[3] => pétalos
[4] => blancos
[5] => teñidos
[6] => rosa
[7] => violáceo
[8] => parte
[9] => externa
[10] => numerosos
[11] => estambres
)
When you use the u modifier, you can use the following pattern (demo):
The u modifier means that the pattern and subject strings are treated as UTF-8.
That should do the job for you.
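A minimal sketch of what a u-modifier pattern for this task could look like, assuming the Unicode letter class \p{L} with a minimum length of four. The pattern and the variable names are my own illustration, not necessarily the ones from the answer above:

```php
<?php
// Sample sentence from the question.
$text = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).';

// Assumed pattern: \p{L} is any Unicode letter, {4,} means four or more in a row.
// The u modifier makes PCRE treat the pattern and the subject as UTF-8.
preg_match_all('/\p{L}{4,}/u', $text, $matches);

// $matches[0] holds the full matches.
$words = $matches[0];
print_r($words);
```

Because digits and punctuation are not letters, "(20-40)" and the comma after "externa" are skipped without any extra filtering.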
Try this one:
This works if the words to look for are UTF-8 (at least 4 chars long, as per specs), consisting of alphabetic characters of ISO-8859-15 (which is fine for Spanish, but also for English, German, French, etc.):
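A rough sketch of the ISO-8859-15 idea, assuming the input is converted with iconv() first so that one byte equals one character. The byte ranges in the character class and the variable names are assumptions for illustration only:

```php
<?php
$text = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).';

// Convert to a single-byte encoding so that one byte equals one character.
// Assumes every character in $text exists in ISO-8859-15; iconv() returns false otherwise.
$latin = iconv('UTF-8', 'ISO-8859-15', $text);

// ASCII letters plus the accented letters at 0xC0-0xFF,
// skipping 0xD7 and 0xF7 (the multiplication and division signs).
preg_match_all('/[A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF]{4,}/', $latin, $m);

// Convert each match back to UTF-8.
$words = array_map(function ($w) {
    return iconv('ISO-8859-15', 'UTF-8', $w);
}, $m[0]);
print_r($words);
```

This avoids the u modifier entirely, at the cost of only handling text that fits into that one code page.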
You can use the regex below for simple strings. It will match any non-whitespace characters with min length = 4.
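A small sketch of such a non-whitespace match, assuming a single capture group and the variable name $m. The u modifier is my addition so that the length is counted in characters rather than bytes:

```php
<?php
$text = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).';

// \S matches any non-whitespace character; {4,} requires at least four in a row.
// The u modifier is an assumption here: it makes the length count characters, not bytes.
preg_match_all('/(\S{4,})/u', $text, $m);

// $m[1] holds the contents of the first capture group.
print_r($m[1]);
```

Note that tokens keep their punctuation with this approach, so the result contains "externa," with the comma and "(20-40)." as well.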
Now $m[1] contains the array you want.
Update:
As Gordon said, the pattern will also match the '(20-40)'. The unwanted numbers can be removed using this regex:
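One possible way to do that filtering, sketched under the assumption that the tokens come from a \S{4,} match. It uses preg_grep() to drop tokens containing digits and preg_replace() to trim stray punctuation; the patterns are illustrative, not the original ones:

```php
<?php
$text = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).';

// Tokens of four or more non-whitespace characters, punctuation included.
preg_match_all('/(\S{4,})/u', $text, $m);

// Drop every token that contains a digit, then reindex the array.
$words = array_values(preg_grep('/\d/', $m[1], PREG_GREP_INVERT));

// Trim leading and trailing non-letters such as commas or parentheses.
// Caveat: a token can end up shorter than four characters after trimming.
$words = preg_replace('/^\P{L}+|\P{L}+$/u', '', $words);
print_r($words);
```

The \P{L} class (anything that is not a Unicode letter) needs the u modifier to work on multibyte input.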
But I think it only works if PCRE is compiled with UTF-8 support. See the question "PHP PCRE (regex) doesn't support UTF-8?". It works on my computer, though.
and so on