I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:
~^[\p{L}\p{M}\p{N} ]+$~u
This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried the "s" modifier, but it didn’t work.
Any help is much appreciated. Thanks!
A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.
But it looks like you’re trying to match generic whitespace there. In Java, that would be an explicit character class listing each whitespace code point, which can be shortened somewhat by using ranges, to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
It also looks like you’re trying to match alphanumerics.
Alphabetics alone are usually [\pL\pM\p{Nl}]. Numerics are not so often all of \pN as often as they are either just \p{Nd} or else sometimes [\p{Nd}\p{Nl}]. Word characters are [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java’s does). That’s what \w works out to in Unicode-aware regex languages (of which Java is not one).
In older versions of Perl, you would likely write a linebreak as an explicit alternation of \r\n and the vertical whitespace characters,
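although that’s now better written as an atomic \r\n alternated with \v, which is exactly what \R matches.
Java is very clumsy at these things. There you must write a linebreak by spelling out the CR LF pair and each vertical whitespace code point yourself, which of course requires extra bbaacckkssllasshheess when written as a string.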
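As a rough illustration of the definition above (a carriage return immediately followed by a line feed, or else any single vertical whitespace character), here is a minimal Java sketch. The class name, the method name, and the exact pattern text are assumptions made for this example rather than the pattern originally posted; the code points listed are the seven vertical whitespace characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeLinebreak {
    // CR LF taken as one atomic unit, or else any single vertical
    // whitespace character. The backslashes are doubled so that the regex
    // engine, not the Java compiler, interprets the escapes.
    // (Hypothetical pattern for illustration only.)
    static final Pattern LINEBREAK = Pattern.compile(
        "(?>\\u000D\\u000A)"
        + "|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    // Counts linebreaks, treating a CR LF pair as a single break.
    static int countLinebreaks(String text) {
        Matcher m = LINEBREAK.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // CR LF, a bare LF, and a LINE SEPARATOR: three breaks in total.
        System.out.println(countLinebreaks("one\r\ntwo\nthree\u2028four"));
    }
}
```

Java 8 and later also accept \R, \h and \v directly in a pattern, so the spelled-out form is mainly useful on older runtimes.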
I give the other Java equivalences for the 14 common character-class regex escapes, rewritten so that they work with Unicode, in this answer. You may have to use those in other Java-like regex languages that aren’t sufficiently Unicode-aware.
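To tie this back to the original question, the following is a minimal sketch of a validator that accepts letters, marks, numbers and any Unicode whitespace, line breaks included. The class name, the method name and the whitespace list are assumptions made for this example: the list is the set of code points carrying the Unicode White_Space property, written with ranges, and is not necessarily the same class as the one referred to above.

```java
import java.util.regex.Pattern;

public class FormTextValidator {
    // Code points with the Unicode White_Space property, using ranges.
    // Hypothetical helper constant; the backslashes are doubled so the
    // regex engine sees the escapes.
    static final String WHITESPACE =
        "\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A"
        + "\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Letters, marks, numbers, and horizontal or vertical whitespace.
    static final Pattern VALID =
        Pattern.compile("[\\p{L}\\p{M}\\p{N}" + WHITESPACE + "]+");

    // matches() requires the whole input to fit the pattern, so no
    // explicit ^ and $ anchors are needed.
    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("first line\r\nsecond line"));
        System.out.println(isValid("semi;colon"));
    }
}
```

The reason \p{Z} alone did not help is that it covers only the separator categories; line feed and carriage return are control characters, so they have to be listed explicitly or matched through a whitespace class.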