Unicode regexp to match line-breaks?

Posted 2020-04-16 06:32

I have a form that submits data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:

~^[\p{L}\p{M}\p{N} ]+$~u

This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried the "s" modifier, but it didn’t work.

Any help is much appreciated. Thanks!

1 answer
家丑人穷心不美
#2 · 2020-04-16 06:56

A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.

But it looks like you’re trying to match generic whitespace there. In Java, that would be

 [\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]

which can be shortened by using ranges to “only” this:

 [\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
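As a rough illustration of how that shortened class could be dropped into the pattern from the question, here is a minimal Java sketch. The class name WhitespaceClassDemo and the helper isValid are made up for this example, and the original question appears to use PCRE rather than Java, so treat it as a sketch of the idea rather than a drop-in fix. Note that every backslash is doubled so that the regex engine, not the Java compiler, interprets the \uXXXX escapes.

```java
import java.util.regex.Pattern;

public class WhitespaceClassDemo {
    // Hypothetical constant: the shortened whitespace ranges from above.
    // Backslashes are doubled so the regex engine sees \u000A and so on.
    static final String WS =
        "\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
      + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Letters, marks, numbers, plus horizontal and vertical whitespace.
    // matches() already anchors at both ends, so ^ and $ are omitted.
    static final Pattern TEXT =
        Pattern.compile("[\\p{L}\\p{M}\\p{N}" + WS + "]+");

    // Hypothetical helper: true if the whole input uses only allowed characters.
    static boolean isValid(String input) {
        return TEXT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("line one\r\nline two")); // true
        System.out.println(isValid("semi;colon"));           // false
    }
}
```

With this class, text containing CR, LF, or other vertical whitespace passes validation, while punctuation is still rejected.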

It also looks like you’re trying to match alphanumerics.

  • Alphabetics alone are usually [\pL\pM\p{Nl}].
  • Numerics are less often all of \pN than they are just \p{Nd}, or sometimes [\p{Nd}\p{Nl}].
  • Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], provided your regex engine supports those sorts of operations (Java’s does). That’s what \w works out to in Unicode-aware regex languages (of which Java is not one).
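To make the difference between these classes concrete, here is a small Java sketch that compiles each one and tries a few inputs. The class name AlnumClassesDemo, the constant names, and the helper all() are invented for the example; the patterns themselves are the ones listed above, with backslashes doubled for Java string literals.

```java
import java.util.regex.Pattern;

public class AlnumClassesDemo {
    // Alphabetics: letters, marks, and letter-like numbers such as Roman numerals.
    static final Pattern ALPHABETIC = Pattern.compile("[\\pL\\pM\\p{Nl}]+");
    // Decimal digits only (any script).
    static final Pattern DECIMAL = Pattern.compile("\\p{Nd}+");
    // Every kind of number, including superscripts and fractions.
    static final Pattern ANY_NUMBER = Pattern.compile("\\pN+");
    // Identifier characters, using Java's class union and intersection.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]+");

    // Hypothetical helper: true if the pattern matches the entire string.
    static boolean all(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String superscriptTwo = "\u00B2"; // category No, not Nd
        String romanEight = "\u2167";     // category Nl
        System.out.println(all(DECIMAL, "42"));            // true
        System.out.println(all(DECIMAL, superscriptTwo));  // false
        System.out.println(all(ANY_NUMBER, superscriptTwo)); // true
        System.out.println(all(ALPHABETIC, romanEight));   // true
        System.out.println(all(IDENTIFIER, "foo_bar1"));   // true
        System.out.println(all(IDENTIFIER, "foo bar"));    // false
    }
}
```

The superscript two shows why \pN is often too broad for "digits": it is a number, but not a decimal digit.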

In older versions of Perl, you would likely write a linebreak as

 (?:\r\n|\p{VertSpace})

although that’s now better written as

 (?:(?>\r\n)|\v)

which is exactly what

 \R

matches.

Java is very clumsy at these things. There you must write a linebreak as

  (?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])

which of course requires extra bbaacckkssllasshheess when written as a string.
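For illustration, this is roughly what that looks like as an actual Java string literal, used here to split text into lines. The class name LinebreakDemo and the helper lines() are hypothetical. As far as I know, Java 8 and later also accept \R directly, so the long form should only be needed on older runtimes.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LinebreakDemo {
    // The linebreak pattern from above, with every backslash doubled.
    // The atomic group keeps CR LF together as a single linebreak.
    static final Pattern LINEBREAK = Pattern.compile(
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])");

    // Hypothetical helper: split on any Unicode linebreak,
    // keeping trailing empty lines (limit -1).
    static String[] lines(String text) {
        return LINEBREAK.split(text, -1);
    }

    public static void main(String[] args) {
        // CR LF, LF, and LINE SEPARATOR each count as one break.
        String text = "a\r\nb\nc\u2028d";
        System.out.println(Arrays.toString(lines(text))); // [a, b, c, d]
    }
}
```

Because CR LF is tried first and matched atomically, a Windows-style line ending produces one split rather than an empty line between CR and LF.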

The other Java equivalences for the 14 common character-class regex escapes so that they work with Unicode I give in this answer. You may have to use those in other Java-like regex languages that aren’t sufficiently Unicode-aware.
