Unicode regexp to match line-breaks?

I have a form whose data I want to submit to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:

~^[\p{L}\p{M}\p{N} ]+$~u

This pattern works fine until the user puts a new line in the text box. I have tried adding \p{Z} to the class, but with no success. I also tried the "s" modifier, but it didn’t work.

Any help is much appreciated. Thanks!

Tags: regex unicode character-properties line-breaks

1 answer

家丑人穷心不美

Reply #2 · 2020-04-16 06:56

A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.

But it looks like you’re trying to match generic whitespace there. In Java, that would be

 [\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]

which can be shortened by using ranges to “only” this:

 [\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
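As a rough sketch of how that ranged class could be folded into the pattern from the question, here is one way to do it in Java. The class name `WhitespaceClassDemo` and the helper `isValid` are made up for illustration; only the character list comes from the class above.

```java
import java.util.regex.Pattern;

public class WhitespaceClassDemo {
    // The ranged whitespace class from above, written as a Java string literal.
    // Every regex backslash is doubled so that the regex engine, not the
    // compiler, interprets the code point escapes.
    static final String UNICODE_WS =
        "\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
      + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Letters, marks, numbers, plus any whitespace character listed above.
    static final Pattern TEXT =
        Pattern.compile("^[\\p{L}\\p{M}\\p{N}" + UNICODE_WS + "]+$");

    // Hypothetical helper: true if the whole input consists of allowed characters.
    static boolean isValid(String input) {
        return TEXT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("first line\r\nsecond line")); // true
        System.out.println(isValid("no <tags> allowed"));         // false
    }
}
```

Because the line-break characters are now inside the class, multi-line input passes without needing any extra modifier.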

It also looks like you’re trying to match alphanumerics.

Alphabetics alone are usually [\pL\pM\p{Nl}].
Numerics are less often all of \pN than they are just \p{Nd}, or else sometimes [\p{Nd}\p{Nl}].
Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java’s does). That’s what \w works out to in Unicode-aware regex languages (of which Java is not one, at least not by default).
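To make the identifier class concrete, here is a small Java sketch that compiles it and tries a few inputs. The names `IdentifierClassDemo` and `isIdentifier` are invented for the example; the character class itself is the one given above, with backslashes doubled for the string literal.

```java
import java.util.regex.Pattern;

public class IdentifierClassDemo {
    // Letters, marks, decimal digits, letter numbers, connector punctuation,
    // plus the "other symbols" found in the Enclosed Alphanumerics block.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]+");

    // Hypothetical helper: true if every character is an identifier character.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("na\u00EFve_1")); // true: letters, underscore, digit
        System.out.println(isIdentifier("\u24B6"));       // true: circled letter A, category So
        System.out.println(isIdentifier("\u2460"));       // false: circled digit one, category No
        System.out.println(isIdentifier("a-b"));          // false: hyphen is dash punctuation
    }
}
```

The nested brackets keep the `&&` intersection confined to the enclosed-alphanumerics part, so the rest of the class stays a plain union.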

In older versions of Perl, you would likely write a linebreak as

 (?:\r\n|\p{VertSpace})

although that’s now better written as

 (?:(?>\r\n)|\v)

which is exactly what

\R

matches.

Java is very clumsy at these things. Before Java 8 added \R, you had to write a linebreak there as

  (?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])

which of course requires extra bbaacckkssllaasshheess when written as a string.
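For illustration, this is roughly what the written-out linebreak pattern looks like as a Java string literal, used here to split text into lines. The class `LinebreakDemo` and the method `splitLines` are names chosen for this sketch only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LinebreakDemo {
    // CRLF is tried first as an atomic group so that it counts as a single
    // break; otherwise any one vertical-whitespace character matches.
    static final Pattern LINEBREAK = Pattern.compile(
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])");

    // Hypothetical helper: split text on any Unicode linebreak.
    static String[] splitLines(String text) {
        return LINEBREAK.split(text);
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree\u2028four";
        System.out.println(Arrays.toString(splitLines(text))); // [one, two, three, four]
    }
}
```

Note that a CRLF pair yields one split point rather than two, which is the whole reason for listing it ahead of the single-character alternatives.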

In this answer I give the other Java equivalences for the 14 common character-class regex escapes, rewritten so that they work with Unicode. You may have to use those in other Java-like regex languages that aren’t sufficiently Unicode-aware.
