Match whitespace but not newlines

2019-01-02 14:25发布

I sometimes want to match whitespace but not newline.

So far I've been resorting to [ \t]. Is there a less awkward way?

标签: regex perl
6条回答
浮光初槿花落
2楼-- · 2019-01-02 14:53

A variation on Greg’s answer that includes carriage returns too:

/[^\S\r\n]/

This regex is safer than /[^\S\n]/ with no \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You’re unlikely to find \r without \n nowadays, but if you do find it, it couldn’t mean anything but a newline. Thus, since \r can mean a newline, we should exclude it too.

查看更多
还给你的自由
3楼-- · 2019-01-02 15:03

The below regex would match white spaces but not of a new line character.

(?:(?!\n)\s)

DEMO

If you want to add carriage return also then add \r with the | operator inside the negative lookahead.

(?:(?![\n\r])\s)

DEMO

Add + after the non-capturing group to match one or more white spaces.

(?:(?![\n\r])\s)+

DEMO

I don't know why you people failed to mention the POSIX character class [[:blank:]] which matches any horizontal whitespaces (spaces and tabs). This POSIX chracter class would work on BRE(Basic REgular Expressions), ERE(Extended Regular Expression), PCRE(Perl Compatible Regular Expression).

DEMO

查看更多
何处买醉
4楼-- · 2019-01-02 15:04

Use a double-negative:

/[^\S\n]/

To avoid platform differences warned about in perlport regarding mappings of \r and \n:

/[^\S\x0a\x0d]/

That is, not-not-whitespace or not-newline and similar for the pattern that excludes CR and NL.

Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace and not carriage return and not newline,” but don't take my word for it:

#! /usr/bin/env perl

use strict;
use warnings;

use 5.005;  # for qr//

my $ws_not_nl = qr/[^\S\x0a\x0d]/;

for (' ', '\f', '\t', '\r', '\n') {
  my $qq = qq["$_"];
  printf "%-4s => %s\n", $qq,
    (eval $qq) =~ $ws_not_nl ? "match" : "no match";
}

Output:

" "  => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match

Note the exclusion of vertical tab, but this is addressed in v5.18.

This trick is also handy for matching alphabetic characters. Remember that \w matches “word characters,” alphabetic characters but also digits and underscore. We ugly-Americans sometimes want to write it as, say,

if (/^[A-Za-z]+$/) { ... }

but a double-negative character-class can respect the locale:

if (/^[^\W\d_]+$/) { ... }

That is a bit opaque, so a POSIX character-class may be better at expressing the intent

if (/^[[:alpha:]]+$/) { ... }

or as szbalint suggested

if (/^\p{Letter}+$/) { ... }
查看更多
浪荡孟婆
5楼-- · 2019-01-02 15:06

What you are looking for is the POSIX blank character class. In Perl it is referenced as:

[[:blank:]]

in Java (don't forget to enable UNICODE_CHARACTER_CLASS):

\p{Blank}

Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).

But the problem is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters which are not considered whitespace in Unicode:

The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with 200C and 200D, occur within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. They're more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it used as other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.

In Java:

static public final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFFEF]"
查看更多
宁负流年不负卿
6楼-- · 2019-01-02 15:07

Perl versions 5.10 and later support subsidiary vertical and horizontal character classes, \v and \h, as well as the generic whitespace character class \s

The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters

U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (not matched by \s)

U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

The vertical space pattern \v is less useful, but matches these characters

U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (not matched by \s)

U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR

There are seven vertical whitespace characters which match \v and eighteen horizontal ones which match \h. \s matches twenty-three characters

All whitespace characters are either vertical or horizontal with no overlap, but they are not proper subsets because \h also matches U+00A0 NO-BREAK SPACE, and \v also matches U+0085 NEXT LINE, neither of which are matched by \s

查看更多
查无此人
7楼-- · 2019-01-02 15:15

m/ /g just give space in / /, and it will work. Or use \S — it will replace all the special characters like tab, newlines, spaces, and so on.

查看更多
登录 后发表回答