可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
The Java API for regular expressions states that \s
will match whitespace. So the regex \\s\\s
should match two spaces.
Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");
The aim of this is to replace all instances of two consecutive whitespace with a single space. However this does not actually work.
Am I having a grave misunderstanding of regexes or the term "whitespace"?
回答1:
Yeah, you need to grab the result of matcher.replaceAll():
String result = matcher.replaceAll(" ");
System.out.println(result);
回答2:
You can’t use \s
in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.
Unicode defines 26 code points as \p{White_Space}
: 20 of them are various sorts of \pZ
GeneralCategory=Separator, and the remaining 6 are \p{Cc}
GeneralCategory=Control.
White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:
String whitespace_chars = "" /* dummy empty string for homogeneity */
+ "\\u0009" // CHARACTER TABULATION
+ "\\u000A" // LINE FEED (LF)
+ "\\u000B" // LINE TABULATION
+ "\\u000C" // FORM FEED (FF)
+ "\\u000D" // CARRIAGE RETURN (CR)
+ "\\u0020" // SPACE
+ "\\u0085" // NEXT LINE (NEL)
+ "\\u00A0" // NO-BREAK SPACE
+ "\\u1680" // OGHAM SPACE MARK
+ "\\u180E" // MONGOLIAN VOWEL SEPARATOR
+ "\\u2000" // EN QUAD
+ "\\u2001" // EM QUAD
+ "\\u2002" // EN SPACE
+ "\\u2003" // EM SPACE
+ "\\u2004" // THREE-PER-EM SPACE
+ "\\u2005" // FOUR-PER-EM SPACE
+ "\\u2006" // SIX-PER-EM SPACE
+ "\\u2007" // FIGURE SPACE
+ "\\u2008" // PUNCTUATION SPACE
+ "\\u2009" // THIN SPACE
+ "\\u200A" // HAIR SPACE
+ "\\u2028" // LINE SEPARATOR
+ "\\u2029" // PARAGRAPH SEPARATOR
+ "\\u202F" // NARROW NO-BREAK SPACE
+ "\\u205F" // MEDIUM MATHEMATICAL SPACE
+ "\\u3000" // IDEOGRAPHIC SPACE
;
/* A \s that actually works for Java’s native character set: Unicode */
String whitespace_charclass = "[" + whitespace_chars + "]";
/* A \S that actually works for Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";
Now you can use whitespace_charclass + "+"
as the pattern in your replaceAll
.
=begin soapbox
Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.
And if you think white space is bad, you should see what you have to do to get \w
and \b
to finally behave properly!
Yes, it’s possible, and yes, it’s a mindnumbing mess. That’s being charitable, even. The easiest way to get a standards-comforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.
If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.
=end soapbox
回答3:
For Java (not php, not javascript, not anyother):
txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
回答4:
when I sended a question to a Regexbuddy (regex developer application) forum, I got more exact reply to my \s Java question:
"Message author: Jan Goyvaerts
In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).
... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."
回答5:
Seems to work for me:
String s = " a b c";
System.out.println("\"" + s.replaceAll("\\s\\s", " ") + "\"");
will print:
" a b c"
I think you intended to do this instead of your code:
Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = "";
if (matcher.find()) {
result = matcher.replaceAll(" ");
}
System.out.println(result);
回答6:
Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
boolean flag = true;
while(flag)
{
//Update your original search text with the result of the replace
modLine = matcher.replaceAll(" ");
//reset matcher to look at this "new" text
matcher = whitespace.matcher(modLine);
//search again ... and if no match , set flag to false to exit, else run again
if(!matcher.find())
{
flag = false;
}
}
回答7:
For your purpose you can use this snnippet:
import org.apache.commons.lang3.StringUtils;
StrintUtils.StringUtils.normalizeSpace(string);
this will normalize the spacing to single and will strip off the starting and trailing whitespaces as well.
回答8:
Use of whitespace in RE is a pain, but I believe they work. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use RE (uncomment the println() to view how the matcher is breaking up the String), here is a sample code:
import java.util.regex.*;
public class Two21WS {
private String str = "";
private Pattern pattern = Pattern.compile ("\\s{2,}"); // multiple spaces
public Two21WS (String s) {
StringBuffer sb = new StringBuffer();
Matcher matcher = pattern.matcher (s);
int startNext = 0;
while (matcher.find (startNext)) {
if (startNext == 0)
sb.append (s.substring (0, matcher.start()));
else
sb.append (s.substring (startNext, matcher.start()));
sb.append (" ");
startNext = matcher.end();
//System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
// ", sb: \"" + sb.toString() + "\"");
}
sb.append (s.substring (startNext));
str = sb.toString();
}
public String toString () {
return str;
}
public static void main (String[] args) {
String tester = " a b cdef gh ij kl";
System.out.println ("Initial: \"" + tester + "\"");
System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
}}
It produces the following (compile with javac and run at the command prompt):
% java Two21WS
Initial: " a b cdef gh ij kl"
Two21WS: " a b cdef gh ij kl"