String.replaceAll is considerably slower than doin

Posted 2019-01-14 15:18

Question:

I have an old piece of code that performs find and replace of tokens within a string.

It receives a map of from and to pairs, iterates over them and for each of those pairs, iterates over the target string, looks for the from using indexOf(), and replaces it with the value of to. It does all the work on a StringBuffer and eventually returns a String.
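The old code itself is not shown, so the following is only a guess at its shape, based on the description above: a map of from/to pairs applied with indexOf() on a StringBuffer. The class name TokenReplacer and the method name replaceTokens are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TokenReplacer {
    // Hypothetical reconstruction: replaces every occurrence of each key with its value.
    static String replaceTokens(String target, Map<String, String> tokens) {
        StringBuffer buf = new StringBuffer(target);
        for (Map.Entry<String, String> entry : tokens.entrySet()) {
            String from = entry.getKey();
            String to = entry.getValue();
            if (from.isEmpty()) {
                continue; // an empty token would match at every position
            }
            int idx = buf.indexOf(from);
            while (idx != -1) {
                buf.replace(idx, idx + from.length(), to);
                // Resume after the inserted text so it is not scanned again
                idx = buf.indexOf(from, idx + to.length());
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put(",", "");
        tokens.put(".", "");
        tokens.put(" ", "");
        System.out.println(replaceTokens("a, b. c", tokens)); // prints "abc"
    }
}
```

Note that no regular expression is involved here: each token is matched literally, and all edits happen in one mutable buffer.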

I replaced that code with this line: replaceAll("[,. ]*", "");
And I ran some comparative performance tests.
When comparing for 1,000,000 iterations, I got this:

Old Code: 1287ms
New Code: 4605ms

3 times longer!

I then tried replacing it with 3 calls to replace:
replace(",", "");
replace(".", "");
replace(" ", "");

This gave the following results:

Old Code: 1295ms
New Code: 3524ms

2 times longer!

Any idea why replace and replaceAll are so inefficient? Can I do something to make it faster?


Edit: Thanks for all the answers. The main problem was indeed that [,. ]* did not do what I wanted it to do. Changing it to [,. ]+ brought the performance almost level with the non-regex solution. Using a pre-compiled regex helped, but only marginally. (It is a solution very applicable to my problem.)

Test code:
Replace string with Regex: [,. ]*
Replace string with Regex: [,. ]+
Replace string with Regex: [,. ]+ and Pre-Compiled Pattern
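The linked test code is not reproduced here. As a rough sketch of how the three variants listed above could be compared, assuming a made-up input string and a crude wall-clock timer (the class ReplaceTiming and all its method names are hypothetical, and this is not the asker's benchmark):

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

final class ReplaceTiming {
    // Compiled once and reused, as in the third variant
    static final Pattern ONE_OR_MORE = Pattern.compile("[,. ]+");

    // Accumulates result lengths so the loop body has an observable effect
    static long sink;

    static String withStar(String s) {
        return s.replaceAll("[,. ]*", "");
    }

    static String withPlus(String s) {
        return s.replaceAll("[,. ]+", "");
    }

    static String withPrecompiled(String s) {
        return ONE_OR_MORE.matcher(s).replaceAll("");
    }

    // Crude wall-clock measurement in milliseconds; no warm-up, so treat the numbers as rough
    static long timeMillis(UnaryOperator<String> variant, String input, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += variant.apply(input).length();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String input = "one, two. three four"; // hypothetical test input
        int iterations = 1_000_000;            // iteration count used in the question
        System.out.println("[,. ]*      : " + timeMillis(ReplaceTiming::withStar, input, iterations) + "ms");
        System.out.println("[,. ]+      : " + timeMillis(ReplaceTiming::withPlus, input, iterations) + "ms");
        System.out.println("precompiled : " + timeMillis(ReplaceTiming::withPrecompiled, input, iterations) + "ms");
    }
}
```

All three variants produce the same output string; only the amount of work differs. For trustworthy numbers a proper benchmarking harness would be a better choice than a hand-rolled loop like this.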

Answer 1:

While using regular expressions has some performance cost, it should not be this bad.

Note that using String.replaceAll() will compile the regular expression each time you call it.

You can avoid that by explicitly using a Pattern object:

Pattern p = Pattern.compile("[,. ]+");

// repeat only the following part:
String output = p.matcher(input).replaceAll("");

Note also that using + instead of * avoids replacing empty strings and therefore might also speed up the process.



Answer 2:

replace and replaceAll use regex internally, which in most cases has a serious performance impact compared to, e.g., StringUtils.replace(..).

String.replaceAll():

public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}

String.replace() uses Pattern.compile underneath.

public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL)
            .matcher(this)
            .replaceAll(Matcher.quoteReplacement(replacement.toString()));
}

Also see Replace all occurrences of substring in a string - which is more efficient in Java?



Answer 3:

As I noted in a comment, [,. ]* matches the empty string "". So every position between characters matches the pattern. The performance suffers because you are replacing a lot of "" with "".

Try doing this:

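Pattern p = Pattern.compile("[,. ]*");
// "$" is special in a replacement string, so quote it to get a literal "$$$"
System.out.println(p.matcher("Hello World!").replaceAll(Matcher.quoteReplacement("$$$")));

It returns:

$$$H$$$e$$$l$$$l$$$o$$$$$$W$$$o$$$r$$$l$$$d$$$!$$$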

No wonder it is slower than doing it "by hand"! You should try with [,. ]+
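To make the difference concrete, here is a small sketch that counts how many matches each pattern produces on the same input. The class MatchCounter and the method countMatches are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchCounter {
    // Counts how many times the regex matches, including empty matches
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "Hello World!";
        System.out.println(countMatches("[,. ]*", input)); // prints 13
        System.out.println(countMatches("[,. ]+", input)); // prints 1
    }
}
```

With * the matcher reports an empty match at almost every position plus the one real match on the space, and each of those triggers a replacement. With + only the single space matches.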



Answer 4:

When it comes to replaceAll("[,. ]*", "") it's not that big of a surprise since it relies on regular expressions. The regex engine creates an automaton which it runs over the input. Some overhead is expected.

The second approach (replace(",", "")...) also uses regular expressions internally. (Here, however, the given pattern is compiled using Pattern.LITERAL, so the regular expression overhead should be negligible.) In this case the slowdown is probably due to the fact that Strings are immutable (however small the change, a new string is created), which makes them less efficient than StringBuffers, which manipulate the string in place.
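As an illustration of the in-place idea, three chained replace calls create an intermediate String at each step, whereas a single pass over the characters into one mutable buffer does not. A minimal sketch, assuming only the three characters from the question need to be removed (CharStripper and strip are hypothetical names, and no timing is claimed for it here):

```java
final class CharStripper {
    // Removes ',', '.' and ' ' in one pass, appending the kept characters to a single buffer
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != ',' && c != '.' && c != ' ') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a, b. c")); // prints "abc"
    }
}
```

StringBuilder is used instead of StringBuffer because no synchronization is needed in a local variable; the approach is otherwise the same.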