Java Replacing multiple different substring in a s

Posted 2019-01-01 15:10

I need to replace many different substrings in a string in the most efficient way. Is there another way, other than the brute-force approach of replacing each one using String.replace?

10 answers
时光乱了年华
Reply #2 · 2019-01-01 15:43

The code below is based on Todd Owen's answer. That solution has the problem that if the search strings contain characters that have special meaning in regular expressions, you can get unexpected results. I also wanted to be able to optionally do a case-insensitive search. Here is what I came up with:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Performs simultaneous search/replace of multiple strings. Case Sensitive!
 */
public static String replaceMultiple(String target, Map<String, String> replacements) {
  return replaceMultiple(target, replacements, true);
}

/**
 * Performs simultaneous search/replace of multiple strings.
 * 
 * @param target        string to perform replacements on.
 * @param replacements  map where key represents value to search for, and value represents replacement.
 * @param caseSensitive whether or not the search is case-sensitive.
 * @return replaced string
 */
public static String replaceMultiple(String target, Map<String, String> replacements, boolean caseSensitive) {
  if(target == null || "".equals(target) || replacements == null || replacements.size() == 0)
    return target;

  //if we are doing case-insensitive replacements, we need to make the map case-insensitive--make a new map with all-lower-case keys
  if(!caseSensitive) {
    Map<String, String> altReplacements = new HashMap<String, String>(replacements.size());
    for(String key : replacements.keySet())
      altReplacements.put(key.toLowerCase(), replacements.get(key));

    replacements = altReplacements;
  }

  StringBuilder patternString = new StringBuilder();
  if(!caseSensitive)
    patternString.append("(?i)");

  patternString.append('(');
  boolean first = true;
  for(String key : replacements.keySet()) {
    if(first)
      first = false;
    else
      patternString.append('|');

    patternString.append(Pattern.quote(key));
  }
  patternString.append(')');

  Pattern pattern = Pattern.compile(patternString.toString());
  Matcher matcher = pattern.matcher(target);

  StringBuffer res = new StringBuffer();
  while(matcher.find()) {
    String match = matcher.group(1);
    if(!caseSensitive)
      match = match.toLowerCase();
    // quoteReplacement keeps '$' and '\' in the replacement text literal
    matcher.appendReplacement(res, Matcher.quoteReplacement(replacements.get(match)));
  }
  matcher.appendTail(res);

  return res.toString();
}

Here are my unit test cases:

@Test
public void replaceMultipleTest() {
  assertNull(ExtStringUtils.replaceMultiple(null, null));
  assertNull(ExtStringUtils.replaceMultiple(null, Collections.<String, String>emptyMap()));
  assertEquals("", ExtStringUtils.replaceMultiple("", null));
  assertEquals("", ExtStringUtils.replaceMultiple("", Collections.<String, String>emptyMap()));

  assertEquals("folks, we are not sane anymore. with me, i promise you, we will burn in flames", ExtStringUtils.replaceMultiple("folks, we are not winning anymore. with me, i promise you, we will win big league", makeMap("win big league", "burn in flames", "winning", "sane")));

  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abccbaabccba", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaCBAbcCCBb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a"), false));

  assertEquals("c colon  backslash temp backslash  star  dot  star ", ExtStringUtils.replaceMultiple("c:\\temp\\*.*", makeMap(".", " dot ", ":", " colon ", "\\", " backslash ", "*", " star "), false));
}

private Map<String, String> makeMap(String ... vals) {
  Map<String, String> map = new HashMap<String, String>(vals.length / 2);
  for(int i = 1; i < vals.length; i+= 2)
    map.put(vals[i-1], vals[i]);
  return map;
}
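
One thing the code above leaves to the map's iteration order is which key wins when one search string is a prefix of another (for example "win" and "winning"): Java's regex alternation takes the first alternative that matches, not the longest. Here is a minimal sketch that sorts the keys longest-first before building the pattern. It assumes Java 9 or later (for Matcher.replaceAll with a function), and the class and method names are made up for illustration:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; the name is illustrative only.
class MultiReplace {

  /** Replaces every key of the map found in target with its value, in one pass. */
  public static String replaceLongestFirst(String target, Map<String, String> replacements) {
    if (target == null || target.isEmpty() || replacements == null || replacements.isEmpty()) {
      return target;
    }

    // Longest keys first, so "winning" is tried before "win".
    String alternation = replacements.keySet().stream()
        .sorted(Comparator.<String>comparingInt(String::length).reversed())
        .map(Pattern::quote)
        .collect(Collectors.joining("|"));

    Matcher matcher = Pattern.compile(alternation).matcher(target);

    // quoteReplacement keeps '$' and '\' in the values literal.
    return matcher.replaceAll(m -> Matcher.quoteReplacement(replacements.get(m.group())));
  }
}
```

With the keys "win" and "winning" in the map, "winning" in the input is then replaced as a whole, whatever order the map iterates in.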
千与千寻千般痛.
Reply #3 · 2019-01-01 15:43

This worked for me:

String result = input.replaceAll("string1|string2|string3","replacementString");

Example:

String input = "applemangobananaarefriuits";
String result = input.replaceAll("mango|are|ts","-");
System.out.println(result);

Output: apple-banana-friui-
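
Note that replaceAll treats its first argument as a regular expression and substitutes the same replacement for every alternative. If the search strings are literals that might contain metacharacters such as '.' or '*', one option is to quote each of them before joining. A small sketch (the class and method names are illustrative, not from any library):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; names are illustrative only.
class LiteralAlternation {

  /** Replaces every literal occurrence of any needle with the same replacement. */
  public static String replaceAny(String input, String replacement, String... needles) {
    if (needles.length == 0) {
      return input; // nothing to search for
    }
    // Quote each needle so '.', '*', '|' and friends are matched literally.
    String regex = Arrays.stream(needles)
        .map(Pattern::quote)
        .collect(Collectors.joining("|"));
    return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
  }

  public static void main(String[] args) {
    System.out.println(replaceAny("applemangobananaarefriuits", "-", "mango", "are", "ts")); // apple-banana-friui-
    System.out.println(replaceAny("a.b*c", "-", ".", "*")); // a-b-c
  }
}
```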

笑指拈花
Reply #4 · 2019-01-01 15:44

If you are going to be changing a String many times, then it is usually more efficient to use a StringBuilder (but measure your performance to find out):

String str = "The rain in Spain falls mainly on the plain";
StringBuilder sb = new StringBuilder(str);
// do your replacing in sb - although you'll find this trickier than simply using String
String newStr = sb.toString();

Every time you do a replace on a String, a new String object is created, because Strings are immutable. StringBuilder is mutable, that is, it can be changed as much as you want.
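
To make the "do your replacing in sb" step concrete, here is one possible sketch using StringBuilder.indexOf and StringBuilder.replace. The helper class and method are hypothetical names of mine:

```java
// Hypothetical helper; names are illustrative only.
class BuilderReplace {

  /** Replaces every occurrence of 'from' with 'to' inside sb, in place. */
  public static void replaceAll(StringBuilder sb, String from, String to) {
    if (from.isEmpty()) {
      return; // nothing sensible to search for
    }
    int index = sb.indexOf(from);
    while (index != -1) {
      sb.replace(index, index + from.length(), to);
      // Resume after the inserted text so 'to' is never rescanned.
      index = sb.indexOf(from, index + to.length());
    }
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder("The rain in Spain falls mainly on the plain");
    replaceAll(sb, "ain", "AIN");
    replaceAll(sb, "the", "a");
    System.out.println(sb); // The rAIN in SpAIN falls mAINly on a plAIN
  }
}
```

Keep in mind that the replacements are applied one after another, so a later search string can match text produced by an earlier replacement. Each replace call also shifts the rest of the buffer, which is why this is worth measuring before assuming it is faster.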

梦醉为红颜
Reply #5 · 2019-01-01 15:49

If the string you are operating on is very long, or you are operating on many strings, then it could be worthwhile using a java.util.regex.Matcher (this requires time up-front to compile, so it won't be efficient if your input is very small or your search pattern changes frequently).

Below is a full example, based on a list of tokens taken from a map. (Uses StringUtils from Apache Commons Lang).

Map<String,String> tokens = new HashMap<String,String>();
tokens.put("cat", "Garfield");
tokens.put("beverage", "coffee");

String template = "%cat% really needs some %beverage%.";

// Create pattern of the format "%(cat|beverage)%"
String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(template);

StringBuffer sb = new StringBuffer();
while(matcher.find()) {
    // quoteReplacement keeps '$' and '\' in the token value literal
    matcher.appendReplacement(sb, Matcher.quoteReplacement(tokens.get(matcher.group(1))));
}
matcher.appendTail(sb);

System.out.println(sb.toString());

Once the regular expression is compiled, scanning the input string is generally very quick (although if your regular expression is complex or involves backtracking then you would still need to benchmark in order to confirm this!)
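
Since the compile step is the up-front cost, it pays off most when the compiled Pattern is kept and reused across many inputs. Below is a rough sketch of that idea without the Commons Lang dependency. TemplateRenderer is a hypothetical class name, and the token names are additionally passed through Pattern.quote, which the example above does not do:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class: compiles the token pattern once and reuses it for many templates.
class TemplateRenderer {
  private final Map<String, String> tokens;
  private final Pattern pattern;

  public TemplateRenderer(Map<String, String> tokens) {
    this.tokens = new HashMap<>(tokens); // defensive copy
    String names = this.tokens.keySet().stream()
        .map(Pattern::quote)
        .collect(Collectors.joining("|"));
    // Pattern of the format "%(cat|beverage)%", compiled a single time.
    this.pattern = Pattern.compile("%(" + names + ")%");
  }

  public String render(String template) {
    if (tokens.isEmpty()) {
      return template; // no tokens, nothing to replace
    }
    Matcher matcher = pattern.matcher(template);
    StringBuffer sb = new StringBuffer();
    while (matcher.find()) {
      matcher.appendReplacement(sb, Matcher.quoteReplacement(tokens.get(matcher.group(1))));
    }
    matcher.appendTail(sb);
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> tokens = new HashMap<>();
    tokens.put("cat", "Garfield");
    tokens.put("beverage", "coffee");
    TemplateRenderer renderer = new TemplateRenderer(tokens);
    System.out.println(renderer.render("%cat% really needs some %beverage%.")); // Garfield really needs some coffee.
  }
}
```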

浮光初槿花落
Reply #6 · 2019-01-01 15:53

Algorithm

One of the most efficient ways to replace matching strings (without regular expressions) is to use the Aho-Corasick algorithm with a performant Trie (pronounced "try"), a fast hashing algorithm, and an efficient collections implementation.
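
To give a feel for why a trie helps, here is a deliberately simplified sketch. It is not Aho-Corasick itself (there are no failure links, so the scan restarts at each position), just a plain trie walk that picks the longest key starting at each index. TrieReplacer and its methods are names I made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified trie scan, NOT a full Aho-Corasick automaton (no failure links).
class TrieReplacer {

  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    String replacement; // non-null when a key ends at this node
  }

  private final Node root = new Node();

  public TrieReplacer(Map<String, String> definitions) {
    for (Map.Entry<String, String> e : definitions.entrySet()) {
      if (e.getKey().isEmpty()) {
        continue; // an empty key cannot be matched usefully
      }
      Node node = root;
      for (char c : e.getKey().toCharArray()) {
        node = node.children.computeIfAbsent(c, k -> new Node());
      }
      node.replacement = e.getValue();
    }
  }

  /** Replaces the longest key found at each position; replaced text is never rescanned. */
  public String replace(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
      Node node = root;
      String best = null;
      int bestEnd = -1;
      // Walk the trie as far as the text allows, remembering the longest key seen.
      for (int j = i; j < text.length(); j++) {
        node = node.children.get(text.charAt(j));
        if (node == null) {
          break;
        }
        if (node.replacement != null) {
          best = node.replacement;
          bestEnd = j + 1;
        }
      }
      if (best != null) {
        sb.append(best);
        i = bestEnd;
      } else {
        sb.append(text.charAt(i));
        i++;
      }
    }
    return sb.toString();
  }
}
```

All keys are checked in a single left-to-right pass, at a cost that depends on the longest key rather than on the number of keys. Aho-Corasick improves on this by never backing up in the text.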

Simple Code

Perhaps the simplest code to write leverages StringUtils.replaceEach from Apache Commons Lang as follows:

  private String testStringUtils(
    final String text, final Map<String, String> definitions ) {
    final String[] keys = keys( definitions );
    final String[] values = values( definitions );

    return StringUtils.replaceEach( text, keys, values );
  }

This slows down on large texts.
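
The snippets in this answer call keys( definitions ) and values( definitions ), which are not shown. A plausible version of those helpers, purely as an assumption on my part, could look like this. The important property is that both arrays come out in the same order, because replaceEach pairs search strings and replacements by index:

```java
import java.util.Map;

// Hypothetical versions of the helpers the answer's code relies on.
class DefinitionArrays {

  /** Returns the map's keys in iteration order. */
  public static String[] keys(Map<String, String> definitions) {
    return definitions.keySet().toArray(new String[0]);
  }

  /** Returns the map's values in the same iteration order as keys(). */
  public static String[] values(Map<String, String> definitions) {
    return definitions.values().toArray(new String[0]);
  }
}
```

For the standard Map implementations, keySet() and values() iterate in the same order as long as the map is not modified between the two calls.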

Fast Code

Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity that becomes an implementation detail by using a façade with the same method signature:

  private String testBorAhoCorasick(
    final String text, final Map<String, String> definitions ) {
    // Create a buffer sufficiently large that re-allocations are minimized.
    final StringBuilder sb = new StringBuilder( text.length() << 1 );

    final TrieBuilder builder = Trie.builder();
    builder.onlyWholeWords();
    builder.removeOverlaps();

    final String[] keys = keys( definitions );

    for( final String key : keys ) {
      builder.addKeyword( key );
    }

    final Trie trie = builder.build();
    final Collection<Emit> emits = trie.parseText( text );

    int prevIndex = 0;

    for( final Emit emit : emits ) {
      final int matchIndex = emit.getStart();

      sb.append( text.substring( prevIndex, matchIndex ) );
      sb.append( definitions.get( emit.getKeyword() ) );
      prevIndex = emit.getEnd() + 1;
    }

    // Add the remainder of the string (contains no more matches).
    sb.append( text.substring( prevIndex ) );

    return sb.toString();
  }

Benchmarks

For the benchmarks, the buffer was created using randomNumeric as follows:

  private final static int TEXT_SIZE = 1000;
  private final static int MATCHES_DIVISOR = 10;

  private final static StringBuilder SOURCE
    = new StringBuilder( randomNumeric( TEXT_SIZE ) );

Where MATCHES_DIVISOR dictates the number of variables to inject:

  private void injectVariables( final Map<String, String> definitions ) {
    for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
      final int r = current().nextInt( 1, SOURCE.length() );
      SOURCE.insert( r, randomKey( definitions ) );
    }
  }

The benchmark code itself (JMH seemed overkill):

long duration = System.nanoTime();
final String result = testBorAhoCorasick( text, definitions );
duration = System.nanoTime() - duration;
System.out.println( elapsed( duration ) );
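
The elapsed helper is not shown above. Judging only by the format of the results below (for example "25 seconds, 25533 millis"), it presumably prints whole seconds followed by total milliseconds. A guess at what it might look like:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical reconstruction of the elapsed() helper, inferred from the output format.
class Elapsed {

  /** Formats a duration in nanoseconds as "<whole seconds> seconds, <total millis> millis". */
  public static String elapsed(long nanos) {
    long seconds = TimeUnit.NANOSECONDS.toSeconds(nanos);
    long millis = TimeUnit.NANOSECONDS.toMillis(nanos);
    return seconds + " seconds, " + millis + " millis";
  }

  public static void main(String[] args) {
    System.out.println(elapsed(25_533_000_000L)); // 25 seconds, 25533 millis
  }
}
```

A single timed call like this includes JIT warm-up, so the numbers are only a rough guide.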

1,000,000 : 1,000

A simple micro-benchmark with 1,000,000 characters and 1,000 randomly-placed strings to replace.

  • testStringUtils: 25 seconds, 25533 millis
  • testBorAhoCorasick: 0 seconds, 68 millis

No contest.

10,000 : 1,000

Using 10,000 characters and 1,000 matching strings to replace:

  • testStringUtils: 1 seconds, 1402 millis
  • testBorAhoCorasick: 0 seconds, 37 millis

The gap narrows.

1,000 : 10

Using 1,000 characters and 10 matching strings to replace:

  • testStringUtils: 0 seconds, 7 millis
  • testBorAhoCorasick: 0 seconds, 19 millis

For short strings, the overhead of setting up Aho-Corasick outweighs its benefit, so the brute-force StringUtils.replaceEach is faster.

A hybrid approach based on text length is possible, to get the best of both implementations.
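
One way such a hybrid could be wired up is sketched below. The two strategies are passed in as functions so the example stays self-contained. HybridReplacer is a hypothetical name, and the threshold is something you would have to find by measuring on your own data:

```java
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical dispatcher; the threshold must come from your own measurements.
class HybridReplacer {
  private final int threshold;
  private final BiFunction<String, Map<String, String>, String> smallTextStrategy;
  private final BiFunction<String, Map<String, String>, String> largeTextStrategy;

  public HybridReplacer(
      int threshold,
      BiFunction<String, Map<String, String>, String> smallTextStrategy,
      BiFunction<String, Map<String, String>, String> largeTextStrategy) {
    this.threshold = threshold;
    this.smallTextStrategy = smallTextStrategy;
    this.largeTextStrategy = largeTextStrategy;
  }

  /** Picks the strategy by text length, then delegates to it. */
  public String replace(String text, Map<String, String> definitions) {
    BiFunction<String, Map<String, String>, String> strategy =
        text.length() < threshold ? smallTextStrategy : largeTextStrategy;
    return strategy.apply(text, definitions);
  }
}
```

The brute-force method would go in as the small-text strategy and the Aho-Corasick one as the large-text strategy, for instance as method references.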

Implementations

Consider comparing other implementations for text longer than 1 MB.

何处买醉
Reply #7 · 2019-01-01 15:57

StringBuilder will perform replacements more efficiently, since its character array buffer can be sized to a required length. StringBuilder is designed for more than appending!

Of course, the real question is whether this is an optimisation too far. The JVM is very good at handling the creation of multiple objects and the subsequent garbage collection, and, as with all optimisation questions, my first question is whether you've measured this and determined that it's a problem.
