I need to replace many different sub-string in a string in the most efficient way. is there another way other then the brute force way of replacing each field using string.replace ?
相关问题
- Delete Messages from a Topic in Apache Kafka
- Jackson Deserialization not calling deserialize on
- How to maintain order of key-value in DataFrame sa
- StackExchange API - Deserialize Date in JSON Respo
- Difference between Types.INTEGER and Types.NULL in
The below is based on Todd Owen's answer. That solution has the problem that if the replacements contain characters that have special meaning in regular expressions, you can get unexpected results. I also wanted to be able to optionally do a case-insensitive search. Here is what I came up with:
Here are my unit test cases:
This worked for me:
Example:
Output: apple-banana-friui-
If you are going to be changing a String many times, then it is usually more efficient to use a StringBuilder (but measure your performance to find out):
Every time you do a replace on a String, a new String object is created, because Strings are immutable. StringBuilder is mutable, that is, it can be changed as much as you want.
If the string you are operating on is very long, or you are operating on many strings, then it could be worthwhile using a java.util.regex.Matcher (this requires time up-front to compile, so it won't be efficient if your input is very small or your search pattern changes frequently).
Below is a full example, based on a list of tokens taken from a map. (Uses StringUtils from Apache Commons Lang).
Once the regular expression is compiled, scanning the input string is generally very quick (although if your regular expression is complex or involves backtracking then you would still need to benchmark in order to confirm this!)
Algorithm
One of the most efficient ways to replace matching strings (without regular expressions) is to use the Aho-Corasick algorithm with a performant Trie (pronounced "try"), fast hashing algorithm, and efficient collections implementation.
Simple Code
Perhaps the simplest code to write leverages Apache's
StringUtils.replaceEach
as follows:This slows down on large texts.
Fast Code
Bor's implementation of the Aho-Corasick algorithm introduces a bit more complexity that becomes an implementation detail by using a façade with the same method signature:
Benchmarks
For the benchmarks, the buffer was created using randomNumeric as follows:
Where
MATCHES_DIVISOR
dictates the number of variables to inject:The benchmark code itself (JMH seemed overkill):
1,000,000 : 1,000
A simple micro-benchmark with 1,000,000 characters and 1,000 randomly-placed strings to replace.
No contest.
10,000 : 1,000
Using 10,000 characters and 1,000 matching strings to replace:
The divide closes.
1,000 : 10
Using 1,000 characters and 10 matching strings to replace:
For short strings, the overhead of setting up Aho-Corasick eclipses the brute-force approach by
StringUtils.replaceEach
.A hybrid approach based on text length is possible, to get the best of both implementations.
Implementations
Consider comparing other implementations for text longer than 1 MB, including:
Papers
Papers and information relating to the algorithm:
StringBuilder
will perform replace more efficiently, since its character array buffer can be specified to a required length.StringBuilder
is designed for more than appending!Of course the real question is whether this is an optimisation too far ? The JVM is very good at handling creation of multiple objects and the subsequent garbage collection, and like all optimisation questions, my first question is whether you've measured this and determined that it's a problem.