Fastest way to replace multiple strings in a huge

2019-03-08 19:53发布

问题:

I m looking for the fastest way to replace multiple (~500) substrings of a big (~1mb) string. Whatever I have tried it seems that String.Replace is the fastest way of doing it.

I just care about the fastest possible way. Not code readability, maintainability etc. I dont care if I need to use unsafe code or pre-process the original string either.

EDIT: After the comments I have added some more details:

Each replace iteration will replace ABC on the string with some other string (different per replace iteration). The string to replace will ALWAYS be the same - ABC will always be ABC. Never ABD. So if there are 400.000 thousands replace iterations. The same string - ABC - will be replaced with some other (different) string each time.

I can be in control on what ABC is. I can make it super-short or super-long as long as it doesn't affect the results. Clearly ABC can't be hello cause hello will exist as a word in most of the input strings.

Example input: ABCDABCABCDABCABCDABCABCDABCD

Example replace from string: BC

Example replace with strings: AA, BB, CC, DD, EE (5 iterations)

Example outputs:

AAADAAAAAADAAAAAADAAAAAADAAAD
ABBDABBABBDABBABBDABBABBDABBD
ACCDACCACCDACCACCDACCACCDACCD
ADDDADDADDDADDADDDADDADDDADDD
AEEDAEEAEEDAEEAEEDAEEAEEDAEED

Average case: Input string is 100-200kb with 40.000 replace iterations. Worst case: Input string is 1-2mb with 400.000 replace iterations.

I can do ANYTHING. Do it in parallel, do it unsafe, etc. It doesnt matter how I do it. What matters is that it needs to be as fast as it gets.

Thanks

回答1:

As I were mildly interested in this problem, I crafted few solutions. With hardcore optimizations it's possible to go down even more.

To get the latest source: https://github.com/ChrisEelmaa/StackOverflow/blob/master/FastReplacer.cs

And the output

-------------------------------------------------------
| Implementation       | Average | Separate runs      |
|----------------------+---------+--------------------|
| Simple               |    3485 | 9002, 4497, 443, 0 |
| SimpleParallel       |    1298 | 3440, 1606, 146, 0 |
| ParallelSubstring    |     470 | 1259, 558, 64, 0   |
| Fredou unsafe        |     356 | 953, 431, 41, 0    |
| Unsafe+unmanaged_mem |      92 | 229, 114, 18, 8    |
-------------------------------------------------------

You won't probably beat the .NET guys in crafting your own replace method, it's most likely already using unsafe. I do believe you can get it down by factor of two if you write it completely in C.

My implementations might be buggy, but you can get the general idea.



回答2:

Using unsafe and compiled as x64

result:

Implementation       | Exec   | GC
#1 Simple            | 4706ms |  0ms
#2 Simple parallel   | 2265ms |  0ms
#3 ParallelSubstring |  800ms | 21ms
#4 Fredou unsafe     |  432ms | 15ms

take the code of Erti-Chris Eelmaa and replace my previous one with this.

I don't think I will do another iteration but i did learn a few thing with unsafe which is a good thing :-)

    private unsafe static void FredouImplementation(string input, int inputLength, string replace, string[] replaceBy)
    {
        var indexes = new List<int>();

        //input = "ABCDABCABCDABCABCDABCABCDABCD";
        //inputLength = input.Length;
        //replaceBy = new string[] { "AA", "BB", "CC", "DD", "EE" };

        //my own string.indexof to save a few ms
        int len = inputLength;

        fixed (char* i = input, r = replace)
        {
            int replaceValAsInt = *((int*)r);

            while (--len > -1)
            {
                if (replaceValAsInt == *((int*)&i[len]))
                {
                    indexes.Add(len--);
                }
            }                
        }

        var idx = indexes.ToArray();
        len = indexes.Count;

        Parallel.For(0, replaceBy.Length, l =>
            Process(input, inputLength, replaceBy[l], idx, len)
        );
    }

    private unsafe static void Process(string input, int len, string replaceBy, int[] idx, int idxLen)
    {
        var output = new char[len];

        fixed (char* o = output, i = input, r = replaceBy)
        {
            int replaceByValAsInt = *((int*)r);

            //direct copy, simulate string.copy
            while (--len > -1)
            {
                o[len] = i[len];
            }

            while (--idxLen > -1)
            {
                ((int*)&o[idx[idxLen]])[0] = replaceByValAsInt;
            }
        }

        //Console.WriteLine(output);
    }


回答3:

It sounds like you are tokenising the string? I would look at producing a buffer and indexing your tokens. Or using a templating engine

As a naive example you could use code generation to make the following method

public string Produce(string tokenValue){

    var builder = new StringBuilder();
    builder.Append("A");
    builder.Append(tokenValue);
    builder.Append("D");

    return builder.ToString();

}

If your running the iterations enough times, the time to build the template will pay for itself. You can then also call that method in parallel with no side effects. Also look at interning your strings



回答4:

I made a variation on Fredou's code that requires less compares as it works on int* instead of char*. It still requires n iterations for a string of n length, it just has to do less comparing. You could have n/2 iterations if the string is neatly aligned by 2 (so the string to replace can only occur at indexes 0, 2, 4, 6, 8, etc) or even n/4 if it's aligned by 4 (you'd use long*). I'm not very good at bit fiddling like this, so someone might be able to find some obvious flaw in my code that could be more efficient. I verified that the result of my variation is the same as that of the simple string.Replace.

Additionally, I expect that some gains could be made in the 500x string.Copy that it does, but haven't looked into that yet.

My results (Fredou II):

IMPLEMENTATION       |  EXEC MS | GC MS
#1 Simple            |     6816 |     0
#2 Simple parallel   |     4202 |     0
#3 ParallelSubstring |    27839 |     4
#4 Fredou I          |     2103 |   106
#5 Fredou II         |     1334 |    91

So about 2/3 of the time (x86, but x64 was about the same).

For this code:

private unsafe struct TwoCharStringChunk
{
  public fixed char chars[2];
}

private unsafe static void FredouImplementation_Variation1(string input, int inputLength, string replace, TwoCharStringChunk[] replaceBy)
{
  var output = new string[replaceBy.Length];

  for (var i = 0; i < replaceBy.Length; ++i)
    output[i] = string.Copy(input);

  var r = new TwoCharStringChunk();
  r.chars[0] = replace[0];
  r.chars[1] = replace[1];

  _staticToReplace = r;

  Parallel.For(0, replaceBy.Length, l => Process_Variation1(output[l], input, inputLength, replaceBy[l]));
}

private static TwoCharStringChunk _staticToReplace ;

private static unsafe void Process_Variation1(string output, string input, int len, TwoCharStringChunk replaceBy)
{
  int n = 0;
  int m = len - 1;

  fixed (char* i = input, o = output, chars = _staticToReplace .chars)
  {
    var replaceValAsInt = *((int*)chars);
    var replaceByValAsInt = *((int*)replaceBy.chars);

    while (n < m)
    {
      var compareInput = *((int*)&i[n]);

      if (compareInput == replaceValAsInt)
      {
        ((int*)&o[n])[0] = replaceByValAsInt;
        n += 2;
      }
      else
      {
        ++n;
      }
    }
  }
}

The struct with the fixed buffer is not strictly necessary here and could have been replaced with a simple int field, but expand the char[2] to char[3] and this code can be made to work with three letter strings as well, which wouldn't be possible if it was an int field.

It required some changes to the Program.cs as well, so here's the full gist:

https://gist.github.com/JulianR/7763857

EDIT: I'm not sure why my ParallelSubstring is so slow. I'm running .NET 4 in Release mode, no debugger, in either x86 or x64.



回答5:

As your input string can be as long as 2Mb, I don't foresee any memory allocation problem. You can load everything in memory and replace your data.

If from BC you ALWAYS needs to replace for AA, a String.Replace will be ok. But, if you need more control, you could use a Regex.Replace:

var input  = "ABCDABCABCDABCABCDABCABCDABCD";
var output = Regex.Replace(input, "BC", (match) =>
{
    // here you can add some juice, like counters, etc
    return "AA";
});


回答6:

You probably won't get anything faster than String.Replace (unless you go native) because iirc String.Replace is implemented in CLR itself for maximum performance. If you want 100% performance, you can conveniently interface with native ASM code via C++/CLI and go from there.



回答7:

My approach is a little like templating - it takes the input string and pulls out (removes) the substrings that are to be replaced. Then it takes the remaining parts of the string (the template) and combines them with the new replacement substrings. This is done in a parallel operation (template + each replacement string), which builds the output strings.

I think what I am explaining above may be clearer with code. This uses your sample inputs from above:

const char splitter = '\t';   // use a char that will not appear in your string

string input = "ABCDABCABCDABCABCDABCABCDABCD";
string oldString = "BC";
string[] newStrings = { "AA", "BB", "CC", "DD", "EE" };

// In input, replace oldString with tabs, so that we can do String.Split later
var inputTabbed = input.Replace(oldString, splitter.ToString());
// ABCDABCABCDABCABCDABCABCDABCD --> A\tDA\tA\tDA\tA\tDA\tA\tDA\tD

var inputs = inputTabbed.Split(splitter);
/* inputs (the template) now contains:
[0] "A" 
[1] "DA"
[2] "A" 
[3] "DA"
[4] "A" 
[5] "DA"
[6] "A" 
[7] "DA"
[8] "D" 
*/

// In parallel, build the output using the template (inputs)
// and the replacement strings (newStrings)
var outputs = new List<string>();
Parallel.ForEach(newStrings, iteration =>
    {
        var output = string.Join(iteration, inputs);
        // only lock the list operation
        lock (outputs) { outputs.Add(output); }
    });

foreach (var output in outputs)
    Console.WriteLine(output);

Output:

AAADAAAAAADAAAAAADAAAAAADAAAD
ABBDABBABBDABBABBDABBABBDABBD
ACCDACCACCDACCACCDACCACCDACCD
ADDDADDADDDADDADDDADDADDDADDD
AEEDAEEAEEDAEEAEEDAEEAEEDAEED

So you can do a comparison, here is a complete method which can be used in the test code by Erti-Chris Eelmaa:

private static void TemplatingImp(string input, string replaceWhat, IEnumerable<string> replaceIterations)
{
    const char splitter = '\t';   // use a char that will not appear in your string

    var inputTabbed = input.Replace(replaceWhat, splitter.ToString());
    var inputs = inputTabbed.Split(splitter);

    // In parallel, build the output using the split parts (inputs)
    // and the replacement strings (newStrings)
    //var outputs = new List<string>();
    Parallel.ForEach(replaceIterations, iteration =>
    {
        var output = string.Join(iteration, inputs);
    });
}


回答8:

I had a similar issue on a project and I've implemented a Regex solution to perform multiple and case insensitive replacements on a file.

For efficiency purposes, I set criteria to go through the original string only once.

I've published a simple console app to test some strategies on https://github.com/nmcc/Spikes/tree/master/StringMultipleReplacements

The code for the Regex solution is similar to this:

Dictionary<string, string> replacements = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    // Fill the dictionary with the proper replacements:

        StringBuilder patternBuilder = new StringBuilder();
                patternBuilder.Append('(');

                bool firstReplacement = true;

                foreach (var replacement in replacements.Keys)
                {
                    if (!firstReplacement)
                        patternBuilder.Append('|');
                    else
                        firstReplacement = false;

                    patternBuilder.Append('(');
                    patternBuilder.Append(Regex.Escape(replacement));
                    patternBuilder.Append(')');
                }
                patternBuilder.Append(')');

                var regex = new Regex(patternBuilder.ToString(), RegexOptions.IgnoreCase);

                return regex.Replace(sourceContent, new MatchEvaluator(match => replacements[match.Groups[1].Value]));

EDIT: The execution times running the test application on my computer are:

  • Looping through the replacements calling string.Substring() (CASE SENSITIVE): 2ms
  • Single pass using Regex with multiple replacements at once (Case insensitive): 8ms
  • Looping through replacements using a ReplaceIgnoreCase Extension (Case insensitive): 55ms