The title is the question. Below is my attempt to answer it through research. But I don't trust my uninformed research so I still pose the question (What is the fastest way to iterate through individual characters in a string in C#?).
Occasionally I want to cycle through the characters of a string one-by-one, such as when parsing for nested tokens -- something which cannot be done with regular expressions. I am wondering what the fastest way is to iterate through the individual characters in a string, particularly very large strings.
I did a bunch of testing myself and my results are below. However there are many readers with much more in depth knowledge of the .NET CLR and C# compiler so I don't know if I'm missing something obvious, or if I made a mistake in my test code. So I solicit your collective response. If anyone has insight into how the string indexer actually works that would be very helpful. (Is it a C# language feature compiled into something else behind the scenes? Or something built in to the CLR?).
The first method using a stream was taken directly from the accepted answer from the thread: how to generate a stream from a string?
Tests
longString
is a 99.1 million character string consisting of 89 copies of the plain-text version of the C# language specification. Results shown are for 20 iterations. Where there is a 'startup' time (such as for the first iteration of the implicitly created array in method #3), I tested that separately, such as by breaking from the loop after the first iteration.
Results
From my tests, caching the string in a char array using the ToCharArray() method is the fastest for iterating over the entire string. The ToCharArray() method is an upfront expense, and subsequent access to individual characters is slightly faster than the built in index accessor.
milliseconds
---------------------------------
Method Startup Iteration Total StdDev
------------------------------ ------- --------- ----- ------
1 index accessor 0 602 602 3
2 explicit convert ToCharArray 165 410 582 3
3 foreach (c in string.ToCharArray)168 455 623 3
4 StringReader 0 1150 1150 25
5 StreamWriter => Stream 405 1940 2345 20
6 GetBytes() => StreamReader 385 2065 2450 35
7 GetBytes() => BinaryReader 385 5465 5850 80
8 foreach (c in string) 0 960 960 4
Update: Per @Eric's comment, here are results for 100 iterations over a more normal 1.1 M char string (one copy of the C# spec). Indexer and char arrays are still fastest, followed by foreach(char in string), followed by stream methods.
milliseconds
---------------------------------
Method Startup Iteration Total StdDev
------------------------------ ------- --------- ----- ------
1 index accessor 0 6.6 6.6 0.11
2 explicit convert ToCharArray 2.4 5.0 7.4 0.30
3 for(c in string.ToCharArray) 2.4 4.7 7.1 0.33
4 StringReader 0 14.0 14.0 1.21
5 StreamWriter => Stream 5.3 21.8 27.1 0.46
6 GetBytes() => StreamReader 4.4 23.6 28.0 0.65
7 GetBytes() => BinaryReader 5.0 61.8 66.8 0.79
8 foreach (c in string) 0 10.3 10.3 0.11
Code Used (tested separately; shown together for brevity)
//1 index accessor
int strLength = longString.Length;
for (int i = 0; i < strLength; i++) { c = longString[i]; }
//2 explicit convert ToCharArray
int strLength = longString.Length;
char[] charArray = longString.ToCharArray();
for (int i = 0; i < strLength; i++) { c = charArray[i]; }
//3 for(c in string.ToCharArray)
foreach (char c in longString.ToCharArray()) { }
//4 use StringReader
int strLength = longString.Length;
StringReader sr = new StringReader(longString);
for (int i = 0; i < strLength; i++) { c = Convert.ToChar(sr.Read()); }
//5 StreamWriter => StreamReader
int strLength = longString.Length;
MemoryStream stream = new MemoryStream();
StreamWriter writer = new StreamWriter(stream);
writer.Write(longString);
writer.Flush();
stream.Position = 0;
StreamReader str = new StreamReader(stream);
while (stream.Position < strLength) { c = Convert.ToChar(str.Read()); }
//6 GetBytes() => StreamReader
int strLength = longString.Length;
MemoryStream stream = new MemoryStream(Encoding.Unicode.GetBytes(longString));
StreamReader str = new StreamReader(stream);
while (stream.Position < strLength) { c = Convert.ToChar(str.Read()); }
//7 GetBytes() => BinaryReader
int strLength = longString.Length;
MemoryStream stream = new MemoryStream(Encoding.Unicode.GetBytes(longString));
BinaryReader br = new BinaryReader(stream, Encoding.Unicode);
while (stream.Position < strLength) { c = br.ReadChar(); }
//8 foreach (c in string)
foreach (char c in longString) { }
Accepted answer:
I interpreted @CodeInChaos and Ben's notes as follows:
fixed (char* pString = longString) {
char* pChar = pString;
for (int i = 0; i < strLength; i++) {
c = *pChar ;
pChar++;
}
}
Execution for 100 iterations over the short string was 4.4 ms, with < 0.1 ms st dev.
These kind of artificial tests are pretty dangerous. Notable is that your //2 and //3 versions of the code never actually indexes the string. The jitter optimizer just throws away the code since the c variable isn't used at all. You are just measuring how long the for() loop takes. You can't really see this unless you look at the generated machine code.
Change it to
c += longString[i];
to force the array indexer to be used.Which is nonsense of course. Profile only real code.
If micro optimization is very important for you, then try this. (I assumed input string's length to be multiple of 8 for simplicity)
Just kidding, never use this code :)
The fastest answer is to use C++/CLI: How to: Access Characters in a System::String
This approach iterates through the characters in-place in the string using pointer arithmetic. There are no copies, no implicit range checks, and no per-element function calls.
It's likely possible to get (nearly, C++/CLI doesn't require pinning) the same performance from C# by writing an unsafe C# version of
PtrToStringChars
.Something like:Do remember to call
GCHandle.Free
afterwards.CodeInChaos's comment points out that C# provides a syntactic sugar for this:
If speed really matters
for
is faster thanforeach
TL;DR: a simple
foreach
is the fastest way to iterate a string.For people coming back to this: times change!
With the latest .NET 64-bit JIT, the unsafe version actually is the slowest.
Below is a benchmark implementation for BenchmarkDotNet. From these, I got the following results:
The interesting ones are the one that do not work on an array copy. This shows that indexing and
foreach
are very similar in performance, with a 5% difference,foreach
being faster. Usingunsafe
is actually 28% slower than using aforeach
.In the past
unsafe
may have been the fastest option, but JIT's get faster and smarter all the time.As a reference, the benchmark code:
The code has a few minor changes from the offered code. The chars that are retrieved from the original string are
|
-ed with the variable being returned, and we return the value. The reason for this is that we actually need to do something with the result. Otherwise, if we'd just be iterating over the string like:the JIT is free to remove this because it could infer that you're not actually observing the results of the iteration. By
|
-ing the characters in the array and returning this, BenchmarkDotNet will make sure that the JIT can't perform this optimization.Any reason not to include
foreach
?Is this really going to be your performance bottleneck, by the way? What proportion of your total running time does the iteration itself take?