String sorting performance degradation in VS2010 v

2019-01-18 10:22发布

问题:

The following C# code seems to run slower when built with VS2010 than with VS2008: on a Core i5 Win7 x64 8 GB RAM PC, the VS2008 built version sorts strings in about 7.5 seconds, instead the VS2010 built version requires about 9 seconds. Why is that?

Is there anything wrong with my code?

Did the sorting algorithm change in VS2010?

Is there anything different in the underlying CLR that makes the performance worse?

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.Linq;

namespace StringSortCSharp
{
    /// <summary>
    /// Console app to test string sorting performance in C#.
    /// </summary>
    class Program
    {
        /// <summary>
        /// Displays the first lines from a vector of strings.
        /// </summary>
        /// <param name="wishedN">Number of lines to display.</param>
        /// <param name="lines">Source lines to display.</param>
        private static void DisplayFirst(int wishedN, List<string> lines)
        {
            int n = Math.Min(wishedN, lines.Count);
            for (int i = 0; i < n; i++)
            {
                Console.WriteLine("  " + lines[i]);
            }
            Console.WriteLine();
        }

        /// <summary>
        /// Used for random permutation.
        /// </summary>
        private static Random random = new Random();

        /// <summary>
        /// Computes a random permutation of the input sequence.
        /// 
        /// From:
        ///     http://stackoverflow.com/questions/375351/most-efficient-way-to-randomly-sort-shuffle-a-list-of-integers-in-c-sharp
        /// 
        /// </summary>
        /// <typeparam name="T">Type stored in the sequences.</typeparam>
        /// <param name="sequence">Input sequence.</param>
        /// <returns>Random permutation of the input sequence.</returns>
        private static IEnumerable<T> RandomPermutation<T>(IEnumerable<T> sequence)
        {
            T[] retArray = sequence.ToArray();


            for (int i = 0; i < retArray.Length - 1; i += 1)
            {
                int swapIndex = random.Next(i + 1, retArray.Length);
                T temp = retArray[i];
                retArray[i] = retArray[swapIndex];
                retArray[swapIndex] = temp;
            }
            return retArray;
        }


        /// <summary>
        /// Builds a list of strings used in the performance benchmark.
        /// </summary>
        /// <returns>Test list of strings.</returns>
        private static List<string> BuildTestLines()
        {
            // Start with "Lorem ipsum", and repeat it several times, adding some suffix strings.

            var lorem = new string[]
             {
                 "Lorem ipsum dolor sit amet, consectetuer adipiscing elit.",
                 "Maecenas porttitor congue massa. Fusce posuere, magna sed",
                 "pulvinar ultricies, purus lectus malesuada libero,",
                 "sit amet commodo magna eros quis urna.",
                 "Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.",
                 "Pellentesque habitant morbi tristique senectus et netus et",
                 "malesuada fames ac turpis egestas. Proin pharetra nonummy pede.",
                 "Mauris et orci."
             };

            int repeatCount = 200 * 1000;

            Console.Write("Building test strings");
            var testLines = new List<string>();

            Console.Write(" (total string count = {0})", repeatCount * lorem.Length);
            Console.Write("...");
            for (int i = 0; i < repeatCount; i++)
            {
                for (int j = 0; j < lorem.Length; j++)
                {
                    // Add more stuff to Lorem strings 
                    testLines.Add(lorem[j] + " (#" + i + ")");
                }
            }
            Console.WriteLine("done.");

            DisplayFirst(5, testLines);
            Console.WriteLine();

            // Shuffle the previously built strings.

            Console.Write("Shuffling strings...");
            var randomLines = new List<string>(RandomPermutation(testLines));
            Console.WriteLine("done.");
            DisplayFirst(5, randomLines);
            Console.WriteLine();

            return randomLines;
        }


        /// <summary>
        /// Sort the input lines.
        /// </summary>
        /// <param name="lines">Input lines to sort.</param>
        private static void Test(List<string> lines)
        {
            // Stopwatch to measure time performance
            var timer = new Stopwatch();

            Console.Write("Sorting " + lines.Count + " lines...");

            // Sort benchmark

            timer.Start();
            lines.Sort();
            timer.Stop();
            Console.WriteLine("done.");

            // Display results

            DisplayFirst(5, lines);

            Console.WriteLine();
            Console.WriteLine((timer.ElapsedMilliseconds / 1000.0).ToString(CultureInfo.InvariantCulture) + " seconds elapsed.");
        }

        static void Main(string[] args)
        {
            Console.WriteLine("*** Testing String Sorting in C# ***");
            Console.WriteLine();

            // Build test lines used for the sort benchmark
            List<string> testLines = BuildTestLines();

            // Run the sort test
            Test(testLines);
        }
    }
}

回答1:

Here is a brief outline of sorting algorithms used in .NET versions. It's helpful to remember that List<T>.Sort() internally uses Array<T>.Sort()

  • In .NET 2.0-4.0, a quick sort algorithm is used to sort an Array. There have been minor changes to the code, but for the most part, the code remains the same.
  • In .NET 4.5, the array sorting algorithm changed from quick sort to an introspective sort. This is a larger change than from before, one that, at least in my tests, shows considerable performance improvements.

Did the sorting algorithm change in VS2010?

Yes, but the changes were minor, and doesn't affect performance. Consider a sort against 20 million shuffled integers1:

List<int>.Sort() (20 million)

.NET 3.5       .NET 4.0       .NET 4.5
---------      ---------      ---------
 2.564s         2.565s         2.337s

There's no change between v3.5 and v4.0 in terms of performance. There is a noticeable increase in speed for v4.5. It's clear that it's not the actual sorting algorithm that is making the difference.

Before we jump into your next question, let me share my results of running your actual code on my machine:

List<string>.Sort() (1.6 million)

.NET 3.5       .NET 4.0       .NET 4.5
---------      ---------      ---------
 7.953s         11.267s        10.092s

I get similar results, as you do. These results are a good lead-in to your next question:

Is there anything different in the underlying CLR that makes the performance worse?

Without a doubt. So, what is the difference? The difference is in string comparison implementation. In each step of the sorting algorithm it needs to compare the two strings, and it happens to do it differently between the v2.0 and v4.0 runtime. (See extra notes below)

The easiest way to prove this is to force sorting by ordinal position, instead of culture dependence. Replace lines.Sort(); with lines.Sort(StringComparer.Ordinal);. Here is what I measured:

List<string>.Sort(StringComparer.Ordinal) (1.6 million)

.NET 3.5       .NET 4.0       .NET 4.5
---------      ---------      ---------
 4.088s         3.76s          3.454s

Now, that looks better! It's more or less what I expected; a steady increase in speed for each version of the framework released. MSDN Suggests that if you're ever doing a non-linguistic comparison on a string, you should use an ordinal comparison.

However, that only solves the problem if your comparison or sorting isn't culture-sensitive. If you need culture-sensitive sorting, it seems you won't be able to get rid of the slower execution time unless you want to revert to the .NET 3.5 framework.


Extra notes

When you don't pass a comparer to List<T>.Sort() or Array.Sort, it will use the default comparer. Default comparers for .NET strings uses the comparer from the Thread's current culture. From there, it calls some internal functions in the .NET runtime native libraries.

In v2.0-3.5, it calls COMNlsInfo::Compare and COMNlsInfo::CompareFast. Here's what the call stack (kinda) looks like:

String.CompareTo(string)
+--System.Globalization.CompareInfo.Compare(string,string,CompareOptions)
   +--mscorwks.dll!COMNlsInfo::Compare
      +--mscorwks.dll!COMNlsInfo::CompareFast

Similar source for these functions is visible in the shared source implementation of the CLI (SSCLI). It's located in sscli\clr\src\classlibnative\nls\comnlsinfo.cpp on lines 1034 and 893, respectively.

In v4.0, however, that call tree changed fairly significantly:

String.CompareTo(string)
+--System.Globalization.CompareInfo.Compare(string,string,CompareOptions)
   +--clr.dll!COMNlsInfo::InternalCompareString
      +--clr.dll!SortVersioning::SortDllCompareString
         +--nlssorting.dll!_SortCompareString
            +--nlssorting.dll!_AsciiCompareString

I wish I could tell you why one is slower than the other, but I have no clue at all and there is no SSCLI for .NET 4.0 to compare against. The major changes to string handling in .NET 4.0 weren't without problems. There have been performance issues related to strings in .NET 4.0, however they don't really apply here.


1All tests were run in a virtual machine. Win 2008R2 x64 w/ 4GB RAM and a virtual quad-core processor. Host machine is Win7 x64 w/ 24GB RAM and a Xeon W3540 (2.93ghz) quad-core (8 Logical processors). Results are an average of 5 runs with the best and worst times removed.