C# & .NET: stackalloc

2019-03-09 00:57发布

问题:

I have a few questions about the functionality of the stackalloc operator.

  1. How does it actually allocate? I thought it does something like:

    void* stackalloc(int sizeInBytes)
    {
        void* p = StackPointer (esp);
        StackPointer += sizeInBytes;
        if(StackPointer exceeds stack size)
            throw new StackOverflowException(...);
        return p;
    }
    

    But I have done a few tests, and I'm not sure that's how it work. We can't know exactly what it does and how it does it, but I want to know the basics.

  2. I thought that stack allocation (Well, I am actually sure about it) is faster than heap allocation. So why does this example:

     class Program
     {
         static void Main(string[] args)
         {
             Stopwatch sw1 = new Stopwatch();
             sw1.Start();
             StackAllocation();
             Console.WriteLine(sw1.ElapsedTicks);
    
             Stopwatch sw2 = new Stopwatch();
             sw2.Start();
             HeapAllocation();
             Console.WriteLine(sw2.ElapsedTicks);
         }
         static unsafe void StackAllocation()
         {
             for (int i = 0; i < 100; i++)
             {
                 int* p = stackalloc int[100];
             }
         }
         static void HeapAllocation()
         {
             for (int i = 0; i < 100; i++)
             {
                 int[] a = new int[100];
             }
         }
     }
    

gives the average results of 280~ ticks for stack allocation, and usually 1-0 ticks for heap allocation? (On my personal computer, Intel Core i7).

On the computer I am using now (Intel Core 2 Duo), the results make more sense that the previous ones (Probably because Optimize code was not checked in VS): 460~ ticks for stack allocation, and about 380 ticks for heap allocation.

But this still doesn't make sense. Why is it so? I guess that the CLR notices that we don't use the array, so maybe it doesn't even allocate it?

回答1:

A case where stackalloc is faster:

 private static volatile int _dummy; // just to avoid any optimisations
                                         // that have us measuring the wrong
                                         // thing. Especially since the difference
                                         // is more noticable in a release build
                                         // (also more noticable on a multi-core
                                         // machine than single- or dual-core).
 static void Main(string[] args)
 {
     System.Diagnostics.Stopwatch sw1 = new System.Diagnostics.Stopwatch();
     Thread[] threads = new Thread[20];
     sw1.Start();
     for(int t = 0; t != 20; ++t)
     {
        threads[t] = new Thread(DoSA);
        threads[t].Start();
     }
     for(int t = 0; t != 20; ++t)
        threads[t].Join();
     Console.WriteLine(sw1.ElapsedTicks);

     System.Diagnostics.Stopwatch sw2 = new System.Diagnostics.Stopwatch();
     threads = new Thread[20];
     sw2.Start();
     for(int t = 0; t != 20; ++t)
     {
        threads[t] = new Thread(DoHA);
        threads[t].Start();
     }
     for(int t = 0; t != 20; ++t)
        threads[t].Join();
     Console.WriteLine(sw2.ElapsedTicks);
     Console.Read();
 }
 private static void DoSA()
 {
    Random rnd = new Random(1);
    for(int i = 0; i != 100000; ++i)
        StackAllocation(rnd);
 }
 static unsafe void StackAllocation(Random rnd)
 {
    int size = rnd.Next(1024, 131072);
    int* p = stackalloc int[size];
    _dummy = *(p + rnd.Next(0, size));
 }
 private static void DoHA()
 {
    Random rnd = new Random(1);
    for(int i = 0; i != 100000; ++i)
        HeapAllocation(rnd);
 }
 static void HeapAllocation(Random rnd)
 {
    int size = rnd.Next(1024, 131072);
    int[] a = new int[size];
    _dummy = a[rnd.Next(0, size)];
 }

Important differences between this code and that in the question:

  1. We have several threads running. With stack allocation, they are allocating in their own stack. With heap allocation, they are allocating from a heap shared with other threads.

  2. Larger sizes allocated.

  3. Different sizes allocated each time (though I seeded the random generator to make the tests more deterministic). This makes heap fragmentation more likely to happen, making heap allocation less efficient than with identical allocations each time.

As well as this, it's also worth noting that stackalloc would often be used as an alternative to using fixed to pin an array on the heap. Pinning arrays is bad for heap performance (not just for that code, but also for other threads using the same heap), so the performance impact would be even greater then, if the claimed memory would be in use for any reasonable length of time.

While my code demonstrates a case where stackalloc gives a performance benefit, that in the question is probably closer to most cases where someone might eagerly "optimise" by using it. Hopefully the two pieces of code together show that whole stackalloc can give a boost, it can also hurt performance a lot too.

Generally, you shouldn't even consider stackalloc unless you are going to need to use pinned memory for interacting with unmanaged code anyway, and it should be considered an alternative to fixed rather than an alternative to general heap allocation. Use in this case still requires caution, forethought before you start, and profiling after you finish.

Use in other cases could give a benefit, but it should be far down the list of performance improvements you would try.

Edit:

To answer part 1 of the question. Stackalloc is conceptually much as you describe. It obtains a chunk of the stack memory, and then returns a pointer to that chunk. It doesn't check the memory will fit as such, but rather if it attempts to obtain memory into the end of the stack - which is protected by .NET on thread creation - then this will cause the OS to return an exceptioin to the runtime, which it then turns into a .NET managed exception. Much the same happens if you just allocate a single byte in a method with infinite recursion - unless the call got optimised to avoid that stack allocation (sometimes possible), then a single byte will eventually add up to enough to trigger the stack overflow exception.



回答2:

  1. I can't give an exact answer but stackalloc is implemented using the IL opcode localloc. I looked at the machine code generated by a release build for stackallocand it was more convoluted than I expected. I don't know if localloc will check the stack size as you indicate by your if or if the stack overflow is detected by the CPU when the hardware stack actually overflows.

    The comments to this answer indicate that the link provided to localloc allocates space from "the local heap". The problem is that there is no good online reference for MSIL except the actual standard available in PDF format. The link above is from the System.Reflection.Emit.OpCodes class which isn't about MSIL but rather a library for generating MSIL.

    However, in the standards document ECMA 335 - Common Language Infrastructure there is a more precise description:

    Part of each method state is a local memory pool. Memory can be explicitly allocated from the local memory pool using the localloc instruction. All memory in the local memory pool is reclaimed on method exit, and that is the only way local memory pool memory is reclaimed (there is no instruction provided to free local memory that was allocated during this method invocation). The local memory pool is used to allocate objects whose type or size is not known at compile time and which the programmer does not wish to allocate in the managed heap.

    So basically the "local memory pool" is what is otherwise known as "the stack" and the C# language uses the stackalloc operator to allocate from this pool.

  2. In a release build the optimizer is smart enough to completely remove the call to HeapAllocation resulting in much lower execution time. It seems that it isn't smart enough to perform the same optimization when using stackalloc. If you either turn off optimization or in some way uses the allocated buffer you will see that stackalloc is slightly faster.