Optimizing C# large dataset iterations - External

The current task, iterating over massive dictionaries, is giving me a headache. I cannot pinpoint the exact source of high CPU usage here so I hope some of the C# gurus here can give me some hints and tips.

The setup is 10 preallocated Dictionary<Guid, byte[]> instances, each holding one million entries. Each dictionary is iterated on its own thread. Simply iterating over all of them and passing each byte[] reference to the iteration delegate, which produces a throwaway result, takes under 2 ms, but actually reading any byte of an entry causes this number to rise to 300+ ms.

Note: The iteration delegate is constructed once, before any iteration, and after that I only pass a reference to it.

If I'm not doing anything with the received byte[] reference, it's all incredibly fast:

            var iterationDelegate = new Action<byte[]>((bytes) =>
            {
                var x = 5 + 10;
            });

But once I attempt to access the very first byte (which actually contains a pointer to the row's metadata stored elsewhere):

            var iterationDelegate = new Action<byte[]>((bytes) =>
            {
                var b = (int)bytes[0];
            });

The total time shoots up and what's even weirder, the first set of iterations takes 30ms, the second 40+, the third 100+ and the fourth can take 500ms+... then I stop testing the performance, Sleep the calling thread for a few seconds and once I start iterating again, it starts casually at 30ms and then rises same as before until I give it "time to breathe" again.

When I watch it in the VS CPU call tree, 93% of the CPU is consumed by [External Code], which I cannot expand or otherwise identify.

Is there anything I can do to help this? Is it the GC having a rough time?
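
One way to check whether the GC is involved is to count collections around a single timed pass. The following is only a minimal sketch: `GcProbe` and `CollectionsDuring` are hypothetical names, and the 100,000 small arrays merely stand in for the real dictionaries. It relies only on `GC.CollectionCount` and `GC.MaxGeneration`.

```csharp
using System;
using System.Diagnostics;

static class GcProbe
{
    // Runs the action once and returns, per generation, how many
    // garbage collections happened while it was running.
    public static int[] CollectionsDuring(Action action)
    {
        var before = new int[GC.MaxGeneration + 1];
        for (int g = 0; g < before.Length; g++)
            before[g] = GC.CollectionCount(g);

        action();

        var delta = new int[before.Length];
        for (int g = 0; g < delta.Length; g++)
            delta[g] = GC.CollectionCount(g) - before[g];
        return delta;
    }

    static void Main()
    {
        // Stand-in data (assumption): small rows instead of the real stores.
        var rows = new byte[100000][];
        for (int i = 0; i < rows.Length; i++) rows[i] = new byte[70];

        long sum = 0;
        var sw = Stopwatch.StartNew();
        int[] collections = CollectionsDuring(() =>
        {
            foreach (var row in rows) sum += row[0];
        });
        sw.Stop();

        Console.WriteLine(sw.Elapsed.TotalMilliseconds + " ms, collections per generation: "
            + string.Join(", ", collections));
    }
}
```

If the counts stay at zero while the pass is slow, the time is probably going somewhere other than garbage collection.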

Edit 1: The actual code I want to run is:

            var iterationDelegate = new Action<byte[]>((data) =>
            {
                //compare two bytes, ensure the row belongs to desired table
                if (data[0] != table.TableIndex)
                    return;

                //get header length
                var headerLength = (int)data[1];

                //process the header info and retrieve the desired column data position:

                var columnInfoPos = (key * 6) + 2;

                var pointers = new int[3] {
                    //data position
                    BitConverter.ToInt32(new byte[4] {
                        data[columnInfoPos],
                        data[columnInfoPos + 1],
                        data[columnInfoPos + 2],
                        data[columnInfoPos + 3] }, 0),
                    //data length
                    BitConverter.ToUInt16(new byte[2] {
                        data[columnInfoPos + 4],
                        data[columnInfoPos + 5] }, 0),
                    //column info position
                    columnInfoPos };


            });

But this code is even slower: the iteration times are ~150, ~300, ~600, and 700+ ms.
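
The delegate above allocates three small arrays for every row, so a pass over 10 million rows creates tens of millions of short-lived objects. A possible allocation-free variant, sketched under the assumption that the bytes are stored in the machine's native byte order (which is what the original `BitConverter` calls also assume), reads straight from the row with the `startIndex` overloads. `RowHeader` and its method names are hypothetical.

```csharp
using System;

static class RowHeader
{
    // Offset of a column's 6-byte {POS x4, LEN x2} block:
    // two leading bytes (T, HL), then 6 bytes per column.
    public static int ColumnInfoPos(int column)
    {
        return column * 6 + 2;
    }

    // Reads the 4-byte data position directly from the row, no temporary array.
    public static int ReadDataPosition(byte[] data, int column)
    {
        return BitConverter.ToInt32(data, ColumnInfoPos(column));
    }

    // Reads the 2-byte data length that follows the position.
    public static ushort ReadDataLength(byte[] data, int column)
    {
        return BitConverter.ToUInt16(data, ColumnInfoPos(column) + 4);
    }

    static void Main()
    {
        // One-column row: T, HL, POS = 8, LEN = 3, then 3 payload bytes.
        var row = new byte[11];
        row[0] = 1;
        row[1] = 8; // assumption: HL counts the T and HL bytes too
        BitConverter.GetBytes(8).CopyTo(row, 2);
        BitConverter.GetBytes((ushort)3).CopyTo(row, 6);

        Console.WriteLine(ReadDataPosition(row, 0)); // 8
        Console.WriteLine(ReadDataLength(row, 0));   // 3
    }
}
```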

This is the worker class that's kept alive for each store, each on its own thread:

            class PartitionWorker
            {
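                private readonly ManualResetEvent waitHandle = new ManualResetEvent(true);
                private readonly object key = new object();
                //volatile so the worker thread sees a Stop() issued from another thread
                private volatile bool stop = false;
                private readonly Queue<Action> queue = new Queue<Action>();

                public void AddTask(Action task)
                {
                    lock (key)
                        queue.Enqueue(task);
                    waitHandle.Set();
                }

                public void Run()
                {
                    while (!stop)
                    {
                        Action task = null;
                        lock (key)
                        {
                            if (queue.Count > 0)
                                task = queue.Dequeue();
                            else if (!stop)
                                //reset under the lock so a Set() from AddTask or Stop is never lost
                                waitHandle.Reset();
                        }
                        if (task != null)
                        {
                            //run outside the lock so AddTask is not blocked while the task executes
                            task();
                            continue;
                        }
                        waitHandle.WaitOne();
                    }
                }

                public void Stop()
                {
                    lock (key)
                        stop = true;
                    //wake the worker so it can observe the flag and exit
                    waitHandle.Set();
                }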
            }

And lastly, the code that launches the iterations. It is run from a Task for each incoming TCP request.

            for (var memoryPartition = 0; memoryPartition < partitions; memoryPartition++)
            {
                var memIndex = memoryPartition;
                mem[memIndex].AddJob(() =>
                {
                    try
                    {
                        //... to keep it short I have excluded the read lock and its try/finally
                        foreach (var obj in mem[memIndex].innerCache.Values)
                        {
                            iterationDelegate(obj.bytes);
                        }
                        //release readlock in finally..
                    }
                    catch
                    {

                    }
                    finally
                    {
                        latch.Signal();
                    }
                });
            }
            //assuming latch is a CountdownEvent: Wait(50) returns false on timeout, it does not throw
            if (latch.Wait(50))
            {
                sw.Stop();
                Console.WriteLine("Found " + result.Count + " in " + sw.Elapsed.TotalMilliseconds + "ms");
            }
            else
            {
                Console.WriteLine(">50");
            }

Edit 2: The dictionaries are preallocated using

            private Dictionary<Guid, byte[]> innerCache = new Dictionary<Guid, byte[]>(part_max_entries);

As for the entries, they are 70 bytes on average. The process takes around 2 GB of memory with 10,000,000 entries split among 10 dictionaries.

The structure of an entry is as follows:

T | HL | {POS | POS | POS | POS | LEN | LEN} | {data bytes}

where | separates individual bytes

T is a one-byte pointer into the table metadata dictionary
HL is a one-byte length of the header portion of the entry

POS and LEN repeat for each data value in the entry:

POSx4 = int indicating the position of this data in the entry
LENx2 = ushort length of this data in the entry

and then {data bytes} is the data payload
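
To make the layout concrete, here is a minimal sketch of how such an entry could be assembled. It rests on assumptions the description does not settle: HL counts the T and HL bytes themselves, POS is an absolute offset from the start of the entry, and values are written in the machine's native byte order. `EntryBuilder` and `Build` are hypothetical names.

```csharp
using System;

static class EntryBuilder
{
    // Builds T | HL | {POS x4 | LEN x2} per column | data bytes.
    public static byte[] Build(byte tableIndex, params byte[][] columns)
    {
        // Assumption: the header length includes the T and HL bytes.
        // HL is a single byte, so this sketch fits at most 42 columns.
        int headerLength = 2 + 6 * columns.Length;
        int total = headerLength;
        foreach (var column in columns) total += column.Length;

        var entry = new byte[total];
        entry[0] = tableIndex;
        entry[1] = (byte)headerLength;

        int dataPos = headerLength; // payload starts right after the header
        for (int i = 0; i < columns.Length; i++)
        {
            int info = 2 + 6 * i;
            BitConverter.GetBytes(dataPos).CopyTo(entry, info);                          // POS x4
            BitConverter.GetBytes((ushort)columns[i].Length).CopyTo(entry, info + 4);   // LEN x2
            columns[i].CopyTo(entry, dataPos);                                           // data bytes
            dataPos += columns[i].Length;
        }
        return entry;
    }

    static void Main()
    {
        byte[] entry = Build(1, new byte[] { 1, 2, 3 }, new byte[] { 9 });
        Console.WriteLine(entry.Length);                   // 18
        Console.WriteLine(entry[1]);                       // 14
        Console.WriteLine(BitConverter.ToInt32(entry, 8)); // 17
    }
}
```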

For those who might be wondering, the greatest performance gain came from using hot spinning instead of sleeping, delaying, or WaitHandles. The CPU hit is negligible even with a large number of parallel requests. For very intensive operations there is a fallback: if the spinning takes longer than 3 ms, it falls back to a thread wait. The code now runs at a fairly constant 24 ms per 10 million entries. Removing anything that triggers GC collections from the code and recycling as many variables as I can was also beneficial.

Here's the spinner code I use:

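    private static void spin(ref Stopwatch sw, double spinSeconds)
    {
        //ElapsedTicks is in Stopwatch ticks, so convert the limit from seconds first
        long spinTicks = (long)(spinSeconds * Stopwatch.Frequency);
        //Restart so time measured by earlier calls does not count against this spin
        sw.Restart();
        while (sw.ElapsedTicks < spinTicks) { }
        sw.Stop();
    }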

Note: This can only be used with code that is running on its own thread! If you use it in a single-threaded application, you will block all your code here.
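
The spin-then-fall-back idea can also be sketched with framework primitives. This is not the code described above: it swaps in `SpinWait.SpinUntil`, which yields periodically instead of spinning hot, and a `ManualResetEventSlim` for the blocking part. `SpinThenWait`, `Wait`, and the 3 ms budget in `Main` are illustrative assumptions.

```csharp
using System;
using System.Threading;

static class SpinThenWait
{
    // Spins until the condition holds or the budget runs out,
    // then falls back to a blocking wait on the event.
    public static void Wait(Func<bool> condition, ManualResetEventSlim signal, TimeSpan spinBudget)
    {
        if (SpinWait.SpinUntil(condition, spinBudget))
            return;      // condition became true while spinning
        signal.Wait();   // budget exhausted: block until signalled
    }

    static void Main()
    {
        var signal = new ManualResetEventSlim(false);
        bool done = false;

        var worker = new Thread(() =>
        {
            Thread.Sleep(20);              // simulated work, longer than the spin budget
            Volatile.Write(ref done, true);
            signal.Set();
        });
        worker.Start();

        Wait(() => Volatile.Read(ref done), signal, TimeSpan.FromMilliseconds(3));
        worker.Join();
        Console.WriteLine(done); // True
    }
}
```

Because the event stays set once signalled, it does not matter whether the worker finishes just before or just after the spin budget expires.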

Edit: It's also worth noting that, for some reason, rewriting the for loop so that it counts down to 0 had a significant performance impact. I don't know the exact mechanics, but I assume comparing against zero is simply faster.

I also modified the dictionary: it's now a Dictionary<Guid, int>. I added a byte[][] array, and the int stored in the dictionary is an index into this array. Iterating over this array is way faster than enumerating the dictionary and iterating over its values. I did need to implement some extra mechanics to keep the two consistent, though.
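
A minimal sketch of that layout follows. The consistency mechanics are not described above, so the swap-with-last removal used here is an assumption, as are the names `IndexedStore`, `Add`, `Remove`, and `ForEach`; locking is left out.

```csharp
using System;
using System.Collections.Generic;

class IndexedStore
{
    private readonly Dictionary<Guid, int> index; // key -> slot in rows
    private readonly byte[][] rows;               // densely packed rows, iterated directly
    private readonly Guid[] keys;                 // slot -> key, needed when a row is moved
    private int count;

    public IndexedStore(int capacity)
    {
        index = new Dictionary<Guid, int>(capacity);
        rows = new byte[capacity][];
        keys = new Guid[capacity];
    }

    public int Count { get { return count; } }

    public void Add(Guid id, byte[] row)
    {
        if (count == rows.Length) throw new InvalidOperationException("store is full");
        index.Add(id, count); // throws on a duplicate key before anything else changes
        rows[count] = row;
        keys[count] = id;
        count++;
    }

    public bool Remove(Guid id)
    {
        int slot;
        if (!index.TryGetValue(id, out slot)) return false;
        index.Remove(id);

        int last = --count;
        if (slot != last)
        {
            // Move the last row into the freed slot so the array stays dense.
            rows[slot] = rows[last];
            keys[slot] = keys[last];
            index[keys[slot]] = slot;
        }
        rows[last] = null;
        return true;
    }

    // Iterates the dense part of the array, counting down to zero.
    public void ForEach(Action<byte[]> visit)
    {
        for (int i = count - 1; i >= 0; i--) visit(rows[i]);
    }

    static void Main()
    {
        var store = new IndexedStore(4);
        Guid a = Guid.NewGuid(), b = Guid.NewGuid(), c = Guid.NewGuid();
        store.Add(a, new byte[] { 1 });
        store.Add(b, new byte[] { 2 });
        store.Add(c, new byte[] { 3 });
        store.Remove(a);

        int sum = 0;
        store.ForEach(row => sum += row[0]);
        Console.WriteLine(store.Count + " rows, sum " + sum); // 2 rows, sum 5
    }
}
```

Swap-with-last keeps removal O(1) and the array free of gaps, at the cost of not preserving insertion order.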