Threadpool program runs much slower on much faster

2020-02-26 01:59发布

问题:

upd I now think that root of my problem not "threading", because I observe slowdown at any point of my program. I think somehow when using 2 processors my program executes slower probably because two processors need to "communicate" between each other. I need to do some tests. I will try to disable one of the processors and see what happens.

====================================

I'm not sure if this is C# question, probably it more about hardware, but I think C# will be most suitable.

I was using cheap DL120 server and I decided to upgrade to much more expensive 2 processors DL360p server. Unexpectedly my C# program works about ~2 times slower on new server which supposed to be several times faster.

I processed FAST data for about 60 Instruments. I have created separate Task for each Instrument like that:

        BlockingCollection<OrderUpdate> updatesQuery;
        if (instrument2OrderUpdates.ContainsKey(instrument))
        {
            updatesQuery = instrument2OrderUpdates[instrument];
        } else
        {
            updatesQuery = new BlockingCollection<OrderUpdate>();
            instrument2OrderUpdates[instrument] = updatesQuery;
            ScheduleFastOrdersProcessing(updatesQuery);
        }
        orderUpdate.Checkpoint("updatesQuery.Add");
        updatesQuery.Add(orderUpdate);
    }

    private void ScheduleFastOrdersProcessing(BlockingCollection<OrderUpdate> updatesQuery)
    {
        Task.Factory.StartNew(() =>
        {
            Instrument instrument = null;
            OrderBook orderBook = null;
            int lastRptSeqNum = -1;
            while (!updatesQuery.IsCompleted)
            {
                OrderUpdate orderUpdate;
                try
                {
                    orderUpdate = updatesQuery.Take();
                } catch(InvalidOperationException e)
                {
                    Log.Push(LogItemType.Error, e.Message);
                    continue;
                }
                orderUpdate.Checkpoint("received from updatesQuery.Take()");
                ......................
                ...................... // long not interesting processing code
        }, TaskCreationOptions.LongRunning);

Because I have about 60 task which can be executed in parallel I expect that 2 * E5-2640 (24 virtual threads, 12 real threads) should perform much more faster than 1 * E3-1220 (4 real threads). It seems that using DL360p I found 95 threads in task manager. Using DL120 I have only 55 threads.

But execution time on DL120G7 is 2 (!!) times faster! E3-1220 has a little bit better clock rate than E5-2640 (3.1 GHz vs 2.5Ghz) however I still expect that my code should work faster on 2 * E5-2640 because it can be paralleled much better and I absolutely do not expect that it work 2 times slower!

HP DL120G7 E3-1220

~50 threads in Task Manager best = 24 average ~ 80 microseconds

 calling market.UpdateFastOrder = 23 updatesQuery.Add = 25 received from updatesQuery.Take() = 67 in orderbook = 80
 calling market.UpdateFastOrder = 30 updatesQuery.Add = 32 received from updatesQuery.Take() = 64 in orderbook = 73
 calling market.UpdateFastOrder = 31 updatesQuery.Add = 32 received from updatesQuery.Take() = 195 in orderbook = 204
 calling market.UpdateFastOrder = 31 updatesQuery.Add = 32 received from updatesQuery.Take() = 74 in orderbook = 86
 calling market.UpdateFastOrder = 18 updatesQuery.Add = 21 received from updatesQuery.Take() = 65 in orderbook = 78
 calling market.UpdateFastOrder = 29 updatesQuery.Add = 32 received from updatesQuery.Take() = 76 in orderbook = 88
 calling market.UpdateFastOrder = 30 updatesQuery.Add = 32 received from updatesQuery.Take() = 80 in orderbook = 92
 calling market.UpdateFastOrder = 20 updatesQuery.Add = 21 received from updatesQuery.Take() = 65 in orderbook = 78
 calling market.UpdateFastOrder = 21 updatesQuery.Add = 24 received from updatesQuery.Take() = 68 in orderbook = 81
 calling market.UpdateFastOrder = 12 updatesQuery.Add = 13 received from updatesQuery.Take() = 58 in orderbook = 72
 calling market.UpdateFastOrder = 22 updatesQuery.Add = 23 received from updatesQuery.Take() = 51 in orderbook = 59
 calling market.UpdateFastOrder = 16 updatesQuery.Add = 16 received from updatesQuery.Take() = 20 in orderbook = 24
 calling market.UpdateFastOrder = 28 updatesQuery.Add = 31 received from updatesQuery.Take() = 82 in orderbook = 94
 calling market.UpdateFastOrder = 18 updatesQuery.Add = 21 received from updatesQuery.Take() = 65 in orderbook = 77
 calling market.UpdateFastOrder = 29 updatesQuery.Add = 29 received from updatesQuery.Take() = 259 in orderbook = 264
 calling market.UpdateFastOrder = 49 updatesQuery.Add = 52 received from updatesQuery.Take() = 99 in orderbook = 113
 calling market.UpdateFastOrder = 22 updatesQuery.Add = 23 received from updatesQuery.Take() = 50 in orderbook = 60
 calling market.UpdateFastOrder = 29 updatesQuery.Add = 32 received from updatesQuery.Take() = 76 in orderbook = 88
 calling market.UpdateFastOrder = 16 updatesQuery.Add = 19 received from updatesQuery.Take() = 63 in orderbook = 75
 calling market.UpdateFastOrder = 27 updatesQuery.Add = 27 received from updatesQuery.Take() = 226 in orderbook = 231
 calling market.UpdateFastOrder = 15 updatesQuery.Add = 16 received from updatesQuery.Take() = 35 in orderbook = 42
 calling market.UpdateFastOrder = 18 updatesQuery.Add = 21 received from updatesQuery.Take() = 66 in orderbook = 78

HP DL360p G8 2 * E5-2640

~95 threads in Task Manager; best = 40 average ~ 150 microseconds

 calling market.UpdateFastOrder = 62 updatesQuery.Add = 64 received from updatesQuery.Take() = 144 in orderbook = 205
 calling market.UpdateFastOrder = 27 updatesQuery.Add = 32 received from updatesQuery.Take() = 101 in orderbook = 154
 calling market.UpdateFastOrder = 45 updatesQuery.Add = 50 received from updatesQuery.Take() = 124 in orderbook = 187
 calling market.UpdateFastOrder = 46 updatesQuery.Add = 51 received from updatesQuery.Take() = 127 in orderbook = 162
 calling market.UpdateFastOrder = 63 updatesQuery.Add = 68 received from updatesQuery.Take() = 137 in orderbook = 174
 calling market.UpdateFastOrder = 53 updatesQuery.Add = 55 received from updatesQuery.Take() = 133 in orderbook = 171
 calling market.UpdateFastOrder = 44 updatesQuery.Add = 46 received from updatesQuery.Take() = 131 in orderbook = 158
 calling market.UpdateFastOrder = 37 updatesQuery.Add = 39 received from updatesQuery.Take() = 102 in orderbook = 140
 calling market.UpdateFastOrder = 45 updatesQuery.Add = 50 received from updatesQuery.Take() = 115 in orderbook = 154
 calling market.UpdateFastOrder = 50 updatesQuery.Add = 55 received from updatesQuery.Take() = 133 in orderbook = 160
 calling market.UpdateFastOrder = 26 updatesQuery.Add = 50 received from updatesQuery.Take() = 99 in orderbook = 111
 calling market.UpdateFastOrder = 14 updatesQuery.Add = 30 received from updatesQuery.Take() = 36 in orderbook = 40   <-- best one I can find among thousands

Are you able to see why my program runs 2 times slower on several times faster server? Probably I should not create ~60 Task? Probably I should tell .NET not to use 95 threads but to limit it to 50 or even 24? Probably this is 2 processors vs 1 processor configuration issue? Probably just disabling one of the processors on my DL360P Gen8 will speed-up program significantly?

Added

  • calling market.UpdateFastOrder - orderUpdate object is created
  • updatesQuery.Add - orderUpdate is put into BlockingCollection
  • received from updatesQuery.Take() - orderUpdate ejected from BlockingCollection
  • in orderbook - orderUpdated is parsed and applied to orderBook

回答1:

just because you have a System which can handle much more threads, this does not mean that all of them can be fully processed parallel.

When I upgrade from a Quadcore CPU to a i7(virtual 8 cores), I noticed that a setup using more threads than cores resulted in the threads blocking each other for some time, which lead to an overall slowdown of the System.

The problem was just that my algorythims already were capable of using the full processing time of the core their thread was running on while waiting threads only worked on about 5 to 10%, which lead to the main threads to finish but some singe threads still having to do all their work(taking the same amout of time again).

The threadpool will only continue if all workers have finished, so the total amount of time until finishing will be unuset processor time for the other threads.

maybe you just need to find an optimal number of threads.