Hello all.
I wrote a small test case for the multi-threaded producer/consumer model. My testbed is a low-performance PC (8 GB RAM, J1900 CPU with 4 cores). I reserved core 0 for the Linux kernel and isolated cores 1-3, which are otherwise unused. The producer thread runs on core 1, allocates 5,000,000 small objects, and puts them into a global queue. The consumer thread runs on core 2 and deallocates the objects from the queue. But I found that if I do not set their CPU affinity (that is, they both run on core 0), the performance is better than with affinity set (8.76 s vs. 14.66 s). The results are consistent across runs. Could someone explain why? Even if my premise ("setting CPU affinity can improve a multi-threaded process's performance") is incorrect, performance should at least not get worse. My code is listed below:
void producer() {
    Timestamp begin;
    for ( int i = 0; i < data_nb; ++i ) {
        Test* test = new Test(i, i+1);  // one heap allocation per object
        queue.enqueue(test);
    }
    Timestamp end;
    TimeDuration td = end - begin;
    printf("producer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
}
void consumer() {
    Timestamp begin;
    do {
        Test* test = queue.dequeue();
        if ( test ) {
            nb.add(1); // nb is an atomic counter
            delete test;
            test = nullptr;
        }
    } while ( nb.get() < data_nb );
    Timestamp end;
    TimeDuration td = end - begin;
    //printf("%d data consumed\n", nb.get());
    printf("consumer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
}
Gaining performance from CPU affinity is not as simple as pushing thread 1 to core 1 and thread 2 to core 2. This is a complex and heavily researched topic; I'll touch on the highlights.
First off we need to define 'performance'. Typically, we are interested in throughput, latency and/or scalability. Combining all three is a tricky architecture question that has received intense scrutiny in the Telecom, Finance and other industries.
Your case appears to be driven by the throughput metric. We want the sum of wall clock time across threads to be minimal. Another way to state your question might be, "What are the factors that affect throughput in a multithreaded process?"
Here are some of the many factors:
- Algorithm complexity has the greatest influence. Big O, big theta, and little o are all extremely useful in complex cases. The example is trivial, but it still matters: on the surface the problem is O(n), so the time should be linear in the number of elements allocated and deallocated. Your question strikes at the heart of the matter because it shows that physical computers don't perfectly model ideal computers.
- CPU resource. Having CPU to throw at the problem helps if the problem can be parallelized. Your problem carries the underlying assumption that two threads will be better than one; if so, perhaps four will be better than two. Again, your actual results contradict the theoretical model.
- Queuing model. Understanding the Queuing model is vitally important if performance gains are to be achieved. The example problem appears to be the classic single producer/single consumer model.
- Other Resources. Depending on the problem a variety of other resources can constrain performance. Some factors include disk space, disk throughput, disk latency, network capacity, socket availability. The example doesn't appear to suffer here.
- Kernel dependencies. Moving to a lower level, performance can be dramatically impacted by the amount of kernel interaction. Generally, kernel calls require a context switch which can be expensive if done constantly. Your example likely suffers this issue via the calls to new/delete.
- Serial access. If a resource requires serial access, it will become a bottleneck in a parallel algorithm. Your example appears to have two such problems: the new/delete and the enqueue/dequeue.
- CPU cache. The comments mentioned CPU caching as a possibility. The L2/L3 caches can be a source of cache misses as well as false sharing; see the padding sketch just after this list. I doubt this is the main problem in your example, but it could be a factor.
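On the false-sharing point, here is a minimal sketch (my own illustration, not code from the question; the 64-byte line size is an assumption typical of x86): give each thread's hot counter its own cache line so the producer's and consumer's writes don't invalidate each other.

#include <atomic>

// alignas(64) pads each counter out to a full cache line, so the two
// threads never write to the same line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter produced; // written only by the producer
PaddedCounter consumed; // written only by the consumer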
Applying these ideas to your example, I see some issues. I assume you have two separate threads running hot and in parallel. One thread producing (new) and the other consuming (delete).
The heap is serial. Calling new and delete from different threads is a known performance bottleneck, because a typical allocator serializes access to its shared bookkeeping. There are several parallel small-block allocators available, including Hoard.
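As an illustration of taking the heap out of the hot path (a sketch built on the question's Test, data_nb, and queue, not a drop-in fix): preallocate every object in one contiguous slab so the loop performs no per-object new, and drop the per-object delete on the consumer side.

#include <vector>

// One big allocation up front; reserve() guarantees the vector never
// reallocates, so the enqueued pointers stay valid. The consumer must
// then consume without calling delete; the slab is released once at
// the end of the run.
std::vector<Test> slab;

void producer_slab() {
    slab.reserve(data_nb);
    for ( int i = 0; i < data_nb; ++i ) {
        slab.emplace_back(i, i + 1);
        queue.enqueue(&slab.back());
    }
}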
The queue is likely serial. The implementation is not shown, but the enqueue/dequeue is likely a point of serialization between the two threads. There are many examples of lock-free ring buffers that can be used between two threads; a minimal sketch follows.
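This is my own sketch of the classic single-producer/single-consumer ring (the capacity and element type are placeholders), using the usual acquire/release handshake: only the producer writes head_ and only the consumer writes tail_, so no lock is needed.

#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    std::atomic<size_t> head_{0}; // next slot to write; advanced only by the producer
    std::atomic<size_t> tail_{0}; // next slot to read; advanced only by the consumer
public:
    bool enqueue(const T& v) {  // called from the producer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N)
            return false;       // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool dequeue(T& v) {        // called from the consumer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t)
            return false;       // empty
        v = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};

Dropping something like SpscRing<Test*, 1024> in place of the shown queue should make enqueue/dequeue non-blocking on both sides.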
Thread starvation. In the example, if the producer is slower than the consumer, the consumer will simply idle much of the time. This is part of the queueing theory that must be considered when crafting a high-performance algorithm; one common mitigation is to back off when the queue is empty, as sketched below.
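A hedged sketch of that backoff, reusing the question's names (queue, nb, data_nb):

#include <thread>

void consumer_with_backoff() {
    while ( nb.get() < data_nb ) {
        Test* test = queue.dequeue();
        if ( !test ) {
            std::this_thread::yield(); // give up the time slice instead of spinning hot
            continue;
        }
        nb.add(1);
        delete test;
    }
}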
With all that background, we can now conclude that thread affinity isn't likely to matter until the serialization and starvation problems are solved. The two threads may in fact run slower as they contend with one another for the shared resources, or simply waste CPU time idling. As a result, the overall throughput goes down and thus the wall-clock time goes up.
There is a huge demand in industry for engineers who understand these kinds of algorithms. Educating yourself is likely to be a profitable venture.