Hi all.
I wrote a small test case for a multi-threaded producer/consumer model. My testbed is a low-performance PC (8 GB RAM, J1900 CPU with 4 cores). I isolated core 0 for the Linux kernel; cores 1-3 are otherwise unused. The producer thread runs on core 1: it allocates 5,000,000 small objects and puts them into a global queue. The consumer thread runs on core 2 and deallocates the objects it takes from the queue. But I found that if I do not set their CPU affinity (that is, both threads run on the non-isolated core 0), performance is better than with affinity set (8.76s vs. 14.66s). The results are consistent across runs. Could someone explain why? Even if my premise ("setting CPU affinity can improve a multi-threaded process's performance") is incorrect, performance should at least not get worse. My code slice is listed below:
void producer() {
    Timestamp begin;
    for ( int i = 0; i < data_nb; ++i ) {
        Test* test = new Test(i, i+1);
        queue.enqueue(test);
    }
    Timestamp end;
    TimeDuration td = end - begin;
    printf("producer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
}
void consumer() {
    Timestamp begin;
    do {
        Test* test = queue.dequeue();
        if ( test ) {
            nb.add(1); // nb is an atomic counter
            delete test;
            test = nullptr;
        }
    } while ( nb.get() < data_nb );
    Timestamp end;
    TimeDuration td = end - begin;
    //printf("%d data consumed\n", nb.get());
    printf("consumer: %ldus(%.6fs)\n", td.asMicroSecond(), td.asSecond());
}
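For reference, pinning a thread to one core on Linux can be done along these lines (a sketch using std::thread and pthread_setaffinity_np; my actual wrapper code differs, but the effect is the same):

```cpp
// Linux-specific; pthread_setaffinity_np is a GNU extension.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

// Restrict a std::thread to a single core; returns true on success.
bool pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);  // allow only the given core
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
}
```

In the affinity run I call the equivalent of pin_to_core(prod, 1) and pin_to_core(cons, 2); in the other run I skip both calls.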
Gaining performance from CPU affinity is not as simple as pushing thread 1 to core 1 and thread 2 to core 2. This is a complex and heavily researched topic; I'll touch on the highlights.
First off, we need to define 'performance'. Typically, we are interested in throughput, latency and/or scalability. Combining all three is a tricky architecture question that has received intense scrutiny in the telecom, finance and other industries.
Your case appears to be driven by the throughput metric. We want the sum of wall clock time across threads to be minimal. Another way to state your question might be, "What are the factors that affect throughput in a multithreaded process?"
Here are some of the many factors at play, applied directly to your example, where I see some issues. I assume you have two separate threads running hot and in parallel: one producing (new) and the other consuming (delete).
The heap is serial. Calling new in one thread and delete in another is a known performance bottleneck: every allocation and free contends on the same allocator state. There are several parallel small-block allocators available, including Hoard.
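Illustrating that point: one mitigation, short of dropping in a parallel allocator like Hoard, is to recycle objects through a free list so the heap is touched only on a miss. A minimal sketch, where the simplified Test struct and all names are illustrative placeholders, not the asker's code:

```cpp
#include <deque>
#include <mutex>

struct Test { int a = 0, b = 0; };  // stand-in for the real Test

// Objects the consumer is done with go back here instead of to delete,
// so the producer can reuse them on the next acquire().
class RecyclePool {
public:
    Test* acquire() {                        // called by the producer
        std::lock_guard<std::mutex> lk(mu_);
        if (free_.empty()) return new Test;  // heap hit only on a miss
        Test* t = free_.back();
        free_.pop_back();
        return t;
    }
    void release(Test* t) {                  // called by the consumer
        std::lock_guard<std::mutex> lk(mu_);
        free_.push_back(t);
    }
    ~RecyclePool() {
        for (Test* t : free_) delete t;      // real frees happen once, here
    }
private:
    std::mutex mu_;
    std::deque<Test*> free_;
};
```

The mutex here is itself a serialization point; in a real design you would batch releases or use a lock-free free list, but the sketch shows the idea.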
The queue is likely serial. The implementation is not shown, but the enqueue/dequeue is likely a point of serialization between the two threads. There are many examples of lock-free ring buffers that can be used between multiple threads.
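For that second point, here is the shape of a bounded single-producer/single-consumer ring buffer built on two atomic indices. This is a standard illustrative design, not a drop-in replacement for the queue in the question:

```cpp
#include <atomic>
#include <cstddef>

// SPSC lock-free ring buffer: safe for exactly one producer thread and
// one consumer thread. Capacity must be a power of two; one slot is
// sacrificed to distinguish full from empty.
template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool enqueue(T v) {                       // producer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        size_t n = (h + 1) & (N - 1);
        if (n == tail_.load(std::memory_order_acquire))
            return false;                     // full
        buf_[h] = v;
        head_.store(n, std::memory_order_release);  // publish the slot
        return true;
    }
    bool dequeue(T& out) {                    // consumer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;                     // empty
        out = buf_[t];
        tail_.store((t + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};   // next write / next read
};
```

The acquire/release pairing is what makes the handoff safe without a lock: the consumer only reads a slot after the producer's release store to head_ has published it.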
Thread starvation. In the example, if the producer is slower than the consumer, the consumer will idle for much of the time. This is part of the queueing theory that must be considered when crafting a high-performance algorithm.
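On starvation: if the queue implementation busy-polls (as the consumer's do/while loop suggests), a pinned consumer burns its whole core while idling. A condition variable lets it sleep until data arrives. A minimal sketch of such a blocking queue; again this is illustrative, not the question's queue class:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Blocking MPMC queue: dequeue() sleeps instead of spinning when empty.
template <typename T>
class BlockingQueue {
public:
    void enqueue(T v) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            q_.push(std::move(v));
        }                          // unlock before notifying
        cv_.notify_one();          // wake one sleeping consumer
    }
    T dequeue() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !q_.empty(); });  // sleep while empty
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<T> q_;
};
```

This trades the lowest possible latency for not wasting a core; a hybrid that spins briefly before sleeping is a common compromise.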
With all that background, we can now conclude that thread affinity isn't likely to matter until the serialization and starvation problems are solved. Pinned to separate cores, the two threads contend for the shared heap and queue, bouncing the contended cache lines between cores, or simply waste CPU time idling; running unpinned on the same core 0, the shared data stays in one core's cache and an idle thread yields the core to the busy one. As a result, with affinity set the overall throughput goes down and thus the wall-clock time goes up.
There is a huge demand in industry for engineers who understand these kinds of algorithms. Educating yourself is likely to be a profitable venture.