I am developing a real-time application on a server with two NUMA nodes. Below is a simplified version of the system diagram (the OS is Ubuntu 14.04):
.-------------.          .-------------.
|  Device 0   |          |  Device 1   |
.-------------.          .-------------.
      ||                       ||
      ||                       ||
.-------------.          .-------------.
| PCIE slot 0 |          | PCIE slot 1 |
.-------------.          .-------------.
      ||                       ||
      ||                       ||
.-------------. QPI link .-------------.
|    CPU 0    |<-------->|    CPU 1    |
.-------------.          .-------------.
      ||                       ||
      ||                       ||
.-------------.          .-------------.
| Mem node 0  |          | Mem node 1  |
.-------------.          .-------------.
The two devices do exactly the same thing, but each of them consumes all the power of one CPU, so I have to create two threads to control them separately. I also have to use a third-party library to control the two devices.
Although I can use numa_alloc to allocate memory on the local node in my own thread, that doesn't change how the third-party library allocates its own memory.
First: If I run the program with only one device with numactl --membind, I get very good performance.
Second: If I run the program with two devices without numactl --membind, I get relatively good performance for both devices, but if I run it for a long time the performance is not very stable.
Third: If I run the program with two devices together with numactl --membind, say numactl --membind=0, then device 0 performs quite well but device 1 does not, and vice versa.
Based on the above observations, I suspect that memory locality is the performance bottleneck here.
My question: Can I put some sort of constraint on my thread so that everything allocated inside that thread is placed on a specific NUMA node, including allocations made by the third-party library?
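To make the question concrete, here is a minimal sketch of the kind of per-thread constraint I have in mind. It assumes the Linux set_mempolicy(2) system call, whose policy applies to the calling thread (as far as I know, libnuma's numa_set_membind() wraps the same call). The names node_mask_for, bind_thread_memory_to_node and device_worker are my own, and third_party_run_device is a hypothetical stand-in for the library's entry point. The policy only affects pages allocated after it is set, so I am unsure whether it would cover memory the library touched earlier or allocates from threads it created elsewhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>   /* MPOL_BIND */

/* Build a single-bit node mask for `node`.
 * Assumption: the node id fits into one unsigned long. */
unsigned long node_mask_for(int node)
{
    assert(node >= 0 && node < (int)(8 * sizeof(unsigned long)));
    return 1UL << node;
}

/* Restrict future page allocations of the CALLING THREAD to `node`.
 * Returns 0 on success, -1 on failure with errno set. */
int bind_thread_memory_to_node(int node)
{
    unsigned long mask = node_mask_for(node);
    /* Pass one more than the mask width so that every bit of `mask`
     * is considered by the kernel. */
    unsigned long maxnode = 8 * sizeof(mask) + 1;
    return (int)syscall(SYS_set_mempolicy, MPOL_BIND, &mask, maxnode);
}

/* Hypothetical per-device worker: set the policy first, then call
 * into the third-party library from this same thread. */
void *device_worker(void *arg)
{
    int node = *(const int *)arg;
    if (bind_thread_memory_to_node(node) != 0)
        fprintf(stderr, "set_mempolicy(node %d): %s\n", node, strerror(errno));
    /* third_party_run_device(node);  <- hypothetical library call */
    return NULL;
}
```

Each device thread would be started with pthread_create(&tid, NULL, device_worker, &node), with node set to 0 or 1. This sketch only sets the memory policy; pinning the thread to the CPUs of the same node (for example with pthread_setaffinity_np) would have to be done separately.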