How does NUMA architecture affect the performance

2019-03-13 04:17发布

问题:

We are migrating an ActivePivot application to a new server (4 sockets Intel Xeon, 512GB of memory). After deploying we launched our application benchmark (that's a mix of large OLAP queries concurrent to real-time transactions). The measured performance is almost twice slower than on our previous server, that has similar processors but twice less cores and twice less memory.

We have investigated the differences between the two servers, and it appears the big one has a NUMA architecture (non uniform memory acccess). Each CPU socket is physically close to 1/4 of the memory, but further away from the rest of it... The JVM that runs our application allocates a large global heap, there is a random fraction of that heap on each NUMA node. Our analysis is that the memory access pattern is pretty random and CPU cores frequently waste time accessing remote memory.

We are looking after more feedback about leveraging ActivePivot on NUMA severs. Can we configure ActivePivot cubes, or thread pools, change our queries, configure the operating system?

回答1:

Peter described the general JVM options available today to reduce the performance impact of NUMA architectures. To keep it short a NUMA aware JVM will partition the heap with respect to the NUMA nodes, and when a thread creates a new object, the object is allocated in the NUMA node of the core that runs that thread (if the same thread later uses it, the object will be in the local memory). Also when compacting the heap the NUMA aware JVM avoids moving large data chunks between nodes (and reduces the length of stop-the-world events).

So on any NUMA hardware and for any Java application the -XX:+UseNUMA option should probably be enabled.

But for ActivePivot that does not help much: ActivePivot is an in-memory database. There are real-time updates but the bulk of the data resides in the main memory for the life of the application. Whatever the JVM options, the data will be split among NUMA nodes, and the threads that execute queries will access memory randomly. Knowing that most sections of the ActivePivot query engine run as fast as memory can be fetched, the NUMA impact is particularly visible.

So how can you get the most from your ActivePivot solution on a NUMA hardware?

There is an easy solution when the ActivePivot application only uses a fraction of the resources (we find that it is often the case when several ActivePivot solutions run on the same server). For instance an ActivePivot solution that only uses 16 cores out of 64, and 256GB out of a TeraByte. In that case you can restrict the JVM process itself to a NUMA node.

On Linux you prefix the JVM launch with the following option ( http://linux.die.net/man/8/numactl ):

numactl --cpunodebind=xxx

If the entire server is dedicated to one ActivePivot solution, you can leverage the ActivePivot Distributed Architecture to partition the data. If there are 4 NUMA nodes, you start 4 JVMs hosting 4 ActivePivot nodes, each one bound to its NUMA node. With this deployment queries are distributed among the nodes, and each node will perform its share of the work at max performance, within the right NUMA node.



回答2:

You can try using -XX:+UseNUMA

http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html

If this doesn't yield the result you expect you might have to use taskset to lock a JVM to a specific socket and effectively break the server into four machines with one JVM each.

I have observed that machine with more sockets have slower access to their memory (even their local memory) and how always give you the performance gains you want as a result.