Memory usage of ArangoDB

Posted 2020-02-17 05:57

Question:

I am trying to understand what the limits of Arangodb are and what the ideal setup is. From what I have understood arango stores all the collection data in the virtual memory and ideally you want this to fit in the RAM. If the collection grows and cannot fit in the RAM it will be swapped to disk.

So my first question. If my db grows will I need to adjust the swap partition/file to accommodate the db?

Since arango also syncs the data to disk does this mean that the data will always be located in the RAM and disk? So if I have a db that's 1.5GB and my RAM is 1GB I will need to at least have 0.5GB of swap disk and 1.5GB of regular disk space?

I am a bit confused how arango uses the virtual memory. Right now I have 7 collections that are practically empty. I have 1GB of RAM and 1GB of swap disk. The admin reports that arango is using 4.5GB of virtual memory. How is this possible if the swap disk is 1GB? It's currently using 80MB of RAM. Shouldn't this be 224MB if the journal size is 32MB for each collection?

What is the recommendation on the journal size vs collection size? Can this be dynamically adjusted as the collection grows?

What kind of performance is expected if the swap disk is used a lot when the disk is an SSD? If the swap disk is used a lot would the performance be similar to using a more traditional db such as mysql?

Answer 1:

ArangoDB stores all data in memory-mapped files. Each collection can have 0 to n datafiles, with a default filesize of 32 MB each (note that this filesize can be adjusted globally or on a per-collection level). An empty collection (that never had any data) will not have a datafile. The first write to a collection will create the datafile, and whenever a datafile is full, a new one will be created automatically.

Collections allocate datafiles in chunks of 32 MB by default. If you have many small collections, this might waste some memory. If you have few but big collections, the potential waste (free space at the end of a datafile) probably doesn't matter too much.
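To make the chunked allocation concrete, here is a rough sketch of the arithmetic. It is a simplified model, not ArangoDB's actual accounting: it ignores datafile headers and any other per-file overhead, and the function names are made up for illustration.

```python
# Simplified model of datafile allocation (assumption: no per-file overhead).
DEFAULT_DATAFILE_SIZE = 32 * 1024 * 1024  # 32 MB default mentioned above


def datafile_count(data_bytes, datafile_size=DEFAULT_DATAFILE_SIZE):
    """Number of datafiles needed; a collection without data has none."""
    if data_bytes <= 0:
        return 0
    # Integer ceiling division: a partially filled file still counts as one.
    return (data_bytes + datafile_size - 1) // datafile_size


def unused_tail(data_bytes, datafile_size=DEFAULT_DATAFILE_SIZE):
    """Free bytes at the end of the last datafile."""
    return datafile_count(data_bytes, datafile_size) * datafile_size - data_bytes


if __name__ == "__main__":
    # Seven collections holding 1 KB each still map seven full datafiles.
    mapped = 7 * datafile_count(1024) * DEFAULT_DATAFILE_SIZE
    print(mapped // (1024 * 1024), "MB mapped for 7 tiny collections")
```

Under this model, seven collections that each hold a little data map 224 MB of address space, while seven collections that never had any data map nothing.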

Whenever any ArangoDB operation reads data from or writes data to a memory-mapped datafile, the operating system will first translate the offset into the file into a page number. This is because each datafile is implicitly split into pages of a specific size. How big a page is is platform-dependent, but let's assume pages are 4 KB in size. So a datafile with a default filesize will have 8192 pages.
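The offset-to-page translation is plain integer arithmetic. The sketch below assumes the 4 KB page size used in the text; `mmap.PAGESIZE` reports the actual page size of the machine it runs on. The helper names are illustrative only.

```python
import mmap


def page_number(offset, page_size=4096):
    """Index of the page that contains the given byte offset."""
    return offset // page_size


def pages_per_file(file_size, page_size=4096):
    """Number of pages needed to cover a file of the given size."""
    return (file_size + page_size - 1) // page_size


if __name__ == "__main__":
    print("page size on this machine:", mmap.PAGESIZE)
    # A 32 MB datafile split into 4 KB pages.
    print(pages_per_file(32 * 1024 * 1024))  # 8192
```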

After the OS has translated the offset into the file into a page number, it will make sure the data of the requested page is present in physical RAM. If the page is not yet in physical RAM, a page fault occurs, and the operating system handles it by loading the requested page from disk or swap into physical RAM. This will eventually make the complete page available in RAM, and any reads or writes to the page's data may occur after that.

All of this is done by the operating system's virtual memory manager. The operating system is free to map as many pages from a datafile into RAM as it sees fit. For example, when a memory-mapped file is accessed sequentially, the operating system will likely be clever and read ahead many pages, so they are already in physical RAM when actually accessed.
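For readers who have not used memory-mapped files, the following sketch shows the general mechanism with Python's standard `mmap` module. It is not ArangoDB code; the file size, offset and payload are arbitrary illustration values. The program only reads and writes a byte range, and the OS decides when pages are loaded or written back.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(size=1024 * 1024, offset=5000, payload=b"hello"):
    """Write through a memory mapping, then read the bytes back from the file."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # reserve the full file size up front
        with mmap.mmap(fd, size) as mm:
            # The first touch of this page may trigger a page fault.
            mm[offset:offset + len(payload)] = payload
            mm.flush()  # ask the OS to write dirty pages back to the file
            from_mapping = bytes(mm[offset:offset + len(payload)])
        with open(path, "rb") as f:
            f.seek(offset)
            from_file = f.read(len(payload))
        return from_mapping, from_file
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    print(mmap_roundtrip())
```

The same bytes are visible through the mapping and through ordinary file I/O, because both views are backed by the same pages.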

The OS is also free to swap out some or all pages of a datafile. It will likely swap out pages if there is not enough physical RAM available to keep all pages from all datafiles in RAM at the same time. It may also swap out pages that haven't been used for a while, to make RAM available for other operations. It will likely use some LRU algorithm for this.
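A toy simulation can show why it matters whether the pages in use fit into RAM. The sketch below assumes a strict LRU policy, which real kernels only approximate, and the function name is invented for this example.

```python
from collections import OrderedDict


def count_page_faults(accesses, frames):
    """Simulate strict LRU page replacement and return the number of faults."""
    if frames < 1:
        raise ValueError("frames must be at least 1")
    resident = OrderedDict()  # pages currently in RAM, least recently used first
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults


if __name__ == "__main__":
    fits = list(range(4)) * 10        # 4 distinct pages, 4 frames
    too_big = list(range(5)) * 10     # 5 distinct pages, 4 frames
    print(count_page_faults(fits, 4), count_page_faults(too_big, 4))
```

With four frames, cycling over four pages faults only on the first pass, while cycling over five pages faults on every single access. Real workloads are rarely this extreme, but the cliff once the working set exceeds RAM is the same effect.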

How the virtual memory manager of an OS behaves exactly is wildly different across platforms and implementations. Most systems also allow configuring the VM subsystem. Linux, for example, exposes a number of tunable parameters for its VM subsystem.

It is therefore hard to tell how much physical memory ArangoDB will actually use for a given number of collections and their datafiles. If the collections aren't accessed at all, having the datafiles memory-mapped might use almost no RAM, as the OS has probably swapped the collections out fully or at least partially. If the collections are heavily in use, the OS will likely have their datafiles fully mapped into RAM. But in both cases the memory counts as memory-mapped. This is why you can have a much higher virtual memory usage than you have physical RAM.
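On Linux, the gap between virtual and resident memory can be read directly from `/proc/<pid>/status`, where `VmSize` is the virtual size and `VmRSS` the resident set size, both in kB. The sketch below is Linux-only and the helper names are illustrative; the sample text uses the numbers from the question (4.5 GB virtual, 80 MB resident), not real measurements.

```python
import os


def parse_status(text):
    """Extract VmSize and VmRSS (in kB) from /proc/<pid>/status text."""
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            result[key] = int(value.split()[0])  # value looks like "  81920 kB"
    return result


def own_memory():
    """Virtual and resident size of this process, or None if /proc is missing."""
    path = "/proc/self/status"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_status(f.read())


if __name__ == "__main__":
    sample = "Name:\tarangod\nVmSize:\t 4718592 kB\nVmRSS:\t   81920 kB\n"
    print(parse_status(sample))
    print(own_memory())
```

To inspect a running server instead of the current process, the same parser can be fed the contents of `/proc/<pid>/status` for that server's process ID.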

As mentioned before, the OS has to do a lot of work when accessing pages that are not in RAM, and you want to avoid this if possible. If the total size of your frequently used collections exceeds the size of the physical RAM, the OS has no alternative but to swap pages out and in a lot when you access these collections. Using an SSD for the swap will likely be better than using a spinning HDD, but is still far slower than RAM access. Long story short: the data of your active collections (datafiles plus indexes) should fit into physical RAM if possible, or you will see a lot of disk activity.

Apart from that, ArangoDB does not only allocate virtual memory for the collection datafiles, but it also starts a few V8 threads (V8 is the JavaScript engine in ArangoDB) that also use virtual memory. This virtual memory is not file-backed.

In an empty ArangoDB, V8 accounts for most of the virtual memory usage. For example, on my 64-bit computer, the V8 threads consume about 5 GB of virtual memory (but ArangoDB in total only uses 140 MB RAM), whereas on my 32-bit computer with less RAM, the V8 threads use about 600 to 700 MB of virtual memory. In your case, with the 4.5 GB VM usage, I suspect V8 is the reason, too.

The virtual memory usage for the V8 threads obviously correlates with the number of V8 threads started. For example, increasing the value of the startup parameter --server.threads will start more threads and use more virtual memory for V8, and lowering the value will start fewer threads and use less virtual memory.



Tags: arangodb