Description
When allocating and deallocating randomly sized memory chunks with 4 or more threads using OpenMP's parallel for construct, the program starts leaking considerable amounts of memory in the second half of the test program's runtime: its memory consumption grows from about 1050 MB to 1500 MB or more without the extra memory actually being used.
As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an amplified effect of memory fragmentation.
Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of the allocated chunks is reduced to 256 KB (from 1 MB), the effect gets weaker.
Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?
Test Program Description
The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are deallocated until memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All of this work is done for each thread spawned by OpenMP.
This memory allocation scheme lets us expect a memory consumption of ~260 MB per thread (including some bookkeeping data).
Demo Program
As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.
When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.
For completeness, the actual code follows:
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>
#include <omp.h>
#include <math.h>
#include <stdint.h>

void runParallelAllocTest()
{
    // constants
    const int NUM_ALLOCATIONS = 5000;   // alloc's per thread
    const int NUM_THREADS = 4;          // how many threads?
    const int NUM_ITERS = NUM_THREADS;  // how many overall repetitions
    const bool USE_NEW = true;          // use new or malloc? Seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;    // debug output

    // pre-store allocation sizes
    const int NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256; // x MB per thread
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE; // use chunks of up to MAX_CHUNK_SIZE bytes
    }

    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;
        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));
            totalAllocBytes += allocSize;
            currAllocBytes += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }

            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated; on the last iteration, free everything
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }

                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }
        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();
    return 0;
}
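In case the download link is unavailable: assuming the source file is named ompmemtest.cpp (a hypothetical name), the makefile presumably boils down to a single GCC invocation with OpenMP enabled, something like:

g++ -O2 -fopenmp ompmemtest.cpp -o ompmemtest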
The Test-System
From what I see so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.
Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips
Testing
Once you have executed the makefile, you should get a binary named ompmemtest. To query the memory usage over time, I used the following commands:
./ompmemtest &
top -b | grep ompmemtest
This yields the quite impressive fragmentation/leaking behaviour shown below. The expected memory consumption with 4 threads is ~1090 MB, but it grew to 1500 MB over time:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11626 byron 20 0 204m 99m 1000 R 27 2.5 0:00.81 ompmemtest
11626 byron 20 0 992m 832m 1004 R 195 21.0 0:06.69 ompmemtest
11626 byron 20 0 1118m 1.0g 1004 R 189 26.1 0:12.40 ompmemtest
11626 byron 20 0 1218m 1.0g 1004 R 190 27.1 0:18.13 ompmemtest
11626 byron 20 0 1282m 1.1g 1004 R 195 29.6 0:24.06 ompmemtest
11626 byron 20 0 1471m 1.3g 1004 R 195 33.5 0:29.96 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 194 33.5 0:35.85 ompmemtest
11626 byron 20 0 1469m 1.3g 1004 R 195 33.6 0:41.75 ompmemtest
11626 byron 20 0 1636m 1.5g 1004 R 194 37.8 0:47.62 ompmemtest
11626 byron 20 0 1660m 1.5g 1004 R 195 38.0 0:53.54 ompmemtest
11626 byron 20 0 1669m 1.5g 1004 R 195 38.2 0:59.45 ompmemtest
11626 byron 20 0 1664m 1.5g 1004 R 194 38.1 1:05.32 ompmemtest
11626 byron 20 0 1724m 1.5g 1004 R 195 40.0 1:11.21 ompmemtest
11626 byron 20 0 1724m 1.6g 1140 S 193 40.1 1:17.07 ompmemtest
Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
Ok, picked up the bait.
This is on a system with:

(system details omitted)
Naive run
I just ran it; nothing spectacular. Here is the simultaneous output of vmstat -S M 1:

(vmstat raw data omitted)

Does that information mean anything to you?
Google Thread Caching Malloc
Now for real fun, add a little spice: swap the allocator in at load time and run the same binary again.
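A sketch of the run, assuming libtcmalloc_minimal is installed under /usr/lib (the exact path and soname vary by distribution):

time LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.0 ./ompmemtest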
Looks faster, no?

(vmstat output omitted, in case you wanted to compare)
Valgrind --tool massif
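For reference, the massif workflow presumably amounts to the following (valgrind writes its data to massif.out.<pid> by default, where <pid> is the process id):

valgrind --tool=massif ./ompmemtest
ms_print massif.out.<pid> | head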
This is the head of the output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):

(massif output omitted)

Google HEAPPROFILE
Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools:

(heap profiler output omitted)
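The heap profiler is switched on via an environment variable; a sketch of the invocation, assuming the full (non-minimal) tcmalloc library, since the profiler isn't in tcmalloc_minimal, and an arbitrary output prefix:

LD_PRELOAD=/usr/lib/libtcmalloc.so HEAPPROFILE=/tmp/ompmemtest.hprof ./ompmemtest
pprof --text ./ompmemtest /tmp/ompmemtest.hprof.0001.heap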
Contact me for full logs/details
Update
To the comments: I updated the program.
It ran for over 5m3s. Close to the end, a screenshot of htop shows that the reserved set is indeed slightly higher, going towards 2.3g:

(htop screenshot omitted)

Comparing results with a tcmalloc run: 4m12s, with similar top stats showing minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism is increased: all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc.

When linking the test program with Google's tcmalloc library, the executable not only runs ~10% faster, but shows greatly reduced or insignificant memory fragmentation as well:

(top output omitted)
From the data I have, the answer appears to be:
Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.
The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.
But why does that happen?
The best idea I have to offer here is some sort of locking artifact within the heap.
The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.
This process could look like heap-block-shifting, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.
Let's restate that this fragmentation issue occurs for me only if I use 4 or more threads on a dual-core system, which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held briefly enough not to block the other thread that wants to allocate memory.
Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than a novice in the field of memory allocators; all I wrote is just how it appears to me, which makes it pure speculation.
Another interesting read might be the tcmalloc documentation, which describes common problems with heaps and multi-threaded access, some of which may have played a role in the test program too.
It's worth noting that tcmalloc will never return memory to the system (see the Caveats paragraph in the tcmalloc documentation).
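If returning memory matters for your use case, gperftools does expose a knob for it. A minimal sketch, assuming the program is linked against (or preloaded with) tcmalloc and the gperftools development headers are installed:

#include <google/malloc_extension.h> // newer installs: <gperftools/malloc_extension.h>

void releaseFreeMemoryToSystem()
{
    // Ask tcmalloc to hand as many unused pages as possible back to the OS.
    // The default (non-tcmalloc) implementation of this extension is a no-op.
    MallocExtension::instance()->ReleaseFreeMemory();
}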
Yes, the default malloc (depending on the Linux/glibc version) does some crazy stuff which fails massively in some multi-threaded applications. Specifically, it keeps almost one heap (arena) per thread to avoid locking. This is much faster than a single heap for all threads, but can be massively memory-inefficient. You can tune this with code like the sketch below, which turns off the multiple arenas (this kills performance, so don't do it if you have lots of small allocations!).
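A minimal sketch of such tuning with glibc's mallopt, assuming glibc >= 2.10 for the M_ARENA_MAX macro (older versions can set the MALLOC_ARENA_MAX environment variable instead):

#include <malloc.h>

int main()
{
    // Collapse glibc malloc to a single arena shared by all threads:
    // more lock contention, but no per-arena memory overhead.
    // Must be called before threads start allocating.
    mallopt(M_ARENA_MAX, 1);

    // ... start threads / do allocations here ...
    return 0;
}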
Or, as others suggested, use one of the various malloc replacements.
Basically, it's impossible for a general-purpose malloc to always be efficient, as it doesn't know how it's going to be used.
ChrisP.