I'm working on a system that involves some custom hardware and a custom Linux device driver I wrote for the hardware. The system occasionally needs to move large amounts of data very rapidly, so my driver dynamically (i.e. when needed) allocates large (1 GB) DMA buffers, which are used and then freed when they are no longer needed. To allocate such large buffers I actually allocate a bunch of smaller buffers (256 x 4 MB) using `dma_alloc_coherent` and then map them contiguously into user space using `remap_pfn_range`. This works very well most of the time.
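A heavily simplified sketch of what that path looks like (the function names and the fixed-size `chunks[]` array are placeholders for illustration, not the real driver code):

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>

#define CHUNK_SIZE	(4UL << 20)	/* 4 MB per chunk */
#define NUM_CHUNKS	256		/* 256 * 4 MB = 1 GB total */

struct dma_chunk {
	void		*cpu_addr;	/* kernel virtual address */
	dma_addr_t	dma_addr;	/* bus address handed to the device */
};

static struct dma_chunk chunks[NUM_CHUNKS];

/* Allocate the 1 GB buffer as 256 coherent 4 MB chunks. */
static int my_alloc_buffer(struct device *dev)
{
	int i;

	for (i = 0; i < NUM_CHUNKS; i++) {
		chunks[i].cpu_addr = dma_alloc_coherent(dev, CHUNK_SIZE,
							&chunks[i].dma_addr,
							GFP_KERNEL);
		if (!chunks[i].cpu_addr)
			goto unwind;	/* the failure discussed below */
	}
	return 0;

unwind:
	while (--i >= 0)
		dma_free_coherent(dev, CHUNK_SIZE, chunks[i].cpu_addr,
				  chunks[i].dma_addr);
	return -ENOMEM;
}

/*
 * mmap handler: map the chunks back-to-back so user space sees one
 * contiguous 1 GB region. On x86_64 the coherent memory comes from the
 * kernel direct map, so virt_to_phys() is valid here.
 */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long uaddr = vma->vm_start;
	int i, ret;

	if (vma->vm_end - vma->vm_start != NUM_CHUNKS * CHUNK_SIZE)
		return -EINVAL;

	for (i = 0; i < NUM_CHUNKS; i++) {
		ret = remap_pfn_range(vma, uaddr,
				      virt_to_phys(chunks[i].cpu_addr) >> PAGE_SHIFT,
				      CHUNK_SIZE, vma->vm_page_prot);
		if (ret)
			return ret;
		uaddr += CHUNK_SIZE;
	}
	return 0;
}
```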
During testing, after the system has been running test cases for a long time, I sometimes see DMA allocation failures: one of the `dma_alloc_coherent` calls in my driver fails, which causes my application-layer software to crash. I finally managed to track this problem down, and I discovered that when I see DMA allocation failures the Linux kernel page cache is very full.
For example, on the last failure that I captured, the page cache filled 27 GB of the 32 GB of RAM on my system. I suspected that this page cache "fullness" was causing the `dma_alloc_coherent` calls to fail. To test this theory I manually emptied the page cache using:
```
# echo 1 > /proc/sys/vm/drop_caches
```
This dropped the size of the cache from 27 GB to 94 MB and I was able to allocate 20+ 1 GB DMA buffers with no issues.
Clearly the page cache is a beneficial thing, so I would prefer not to completely empty it every time I run out of space when allocating DMA buffers. My question is this: how can I dynamically shrink the page cache in kernel space, such that if a call to `dma_alloc_coherent` fails I can recover just enough space to retry the call and have it succeed?
My system is x86_64 based running a 3.16.x Linux kernel.
I have found some vague references that suggest what I'm attempting may be possible, for example "These objects are automatically reclaimed by the kernel when memory is needed elsewhere on the system." (from: https://www.kernel.org/doc/Documentation/sysctl/vm.txt). But I have not yet found any specifics that indicate how the memory is reclaimed.
Any assistance with this would be greatly appreciated!
Technically, when an allocation fails the kernel will try to free memory itself. What happens depends on the kind of failure (soft failure vs. hard failure): a hard failure causes the kernel to enter the direct-reclaim path. Direct reclaim is a costly operation that can take an unbounded amount of time to complete, and even after it finishes the allocation might still fail.
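Whether direct reclaim is even attempted is controlled by the GFP flags passed in. A minimal sketch, assuming a 3.16 kernel where `GFP_KERNEL` includes `__GFP_WAIT` (so the caller may block and reclaim) and where `__GFP_REPEAT` asks the allocator to retry reclaim harder before giving up:

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Illustrative only. GFP_KERNEL allows the allocator to block and run
 * direct reclaim; __GFP_REPEAT makes it retry harder, at the cost of a
 * potentially long wait. Even so, the call can still return NULL.
 */
static void *alloc_chunk_trying_reclaim(struct device *dev, size_t size,
					dma_addr_t *handle)
{
	void *buf = dma_alloc_coherent(dev, size, handle, GFP_KERNEL);

	if (!buf)
		buf = dma_alloc_coherent(dev, size, handle,
					 GFP_KERNEL | __GFP_REPEAT);
	return buf;
}
```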
Here you have two options:
1) Tune VM settings like `dirty_ratio`, `dirty_background_ratio`, etc. to maintain free RAM. See: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html
2) Write a kernel daemon that calls the kernel function behind `drop_caches` (it must run in process context, because dropping caches might sleep).
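A minimal sketch of option 2, assuming the code is built into the kernel: `drop_pagecache_sb()` is static in fs/drop_caches.c and `iterate_supers()` is not exported to modules on 3.16, so you would have to export them (or replicate their logic) first. The thread name and the 10-second period are arbitrary choices for illustration:

```c
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/kthread.h>

/* Static in fs/drop_caches.c upstream; assumed exported for this sketch. */
extern void drop_pagecache_sb(struct super_block *sb, void *unused);

/* Kernel daemon that periodically drops clean page-cache pages, doing
 * in process context what "echo 1 > /proc/sys/vm/drop_caches" does. */
static int cache_reaper(void *unused)
{
	while (!kthread_should_stop()) {
		/* May sleep, hence the dedicated thread. */
		iterate_supers(drop_pagecache_sb, NULL);
		msleep_interruptible(10000);	/* every 10 seconds */
	}
	return 0;
}

static struct task_struct *reaper;

static int __init reaper_init(void)
{
	reaper = kthread_run(cache_reaper, NULL, "cache_reaper");
	return PTR_ERR_OR_ZERO(reaper);
}
```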
TL;DR: Scan for active superblocks and drop references to non-dirty ones until you have reclaimed as much system memory as you need (or until you run out of active superblocks to scan).
To answer this question, let us take a look at what the `drop_caches` operation did to reduce the fs page-cache from 27 GB to 94 MB on your system.

`echo 1 > /proc/sys/vm/drop_caches` invokes `drop_caches_sysctl_handler()`, which in turn invokes `iterate_supers()` and passes it a pointer to the function `drop_pagecache_sb()`. What happens next is that `iterate_supers()` scans the active superblocks, and every time it finds one it calls `drop_pagecache_sb()`, passing it a reference to that superblock. This iterative procedure continues until references to all the active superblocks are freed from the fs page-cache. It is a non-destructive operation that only frees blocks which are completely unused: dirty objects remain in use until they are written out to disk, and are not freeable. If you run `sync` first to flush them out to disk, the `drop_caches` operation tends to free more memory.
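Condensed and lightly paraphrased from fs/drop_caches.c on a 3.16 kernel (elisions are marked in comments), the path looks roughly like this:

```c
/* Handler behind /proc/sys/vm/drop_caches (condensed). */
int drop_caches_sysctl_handler(struct ctl_table *table, int write,
			       void __user *buffer, size_t *length,
			       loff_t *ppos)
{
	/* ... read the written value into sysctl_drop_caches ... */
	if (write) {
		if (sysctl_drop_caches & 1)	/* "echo 1": page cache */
			iterate_supers(drop_pagecache_sb, NULL);
		if (sysctl_drop_caches & 2)	/* "echo 2": slab objects */
			drop_slab();
	}
	return 0;
}

/*
 * Called once per active superblock: walk its inodes and throw away
 * their clean page-cache pages. invalidate_mapping_pages() skips
 * dirty, locked and mapped pages, which is what makes the operation
 * non-destructive.
 */
static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
	struct inode *inode;

	/* ... iterates sb->s_inodes under the appropriate locks ... */
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
		invalidate_mapping_pages(inode->i_mapping, 0, -1);
}
```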
A couple of points to keep in mind to further optimise this procedure:
Is there a preference for certain block devices over others?
You may want to iterate over the active superblocks of the block devices that you do not care about first. If enough memory has not been reclaimed, then scan the block devices that you would prefer to retain in the fs page-cache, but only to the extent absolutely necessary to reclaim the required memory. `get_active_super()` might be of help here.
`iterate_supers_type()` seems interesting
It allows one to iterate over the superblocks of a specific `file_system_type`, as sketched below.
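For example, a hypothetical helper that drops only the ext4 page-cache might look like this (again assuming `drop_pagecache_sb()` has been exported; `drop_pagecache_for_fstype` is a made-up name):

```c
#include <linux/fs.h>
#include <linux/module.h>

/* Static in fs/drop_caches.c upstream; assumed exported for this sketch. */
extern void drop_pagecache_sb(struct super_block *sb, void *unused);

/*
 * Hypothetical helper: drop clean page-cache pages belonging to one
 * filesystem type (e.g. "ext4"), leaving other filesystems' caches
 * untouched.
 */
static void drop_pagecache_for_fstype(const char *name)
{
	struct file_system_type *type = get_fs_type(name);

	if (!type)
		return;
	iterate_supers_type(type, drop_pagecache_sb, NULL);
	module_put(type->owner);	/* drop the ref taken by get_fs_type() */
}
```

Such a helper could be called from your driver's retry path when `dma_alloc_coherent` fails, before attempting the allocation again.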
Please note that this is a speculative solution, based purely on analysis of existing code within the Linux kernel that you have observed to already solve your problem. Implementing the approach above merely gives you control over that same mechanism, i.e. it lets you attempt to reclaim fs page-cache memory only to the extent required for your immediate needs.