I'm using a Xilinx Zynq platform with a region of memory shared between the programmable HW and the ARM processor.
I've reserved this memory using memmap on the kernel command line and then exposed it to userspace via mmap/io_remap_pfn_range calls in my driver.
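Roughly what the driver's mmap handler looks like (a simplified sketch; SHARED_MEM_BASE and SHARED_MEM_SIZE here are placeholders for the region reserved with memmap= on the command line):

#include <linux/fs.h>
#include <linux/mm.h>

/* Sketch only: SHARED_MEM_BASE/SHARED_MEM_SIZE stand in for the reserved region. */
static int shared_mem_mmap(struct file *filp, struct vm_area_struct *vma)
{
        size_t size = vma->vm_end - vma->vm_start;

        if (size > SHARED_MEM_SIZE)
                return -EINVAL;

        /* Pages are currently mapped with the default (cacheable) protections. */
        return io_remap_pfn_range(vma, vma->vm_start,
                                  SHARED_MEM_BASE >> PAGE_SHIFT,
                                  size, vma->vm_page_prot);
}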
The problem I'm having is that it takes some time for the writes to show up in DRAM, and I presume they're stuck in the dcache. There are a bunch of flush_cache_* calls defined, but none of them are exported, which is a clue to me that I'm barking up the wrong tree...
As a trial I locally exported flush_cache_mm just to see what would happen, and no joy.
In short, how can I be sure that any writes to this mmap'd region have been committed to DRAM?
Thanks.
ARM processors typically have both an I/D cache and a write buffer. The idea of a write buffer is to gang sequential writes together (great for synchronous DRAM) and to avoid stalling the CPU while a write completes.
To be generic, you can clean the D-cache and drain the write buffer. The following is some inline ARM assembler which should work for many architectures and memory configurations.
static inline void dcache_clean(void)
{
        const int zero = 0;
        /* Clean the entire D-cache: loop on the "test and clean" op
         * until it reports clean -> data pushed to external memory. */
        __asm volatile ("1: mrc p15, 0, r15, c7, c10, 3\n"
                        "   bne 1b\n" ::: "cc");
        /* Drain the write buffer so buffered writes reach memory. */
        __asm volatile ("mcr p15, 0, %0, c7, c10, 4" :: "r" (zero));
}
You may need more if you have an L2 cache.
To answer in a Linux context, there are different CPU variants and different routines depending on memory/MMU configurations and even CPU errata. See for instance,
- proc-arm926.S
- cache-v7.S
- cache-v6.S
- etc
These routines are either called directly or looked up through a CPU info structure containing function pointers to the appropriate routine for the detected CPU and configuration, depending on whether the kernel is built for a single CPU or is multi-purpose like an Ubuntu distribution kernel.
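As a rough illustration of that function-pointer dispatch (simplified from the kernel's struct cpu_cache_fns in arch/arm/include/asm/cacheflush.h; exact fields and macros vary by kernel version):

/* Simplified sketch of the cache-op dispatch. */
struct cpu_cache_fns {
        void (*flush_kern_all)(void);                   /* e.g. v7_flush_kern_cache_all */
        void (*flush_kern_dcache_area)(void *, size_t); /* e.g. v7_flush_kern_dcache_area */
        /* ... */
};

#ifdef MULTI_CACHE
extern struct cpu_cache_fns cpu_cache;   /* filled in at boot from the detected proc_info */
#define __cpuc_flush_dcache_area        cpu_cache.flush_kern_dcache_area
#else
#define __cpuc_flush_dcache_area        __glue(_CACHE, _flush_kern_dcache_area)  /* direct call */
#endif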
To answer the question specifically for your situation, we would need to know whether there is an L2 cache, whether the memory is write-buffered, and the CPU architecture specifics, possibly including silicon revisions for errata. Another tactic is to avoid the issue completely by using the dma_alloc_XXX() routines, which mark memory as un-cacheable and un-bufferable so that CPU writes are pushed externally immediately. Depending on your memory access pattern, either solution is valid. You may wish to keep the memory cacheable if it only needs to be synchronized at some checkpoint (vsync/hsync for video, etc.).
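For example, a coherent allocation in a driver could look something like this (a sketch; pdev and BUF_SIZE are placeholders, error handling trimmed):

#include <linux/dma-mapping.h>
#include <linux/platform_device.h>

#define BUF_SIZE (1 << 20)            /* placeholder size */

static void *shared_buf;              /* CPU (uncached) mapping of the buffer */
static dma_addr_t shared_buf_bus;     /* address to hand to the PL logic */

static int alloc_shared_buffer(struct platform_device *pdev)
{
        shared_buf = dma_alloc_coherent(&pdev->dev, BUF_SIZE,
                                        &shared_buf_bus, GFP_KERNEL);
        return shared_buf ? 0 : -ENOMEM;
}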
I hit the exact same problem, on Zynq. Finally got L2 flushed/invalidated with:
#include <asm/outercache.h>
outer_cache.flush_range(start, size);
outer_cache.inv_range(start, size);
Here, start is a kernel virtual space pointer. You also need to flush L1 to L2:
__cpuc_flush_dcache_area(start, size);
I'm not sure if invalidating L1 is needed before reading, and I haven't found the function to do this. I assume it would need to be, and I've thus far only been lucky...
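For what it's worth, a sketch of the full sequence using the exported wrappers; note that in mainline kernels outer_flush_range()/outer_inv_range() take physical start and end addresses, while __cpuc_flush_dcache_area() takes a kernel virtual address and a size (this assumes a lowmem, linear-mapped buffer):

#include <asm/cacheflush.h>
#include <asm/outercache.h>
#include <asm/memory.h>                 /* virt_to_phys() */

/* Push CPU writes in [vaddr, vaddr+size) out to DRAM for the PL to read. */
static void shared_buf_sync_for_device(void *vaddr, size_t size)
{
        phys_addr_t p = virt_to_phys(vaddr);

        __cpuc_flush_dcache_area(vaddr, size);    /* clean+invalidate L1; data lands in L2 */
        outer_flush_range(p, p + size);           /* clean+invalidate L2; data lands in DRAM */
}

/* Discard stale cached copies before the CPU reads data written by the PL. */
static void shared_buf_sync_for_cpu(void *vaddr, size_t size)
{
        phys_addr_t p = virt_to_phys(vaddr);

        outer_inv_range(p, p + size);             /* drop stale L2 lines first */
        __cpuc_flush_dcache_area(vaddr, size);    /* clean+invalidate L1 (no invalidate-only
                                                   * helper is exported; the streaming DMA API
                                                   * handles that case properly) */
}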
It seems that any suggestions I found on the net assume the device is "inside" the L2 cache coherency domain, so they did not work when the AXI-HP ports were used. With the AXI-ACP port, L2 flushing was not needed.
(For those not familiar with Zynq: the HP ports access the DRAM controller directly, bypassing any cache/MMU implemented on the ARM side.)
I'm not familiar with Zynq, but you essentially have two options that really work:
- either include your other logic on the FPGA in the same coherency domain (if Zynq has an ACP port, for example)
- or mark the memory you map as device memory (or normal non-cacheable, if you don't care about gathering, reordering and early write acknowledgement) and use a DSB after any write that should be seen (see the sketch below).
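As an illustration of the second option, a sketch of what the non-cacheable mapping and the barrier could look like on the kernel side (building on the io_remap_pfn_range-style handler from the question; SHARED_MEM_BASE is a placeholder):

#include <linux/fs.h>
#include <linux/mm.h>
#include <asm/barrier.h>

/* Map the reserved region non-cacheable so userspace writes go straight out. */
static int shared_mem_mmap_nc(struct file *filp, struct vm_area_struct *vma)
{
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        return io_remap_pfn_range(vma, vma->vm_start,
                                  SHARED_MEM_BASE >> PAGE_SHIFT,
                                  vma->vm_end - vma->vm_start,
                                  vma->vm_page_prot);
}

/* After the last write, issue a DSB-class barrier before signalling the PL. */
static void publish_to_device(void)
{
        wmb();          /* a DSB-type barrier on ARMv7: the writes are pushed out
                         * before any subsequent doorbell write is issued */
        /* ... notify the FPGA logic here ... */
}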
If the memory is marked as cacheable and your other observer is not in the same coherency domain, you are asking for trouble: when you clean the D-cache with a DCCISW or similar op and you have an L2 cache, that's where all the data will end up, not in DRAM.