What is the fastest way to swap two non-overlapping memory areas of equal size? Say, I need to swap (t_Some *a)
with (t_Some *b)
. Considering space-time trade-off, will increased temporary space improve the speed? For example, (char *tmp)
vs (int *tmp)
? I am looking for a portable solution.
Prototype:
void swap_elements_of_array(void* base, size_t size_of_element, int a, int b);
Obviously, you have to copy A to Temp, copy B to A, then copy Temp to B. You can do this all at once, for a small area, or do it in sections for a larger area, where you don't want to allocate such a large Temp value. The choice of section size is up to you, though consideration of alignment and cache issues appropriate for the hardware is important, for large, frequent moves.
(Well, actually there is another way, which doesn't require any temp space: XOR A with B, then XOR B with A, then XOR A with B. An old assembly programmer's trick.)
You could use the logic described here. This way, you could save a third buffer.
Even only this one temporary variable is enough to help the compiler optimize this.
But if you use such a temporary variable, you can do as well
In the first glance, both of them look expensive due to the many array accesses (in the 1st case) and the processing of only one byte per loop run, but if you let your compiler optimize this, it should be ok, as (at least gcc) is smart enough to bundle always 4 steps (in x64: even 16 steps) into one loop run.
Note that your compiler might not optimize so aggressively, so you might have to do the said splitting by yourself. In this case, take care about the alignment.
If the 2 memory areas are large and fit in integer number of memory pages then you can swap their Page Table Entries in order to swap their contents without using memcpy() or XORs.
In theory, with two large 2MiB pages, you need to write only 16 bytes of paging structures to swap their mapping in the virtual address space ...and hence their contents, too.
1GiB pages are possible on x86-64 CPUs in 64-bit mode and content of 2 such 1GiB memory blocks can also be swapped with writing only several bytes of paging structures.
The caveat of this method is that the access to paging structures requires Kernel Mode privileges or using shared memory mapping functions from User Mode.
With recent Meltdown patches (KPTI), transitioning to Kernel Mode from User Mode has become much more expensive. Probably too expensive to make 4kiB memory page swapps competitive with memcpy()...but if you have 2MB or larger memory blocks to swap, then swapping their Paging Structures is faster.
The speed for this will be partly platform dependent and only really borne out by testing.
Personally I'd favour creating a memory block of equal size to one of the arrays; use memcpy to swap the contents around, using the newly created memory block as swap space.
Now the size of the memory block will have an impact on the speed of operation (again platform dependent) and so you may find that for very large arrays swapping smaller amounts of data back and forth is faster than swapping a large chunk each time.
edit
In light of the comment let me explain, my last comment about swapping smaller amounts of data.
Your aim is to transfer
a
data tob
andb
data toa
using a temporary swap spacetmp
.The size of
tmp
is equal to or less than the size ofa
orb
and the number of iterations of swapping data increases as the size oftmp
is reduced e.g. iftmp
is a 10th of thea
then 10 iterations will be needed.Now in order to aid the speed of memcpy it is best to ensure that the arrays (a, b and tmp) are allocated aligned memory space.
The intention of the above fragment is to allow the highly optimised libc versions of memcpy (or inlining by the compiler) to take all the freedom they need. The alignment is crucial. If VLAs are not avalable (before C99) a macro can be composed, using a funky do-while.
Your best bet is to maximize registers usage so that when you read a temporary you don't end up with extra (likely cached) memory accesses. Number of registers will depend on a system and registers allocation (the logic that maps your variables onto actual registers) will depend on a compiler. So your best bet is I guess to expect only one register and expect its size to be the same as the pointer. Which boils down to a simple for-loop dealing with blocks interpreted as arrays of
size_t
.