In the original vmsplice() implementation, it was suggested that if you had a user-land buffer twice the maximum number of pages that can fit in a pipe, a successful vmsplice() on the second half of the buffer would guarantee that the kernel was done using the first half of the buffer.
But that turned out not to be true; in particular for TCP, the kernel would keep the pages until it received an ACK from the other side. Fixing this was left as future work, so for TCP the kernel would still have to copy the pages out of the pipe.
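For concreteness, here is a minimal sketch of that double-buffering scheme as I understand it; the 64 KiB pipe capacity and the memset() producer are my own illustrative assumptions, not part of the original discussion:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define HALF (64 * 1024)   /* assumed default pipe capacity */

/* Stand-in producer; a real server would fill the half with payload data. */
static void fill_half(char *p, size_t len, int gen)
{
    memset(p, 'A' + (gen & 1), len);
}

/* Alternately vmsplice() each half of a buffer that is twice the pipe
 * capacity.  The premise was that a successful vmsplice() of half B means
 * half A may be rewritten -- which, as noted above, does not hold for TCP
 * until the peer has ACKed the data. */
int double_buffer_loop(int pipe_wr)
{
    char *buf = aligned_alloc(4096, 2 * HALF);
    if (!buf)
        return -1;

    for (int gen = 0; gen < 8; gen++) {          /* a few rounds for the sketch */
        char *cur = buf + (gen & 1) * HALF;
        fill_half(cur, HALF, gen);               /* rewriting this half is the hazard */

        struct iovec iov = { .iov_base = cur, .iov_len = HALF };
        if (vmsplice(pipe_wr, &iov, 1, 0) < 0) { /* blocks until the pipe has room */
            free(buf);
            return -1;
        }
    }
    free(buf);
    return 0;
}
```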
vmsplice() has the SPLICE_F_GIFT option that sort of deals with this, but it exposes two other problems: how to efficiently get fresh pages from the kernel, and how to reduce cache thrashing. The first issue is that mmap requires the kernel to clear the pages; the second is that even if mmap uses the fancy kscrubd feature in the kernel, it increases the working set of the process (cache thrashing).
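A minimal sketch of the SPLICE_F_GIFT variant, assuming a 64 KiB chunk and a fresh anonymous mmap() per chunk (both assumptions are mine, for illustration); the repeated mmap() of zeroed pages is exactly where the page-clearing cost described above shows up:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>

#define CHUNK (64 * 1024)   /* assumed chunk size, one pipe's worth */

/* Gift one freshly mmap()ed chunk to the pipe.  Each call makes the kernel
 * hand back zeroed pages, which is the cost discussed above. */
int gift_one_chunk(int pipe_wr)
{
    char *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 'x', CHUNK);                     /* produce data */

    struct iovec iov = { .iov_base = buf, .iov_len = CHUNK };

    /* SPLICE_F_GIFT hands the pages over; they must be page-aligned and
     * must not be reused for new data afterwards. */
    if (vmsplice(pipe_wr, &iov, 1, SPLICE_F_GIFT) < 0) {
        munmap(buf, CHUNK);
        return -1;
    }

    /* Drop our mapping; the pipe (and later the socket) keeps its own
     * references to the gifted pages. */
    return munmap(buf, CHUNK);
}
```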
Based on this, I have these questions:
- What is the current state of notifying userland that pages can safely be re-used? I am especially interested in pages splice()d onto a TCP socket. Has anything happened in the last 5 years?
- Is mmap/vmsplice/splice/munmap still the current best practice for zero-copying in a TCP server, or do we have better options today? (A minimal sketch of this sequence follows below.)
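For reference, this is a minimal sketch of the mmap/vmsplice/splice/munmap sequence I mean, assuming an already connected socket and an existing pipe (the file descriptors and the 64 KiB size are placeholders of mine):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>

#define LEN (64 * 1024)   /* assumed buffer size, one pipe's worth */

/* Push one mmap()ed buffer through a pipe onto a connected TCP socket:
 * mmap -> vmsplice -> splice -> munmap.  Partial-write handling is kept
 * minimal; sock_fd, pipe_rd and pipe_wr are assumed to exist already. */
int send_zero_copy(int sock_fd, int pipe_rd, int pipe_wr)
{
    char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 'x', LEN);                       /* produce data */

    struct iovec iov = { .iov_base = buf, .iov_len = LEN };

    /* user pages -> pipe, gifted so the mapping can be dropped below */
    ssize_t in = vmsplice(pipe_wr, &iov, 1, SPLICE_F_GIFT);
    if (in < 0) {
        munmap(buf, LEN);
        return -1;
    }

    /* pipe -> socket, without copying back through user space */
    for (ssize_t left = in; left > 0; ) {
        ssize_t out = splice(pipe_rd, NULL, sock_fd, NULL,
                             (size_t)left, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (out <= 0) {
            munmap(buf, LEN);
            return -1;
        }
        left -= out;
    }

    /* Drop the mapping; as discussed above, for TCP the kernel may still
     * hold the pages until the peer ACKs the data, which is exactly what
     * this question is about. */
    return munmap(buf, LEN);
}
```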