Linux sockets: Zero-copy local, TCP/IP remote

Networking is my worst area in operating systems, so forgive me for asking perhaps an incomplete question. I've been reading about this for a few hours, but it's kinda swimming in my head. (To me, I feel like chip design is easy compared to figuring out networking protocols.)

I have some networked services that communicate with each other via sockets. Specifically, the sockets are created with fd = socket(PF_INET, SOCK_STREAM, 0);, which automatically gets TCP/IP. I need this as the base case, because these services may be running on separate machines.

But for one project, we're trying to squeeze all of them into an underpowered embedded 'appliance', based on an Atom Z530P, so it seems to me that the memory copy overhead is something we could optimize out. I've been reading about that here: data-link-access-and-zero-copy and Linux_packet_mmap and packet_mmap.

For this case, one would create the socket something like this: fd = socket(PF_PACKET, SOCK_RAW, 0);. And there's a bunch of other stuff to do, like allocating ring buffers, mmapping them, associating them with the socket, etc. It looks like you're restricted to using sendto and recvfrom in order to transmit data. As I understand it, since the socket is local, you don't need a reliable "stream" type socket, so raw sockets are the appropriate interface, and I'm guessing that the ring buffer is used at page granularity, where each packet (or datagram) starts at a page boundary.

Before I spend a huge amount of time trying to investigate this further, I was hoping some helpful individuals might help me with some questions:

1. How much performance benefit should I expect to get here from zero-copy sockets? The last I checked, we were moving a maximum of about 40 MB/sec from one process to another and finally to the disk. In the most basic scenario, data moves from the capture process, to the one-to-many process (others can listen in on the stream), to the archiver process that writes to disk. That's two hops, not counting the disk and internal stuff.
2. Does Linux do any of this automatically, optimizing for processes running on the same machine?
3. In any case, I would have listening sockets on TCP ports. Can I use those to make connections between processes yet still be able to use zero-copy? In other words, can I use AF_INET with PF_PACKET?
4. Is PF_PACKET with SOCK_RAW the only valid configuration for zero-copy sockets?
5. Is there any good sample code out there that will use zero-copy with TCP/IP as a fallback?
6. What's the simplest or best way to detect that the two processes are on the same machine? They know each other's IP addresses, so I could just compare and use different code paths for each. Is there a simpler way to do this?
7. Can I use write() and read() on a packet-based socket, or are those only valid for streams? (Rewriting how connections are made would be simpler than rewriting ALL of the socket code.)
8. Am I over-complicating things and/or optimizing the wrong thing? OProfile tells me that most CPU time is spent in two places: (1) zlib, and (2) the kernel, which I can't profile since I'm using CentOS 6.2, which doesn't provide a vmlinux. I assume the kernel time is a combination of idle time and data copying and not much else.

Thanks in advance for the help!

Answer 1:

Am I over-complicating things and/or optimizing the wrong thing?

Possibly. Using PF_PACKET sockets is only for specialized stuff. You probably want to look into

sendfile(2)
splice(2)

What's the simplest or best way to detect that the two processes are on the same machine?

Simply not "forgetting" this information.
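If the information really has been lost by the time you hold a connected socket, one way to recover it (a sketch under assumptions, IPv4 only) is to compare the local and peer addresses of the connection. When a process connects to an address owned by its own machine, the kernel normally picks that same address as the source, so the two ends report the same IP. The helper names same_ipv4 and peer_is_local are hypothetical, and setups with NAT or address rewriting could make this check misleading.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns 1 if the two IPv4 addresses are equal (ports are ignored). */
int same_ipv4(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return a->sin_addr.s_addr == b->sin_addr.s_addr;
}

/* Hypothetical helper for a connected AF_INET socket.
 * Returns 1 if the peer appears to be on this machine, 0 if not,
 * and -1 on error or if the socket is not IPv4. */
int peer_is_local(int fd)
{
    struct sockaddr_in local, peer;
    socklen_t llen = sizeof local, plen = sizeof peer;

    if (getsockname(fd, (struct sockaddr *)&local, &llen) < 0 ||
        getpeername(fd, (struct sockaddr *)&peer, &plen) < 0)
        return -1;
    if (local.sin_family != AF_INET || peer.sin_family != AF_INET)
        return -1;

    return same_ipv4(&local, &peer);
}
```

The result could then select between the ordinary TCP code path and whatever local fast path you end up implementing.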

Does Linux do any of this automatically, optimizing for processes running on the same machine?

No, you have to do it yourself.

Answer 2:

I think the choice between TCP/IP and raw packets is much more important than the zero-copy question. If you need reliable stream-based communication, you need TCP/IP (that is, AF_INET+SOCK_STREAM). Trying to implement a reliable stream over unreliable packets is very complicated, and it's already done for you.

The best way to use TCP/IP with zero copy and files is, as @cnicutar says, sendfile(2) and splice(2). I think there's a way to enjoy zero-copy without these (if you want to read data into memory, not directly to a file), but I'm not sure how to do it.

Also, CentOS is open source, so you can get a vmlinux file by downloading the kernel source and compiling it.