Why is the RX ring of a Linux raw socket limited t

Posted 2019-07-25 17:06

Question:

Background

I'm trying to mmap() the RX ring buffer of a raw socket in my 64-bit Linux application. My ring consists of 4096 blocks of size 1MB each for a total of 4GB. (Note there can be many frames in each 1MB block. See this doc for background if you're curious.)

Problem

Unfortunately, there appears to be a 4GB limit on the size of the RX ring buffer when configuring it using setsockopt() with option PACKET_RX_RING. The implication for me is that I cannot increase either my block size or my ring size any further. My application would benefit from an increase though.
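For concreteness, here is a minimal sketch of how a ring with this geometry would be requested. The helper names `fill_ring_request` and `request_rx_ring`, and the 2048-byte frame size used in the note below, are illustrative assumptions rather than code from my application. Only `struct tpacket_req`, `setsockopt()`, `SOL_PACKET` and `PACKET_RX_RING` belong to the real interface.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper: fill in a ring request and return the total ring
 * size in bytes. The product is computed in 64 bits so that it cannot
 * wrap, even though every field of struct tpacket_req is an unsigned int. */
uint64_t fill_ring_request(struct tpacket_req *req,
                           unsigned int block_size,
                           unsigned int block_nr,
                           unsigned int frame_size)
{
    assert(frame_size != 0 && frame_size <= block_size);

    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    /* The frame count has to match the block geometry. */
    req->tp_frame_nr   = (block_size / frame_size) * block_nr;

    return (uint64_t)block_size * block_nr;
}

/* Hypothetical helper: ask the kernel to allocate the RX ring on an
 * already-open AF_PACKET socket. Returns 0 on success, or -1 with errno
 * set, exactly as setsockopt() does. */
int request_rx_ring(int fd, const struct tpacket_req *req)
{
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```

With 4096 blocks of 1 MiB, `fill_ring_request(&req, 1u << 20, 4096, 2048)` returns 2^32 bytes, which is one more than the largest value a 32-bit unsigned int can hold.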

This limit is at least enforced in modern versions of the Linux kernel. See the source here.

(Note that there is no issue mmap()ing larger than 4GB. I originally stumbled into this question.)

Question

Why is there a limit of 4GB for the RX ring buffer? If it's a bug when migrating this part of the kernel from 32-bit, that would be cool and might be straightforward to patch. But, if there's a more fundamental reason, I'm curious to know what that may be.

Answer 1:

This was a bug. I submitted a patch to fix this (now merged): https://github.com/torvalds/linux/commit/fc62814d690cf62189854464f4bd07457d5e9e50



Answer 2:

This is a partial answer. Let's examine how the PACKET_MMAP ring buffer structure and its buffers get allocated. This is done at line 4270 by calling alloc_pg_vec. The structure itself is just an array of pointers and is allocated using kcalloc, which imposes an upper limit of 131,072 bytes. If each pointer is 8 bytes in size, then there can be at most 131,072 / 8 = 16,384 = 2^14 blocks. Each block is allocated using alloc_one_pg_vec_page. Note that each block must be contiguous in memory and its size must be a multiple of the page size (4096 bytes). The order parameter passed to alloc_pg_vec represents the number of pages to be allocated as a power of two for each buffer. There is a limit on the order defined in mmzone.h, which is MAX_ORDER = 11. Therefore, the maximum block size is 4096 * 2^11 = 2^23 bytes = 8 MiB. This means that the ring overall is limited to 2^14 * 2^23 = 2^37 bytes, or 128 GiB.
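The arithmetic above can be written out as a small sketch. The constants are just the figures quoted in this answer, taken at face value. They are not read from a running kernel, and the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the answer; assumptions, not values queried from a kernel. */
#define KCALLOC_LIMIT_BYTES 131072ULL /* quoted upper limit for kcalloc  */
#define POINTER_BYTES       8ULL      /* pointer size on a 64-bit system */
#define PAGE_BYTES          4096ULL   /* page size                       */
#define MAX_ORDER_QUOTED    11        /* MAX_ORDER as quoted from mmzone.h */

/* Compile-time checks of the powers of two stated in the text. */
static_assert(KCALLOC_LIMIT_BYTES / POINTER_BYTES == (1ULL << 14),
              "2^14 block pointers fit in the pointer array");
static_assert((PAGE_BYTES << MAX_ORDER_QUOTED) == (1ULL << 23),
              "one block is at most 2^23 bytes (8 MiB)");

/* Largest number of blocks the pointer array can describe. */
uint64_t max_blocks(void)
{
    return KCALLOC_LIMIT_BYTES / POINTER_BYTES;
}

/* Largest single block, using the quoted order limit. */
uint64_t max_block_bytes(void)
{
    return PAGE_BYTES << MAX_ORDER_QUOTED;
}

/* Upper bound on the whole ring under these assumptions. */
uint64_t max_ring_bytes(void)
{
    return max_blocks() * max_block_bytes();
}
```

Under these assumptions `max_ring_bytes()` evaluates to 2^37 bytes, which is the 128 GiB figure given above.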

I don't know why the check req->tp_block_size > UINT_MAX / req->tp_block_nr is performed. It could be because older versions of Linux supported smaller numbers of buffers and buffer sizes (MAX_ORDER was 10 in Linux 2.4). But perhaps by removing the check, you can allocate rings up to 128 GiB in size.
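To see what that check does with the ring from the question, here is a minimal sketch that reproduces the comparison in user space. The function names are hypothetical, and reading the check as a guard against 32-bit wraparound of the product is my interpretation rather than something the kernel source states.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Same comparison as the quoted check, on plain unsigned ints.
 * Returns true when the request would be rejected. */
bool ring_rejected_by_check(unsigned int block_size, unsigned int block_nr)
{
    assert(block_nr != 0); /* avoid dividing by zero in this sketch */
    return block_size > UINT_MAX / block_nr;
}

/* What block_size * block_nr yields when it is computed in unsigned int.
 * Unsigned overflow is well defined in C: the result wraps modulo
 * UINT_MAX + 1. */
unsigned int wrapped_total(unsigned int block_size, unsigned int block_nr)
{
    return block_size * block_nr;
}
```

With a 32-bit unsigned int, 4096 blocks of 1 MiB are rejected (`ring_rejected_by_check(1u << 20, 4096)` is true) and the 32-bit product wraps to 0, while 4095 blocks of 1 MiB pass the check. That is consistent with the 4 GB ceiling described in the question.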