Considering mmap() versus plain reads for my recent code
The other day I wrote about a brute force approach to mapping IPv4 /24 subnets to Autonomous System Numbers (ASNs), where I built a big, somewhat sparse file of four-byte records, with the record for each /24 at a fixed byte position determined by its first three octets (so 0.0.0.0/24's ASN, if any, is at byte 0, 0.0.1.0/24 is at byte 4, and so on). My initial approach was to open(), lseek(), and read() to access the data; in a comment, Aristotle Pagaltzis wondered if mmap() would perform better. The short answer is that for my specific case I think it would be worse, but the issue is interesting to talk about.
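In rough outline, the plain-IO lookup looks something like the following sketch. The function names are mine, and the record format here (a big-endian four-byte unsigned integer, with 0 meaning 'no ASN') is an assumption for illustration, not necessarily what the real data file uses:

```python
import struct

RECORD_SIZE = 4  # one four-byte ASN record per /24

def record_offset(ip):
    # The first three octets of the address index the record,
    # so 0.0.0.0 maps to byte 0, 0.0.1.0 to byte 4, and so on.
    o1, o2, o3 = (int(x) for x in ip.split(".")[:3])
    return ((o1 << 16) | (o2 << 8) | o3) * RECORD_SIZE

def lookup_asn(datafile, ip):
    # The open(), lseek(), read() path: seek to the fixed
    # position for this /24 and read its record.
    with open(datafile, "rb") as f:
        f.seek(record_offset(ip))
        rec = f.read(RECORD_SIZE)
    # Assumed encoding: big-endian uint32, 0 for 'no ASN'.
    return struct.unpack(">I", rec)[0] or None
```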
(In general, my view is that you should use mmap() primarily if it makes the code cleaner and simpler. Using mmap() for performance is a potentially fraught endeavour that you need to benchmark.)
In my case I have two strikes against mmap() likely being a performance advantage: I'm working in Python (and specifically Python 2) so I can't really directly use the mmap()'d memory, and I'm typically making only a single lookup per process (because my program is running as a CGI). In the non-mmap() case I expect to do an open(), an lseek(), and a read() (which will trigger the kernel possibly reading from disk and then definitely copying data to me). In the mmap() case I would do open(), mmap(), and then access some page, triggering possible kernel IO and then causing the kernel to manipulate process memory mappings to map the page into my address space. In general, it seems unlikely that mmap() plus the page access handling will be cheaper than lseek() plus read().
(In both the mmap() and read() cases I expect two transitions into and out of the kernel. As far as I know, lseek() is a cheap system call (and certainly it seems unlikely to be more expensive than mmap(), which has to do a bunch of internal kernel work), and the extra work the read() does to copy data from the kernel to user space is probably no more work than the kernel manipulating page tables, and could be less.)
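For comparison, a sketch of the mmap() version of the same lookup (same assumed record format as before, using Python's mmap module):

```python
import mmap
import struct

def lookup_asn_mmap(datafile, ip):
    o1, o2, o3 = (int(x) for x in ip.split(".")[:3])
    off = ((o1 << 16) | (o2 << 8) | o3) * 4
    with open(datafile, "rb") as f:
        # Map the whole file read-only; the first access to a given
        # page is what triggers any IO plus the page-table work.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # unpack_from() reads through the buffer protocol, but
            # it still produces a new Python object; there is no
            # zero-copy win for us here.
            return struct.unpack_from(">I", m, off)[0] or None
        finally:
            m.close()
```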
If I were doing more lookups in a single process, I could possibly win with the mmap() approach but it's not certain. A lot depends on how often I would be looking up something on an already mapped page and how expensive mapping in a new page is compared to some number of lseek() plus read() system calls (or pread() system calls if I had access to that, which would cut the number of system calls in half). In some scenarios, such as a burst of traffic from the same network or a closely related set of networks, I could see a high hit rate on already mapped pages. In others, the IPv4 addresses are basically random and widely distributed, so many lookups would require mapping new pages.
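(The pread() variant would look something like this sketch. os.pread() exists in Python 3.3 and later but not in Python 2, which is the 'if I had access to that' caveat; again the record format is my assumption:

```python
import os
import struct

def lookup_asn_pread(fd, ip):
    # A single positioned read replaces the lseek() + read() pair,
    # halving the per-lookup system calls after the initial open().
    o1, o2, o3 = (int(x) for x in ip.split(".")[:3])
    off = ((o1 << 16) | (o2 << 8) | o3) * 4
    rec = os.pread(fd, 4, off)
    return struct.unpack(">I", rec)[0] or None
```

Since pread() takes its offset as an argument, the file descriptor's position is never moved, which also makes this safe to use from multiple threads sharing one descriptor.)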
(Using mmap() makes it unnecessary to keep my own in-process cache, but I don't think it really changes what the kernel will cache for me. Both read()'ing from pages and accessing them through mmap() keeps them recently used.)
Things would also be better in a language where I could easily make zero-copy use of data right out of the mmap()'d pages themselves. Python is not such a language, and I believe that basically any access to the mmap()'d data is going to create new objects and copy some bytes around. I expect that this results in as many intermediate objects and so on as if I used Python's read() stuff.
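(A quick illustration of the copying: in CPython, slicing an mmap hands back a fresh bytes object each time, copied out of the mapping, much as read() would give you. A memoryview gets you an actual view, but converting it into anything you can compute with creates new objects again.

```python
import mmap
import os
import tempfile

# Build a small scratch file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00\x01\x02\x03\x04\x05\x06\x07")
os.close(fd)

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    a = m[0:4]   # a new bytes object, copied out of the mapping
    b = m[0:4]   # another, separate copy of the same four bytes
    assert a == b and a is not b
    m.close()
os.remove(path)
```

So the per-lookup object traffic is about the same either way.)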
(Of course if I really cared there's no substitute for actually benchmarking some code. I don't care that much, and the code is simpler with the regular IO approach because I have to use the regular IO approach when writing the data file.)