I’m Not Dead Yet: The Role of the Operating System in a Kernel-Bypass Era [pdf]

mbjorling · on April 12, 2019

It is worth mentioning that the Linux kernel has a new kernel API (io_uring) that changes the whole argument around using libos designs. With the new io_uring interface (available as of Linux kernel 5.1), peak throughput is 1.7M IOPS per core, which beats or comes close to SPDK performance [0]. Later updates to the patches improve the throughput even more.

Jens (the author) has done a great writeup [1].

[0] https://lore.kernel.org/linux-block/20190116175003.17880-1-a...
[1] http://kernel.dk/io_uring.pdf

benlwalker · on April 12, 2019

Jens' benchmark for SPDK quoted there is far off from the numbers we (the SPDK community) measure. We are able to replicate his io_uring numbers though, so we agree that the new interface is a large improvement. We're working to make full benchmarking data available shortly.

rambojazz · on April 12, 2019

Could you please elaborate more on io_uring vs libos? I would like to understand more but I don't really know how they compare...

bryanwb · on April 12, 2019

Can you elaborate on how io_uring bypasses the kernel?

ncmncm · on April 12, 2019

io_uring dumps data directly into a ring buffer mapped into the user-level address space. User code is notified by (at least) an updated atomic counter. The user process must be finished with the data before the kernel comes around again to overwrite it. Often that demands that the user process or thread be bound to a core on which the OS has been forbidden to run anything else, and that the thread do a carefully circumscribed amount of work, rarely including memory allocation, I/O, or even system calls, any of which may cause it to be "lapped" by subsequent writes.
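
To make "lapped" concrete, here is a minimal single-threaded sketch of an overwriting ring with a monotonically increasing counter. It is an illustration only, not the io_uring or any NIC API: `OverwritingRing`, `Consumer`, and their fields are hypothetical names, and a real shared-memory version would also need atomic loads and a re-check of the counter after copying each slot.

```python
class OverwritingRing:
    """Fixed-size ring that the producer overwrites without waiting for the consumer."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0  # total items ever written; stands in for the atomic counter

    def produce(self, item):
        # The producer never blocks: once the ring is full it overwrites the oldest slot.
        self.slots[self.head % self.size] = item
        self.head += 1


class Consumer:
    def __init__(self, ring):
        self.ring = ring
        self.tail = 0  # total items this consumer has read or skipped
        self.lost = 0  # items overwritten before they were read

    def poll(self):
        """Return the next unread item, or None if nothing new has arrived."""
        head = self.ring.head
        if head - self.tail > self.ring.size:
            # Lapped: the oldest unread slots have already been overwritten.
            skipped = head - self.ring.size - self.tail
            self.lost += skipped
            self.tail += skipped
        if self.tail == head:
            return None
        item = self.ring.slots[self.tail % self.ring.size]
        self.tail += 1
        return item
```

With a ring of 4 slots, a consumer that has read 3 items and then stalls while 7 more arrive comes back to find 3 of them gone. The only signal it gets is the gap between its own position and the counter.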

The idea is that the average time to process a packet absolutely must not exceed the average interval between arrivals, and bursts in the arrival rate must average out, over the size of the ring buffer, to less than the processing rate.
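
As a rough way to check that condition against a recorded arrival trace, the sketch below tracks the backlog tick by tick and compares its peak with the ring size. The function names, the fixed `service_per_tick`, and the per-tick model are all assumptions made for illustration; real arrival and service times are not this regular.

```python
def max_backlog(arrivals_per_tick, service_per_tick):
    """Return the peak number of unprocessed items over the trace."""
    backlog = 0
    peak = 0
    for arrived in arrivals_per_tick:
        backlog += arrived                 # the burst lands in the ring first
        peak = max(peak, backlog)
        backlog = max(0, backlog - service_per_tick)  # then the consumer drains
    return peak


def survives(arrivals_per_tick, service_per_tick, ring_size):
    """True if the consumer is never lapped for this arrival trace."""
    return max_backlog(arrivals_per_tick, service_per_tick) <= ring_size
```

For example, the trace `[1, 1, 8, 1, 1, 0, 0]` averages under 2 arrivals per tick, so a consumer handling 2 per tick keeps up on average, but the burst of 8 still needs a ring of at least 8 slots to avoid loss.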

The hamster process pulling from the ring may just be load balancing to a herd of other threads operating under less stringent conditions, so they might be permitted i/o.
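
The shape of that arrangement, sketched with ordinary Python threads and queues: the hot loop only hands items off, and the workers are free to block. `run_dispatcher` and `handle` are hypothetical names, and `queue.Queue` takes locks internally, so a real hot thread would hand off through something lock-free instead.

```python
import queue
import threading


def run_dispatcher(items, num_workers, handle):
    """Hand items round-robin to worker threads and collect their results."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()

    def worker(inbox):
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more work for this worker
                return
            results.put(handle(item))  # workers may block or do I/O here

    threads = [threading.Thread(target=worker, args=(q,)) for q in inboxes]
    for t in threads:
        t.start()

    # The hot loop only enqueues; it does no slow work itself.
    for i, item in enumerate(items):
        inboxes[i % num_workers].put(item)
    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()

    # All workers have exited, so draining without blocking is safe.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)
```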

bryanwb · on April 13, 2019

Thanks for the great explanation!

ncmncm · on April 12, 2019

Every time somebody comes in with another abstraction scheme, my first thought is, "great, how do I bypass it?". Give me onload, I use ef_vi. Give me exasock, I use exanic.

The code needed to operate at the lower level never turns out to be more than a hundred or two lines (much less for exanic), but it always eliminates latency that comes from doing crap I will just need to undo, or redo differently.

AF_XDP and eBPF suggest the promise of making that code portable, yet running it in the NIC itself, possibly even eliminating a polling thread that uses up a whole core on the host, but it seems to need more support in the library available to the eBPF code. Specifically, it needs better access to (pre-permission-checked and mapped) DMA to host RAM, and precise, accurate timestamps. They don't need to be pretty, but they need to exist.

lukego · on April 12, 2019

Amen.

bryanwb · on April 12, 2019

Interesting paper, but where is the code? I am not familiar with DPDK, but how would the control path of Demikernel configure a DPDK device so that it can service multiple applications? Perhaps this isn't necessary for DPDK?

It is hard to take this paper seriously if it is just a thought experiment and they haven't actually implemented Demikernel. Figure 3 lists the syscalls, but that isn't enough to convince me that they have actually implemented Demikernel.