Understanding the code inside Tornado, the asynchronous web server

ZoFreX · on May 7, 2011

> Lets say you have 20 threads. You improved performance 20 times, so the rate is now 4 request per second. Still, way too small. You can keep throwing threads at the problem, but threads are expensive in terms of memory usage and scheduling. I doubt you’ll ever reach hundreds of requests per second this way.

Let's say I'm using Java instead of Python. Lets say I use a lot more threads than 20. I will reach thousands of requests per second this way. There are real problems with that approach, and event-driven approaches have genuine advantages, but can people please stop straight-up lying and saying that you can't just throw a few thousand threads at a problem, because you can, and people do.

barrkel · on May 7, 2011

Something I'd like people to understand, at a very deep level: event-based async programming, and blocking thread-based programming, are fundamentally the same. The problem of maintaining your state in between event triggers is explicit in the first style; it's implicit, in the thread stack and CPU instruction pointer, in the second style. In the second style, it's the OS which is running the event loop, and it dispatches events by resuming a continuation - a continuation that starts by "returning" from the blocking call.

Much of the translation from blocking style to event-based style is moving the work of dispatching and looping from kernel to userland. Other ancillary benefits, like reduced address space usage by blocked threads, are in principle also achievable in a threading model - e.g. by storing stack frames on the heap and being more aggressive about collecting them (assuming GC).

Other benefits of async - such as overlapping work - are also fairly trivially possible with threading, though less deterministic.

_delirium · on May 7, 2011

That's essentially the thesis of this 1978 paper, though they group evented programming under the broader "message-passing" style: http://www.sics.se/~adam/pt/duality78.pdf

barrkel · on May 7, 2011

Indeed. And the corollary is that every possible performance benefit of non-blocking code is in principle achievable with threaded code, though some of the scheduler may need to be implemented in userland, with more explicit hinting of things like lifetimes of local variables in stack frames, etc., to get there. The true benefit of event-oriented network programming may be simply in what it makes explicit vs implicit. By making the tracking of state awkward, it forces its minimization and consolidation. By a process of Worse is Better, it ends up being more performant for scaling despite (usually) being more awkward to program (especially if you have any kind of nesting or recursion structure in your protocol).

ankrgyl · on May 8, 2011

Remember that modern implementations of kernel threads are designed to be general-purpose. So, while a threaded model would solve the problems, it also incurs necessary overhead (that works towards solving irrelevant problems). The event-driven approach can be thought of as minimalistic user-space threads for this context. Libraries like Python Eventlet (http://eventlet.net/) and CML (http://cml.cs.uchicago.edu/) are trying to reabstract this into the conveniences of thread-style programming.

wildmXranat · on May 7, 2011

I agree with you mostly, and I think there is a misunderstanding of the terms. It's possible to serve large amounts of requests with threads as well as with event loops, but there is a distinction between the two on how they consume resources. The author is wrong in assuming threads will not scale, as they will, but a Java web framework using threads will eat up hundreds of megs under heavy load, while an event loop will keep on churning away with progressively longer response times. It's not so much a case of lying as much not knowing when to stop presuming instead of investigating the actual facts.

ZoFreX · on May 7, 2011

There are resource issues either way, in both cases you have to store state. With threads this can quickly become expensive if you keep the stack size for all those threads at the default amount (a whopping 48K if memory serves). I would say it's easier to use smaller amounts of memory with event-driven programming, but you still have to keep some data around per-client. Of course, if the situation is very asynchronous (seconds instead of milliseconds) then you don't necessarily need all of this hanging around in RAM.

bdarnell · on May 7, 2011

Agreed. If requests per second is the metric you're optimizing for, then threads are a perfectly fine solution (see e.g. Paul Tyma's presentations about how threads are superior to event-driven approaches in Java). The strongest case for event-driven servers is for cases like long polling, where a connection will sit idle for a minute or more and the number of connected idle clients becomes an important metric.

ankrgyl · on May 7, 2011

Even with Python (CGI backed by Apache), you can scale with threads. The issue of scaling with threads vs. events is a pretty hot debate, and I think the author sets up this kind of criticism by addressing it poorly. Personally, I fall into the event-driven camp, because it involves the operating system as little as possible (only file descriptors). The C10K link has a great overview of the threaded approach (http://www.kegel.com/c10k.html#threaded). Read through the rest of the page for a discussion on the pros/cons of other approaches.

cdavid · on May 7, 2011

Another issue with evented frameworks in python is that you have to ensure nothing is blocking anywhere in your code, which becomes harder the more complex your application becomes. People will often mention IO, database, etc... forgetting that it is also an issue if your request handler takes CPU for N ms (e.g. encoding a relatively large payload in json, etc...). everything needs to be written with async in mind.

Languages/frameworks where async IO is an implementation detail (like some web frameworks in haskell) don't have this issue.

ankrgyl · on May 8, 2011

I don't think of this as a problem. Tornado is just a small and lightweight non-blocking HTTP server with a couple of convenient libraries. There are some pretty awesome, elegant solutions to the problem you point out, and I like to think of these as extensions to the small Tornado framework.

One really cool one is http://thomas.pelletier.im/2010/08/websocket-tornado-redis/, which demonstrates how to use threading to support Redis pub/sub.

For heavy computational tasks that aren't time-critical, you could have an accessory worker thread that chugs queued computations (yay first-class functions) in computational downtime.

ntoshev · on May 8, 2011

Tornado devs recommend using blocking mysql drivers and disk I/O. These operations have relatively low and predictable latencies. You can run more Tornado processes than the number of cores is if CPU is underutilized.

jrockway · on May 8, 2011

Well, actually, it involves the operating system rather heavily. select, epoll, kqueue, etc., etc., are not user-space things, after all, and if the OS implementation is bad, you are going to run into trouble.

ankrgyl · on May 8, 2011

Yes, they of course involve the OS, but are much lighter than overhead from spawning a thread. Furthermore, what's the point in considering bad OS implementations? Linux is free and does a good enough job, so that should be the baseline for considering performance.

ZoFreX · on May 7, 2011

I'm not entirely in either camp, I don't think event-driven programming is worse or even bad. I'm just sick of false arguments in either direction! (Especially as a Java developer who tried to learn nio only to find out that using lots of threads was fine all along... not a pleasant journey!)

jrockway · on May 8, 2011

But I'll bet that a node.js HTTP server handling 10,000 slow downloads is going to use a lot less memory than 10,000 Java+OS threads handling those same downloads.

ericflo · on May 7, 2011

Yeah, whenever people say threads don't work, point them to Varnish. It uses threads intelligently and everyone agrees it's fast and awesome.

thejbf · on May 7, 2011

Yes, but I can still spawn thousands of event-based Scala actors on the same machine. The actual problem with author: He believes running long and costly operations on an HTTP request is OK which is never and ever right. If you need to wait 5 seconds to perform a task, make it asynchronous.

jparise · on May 7, 2011

Some of this is a bit dated. It looks like it was written in late 2009, judging by the first comment. For example, IO loop timers are now stored using a heapq[1] instead of a sorted list, and many of the other issues have been similarly addressed over the last year and a half.

Still, this is a good general overview of Tornado's internals.

[1] https://github.com/facebook/tornado/commit/b6c4d6d20196fa4fe...

bdarnell · on May 7, 2011

The heapq change was just last week and has not yet been in any release, so that's a perfectly reasonable omission. :) But yes, this is about a pre-1.0 release, so I would encourage anyone interested in exploring tornado internals to check out a more recent release.

_iv38 · on May 8, 2011

There was a very good talk about this at #nodeconf this week by Tom from Joyent. The basic summary was that threaded network servers, like Apache, say, approach the speed problem by "pre-allocating" resources, a chunk for each thread/instance. The issue is that there are a finite number of resources on each machine (RAM mostly), and blocked threads are eating up a slice of those resources even when doing no work. This sets the upper bound of number of simultaneous connections that can be handled.

In an event based system the overhead for each connection is at least three orders of magnitude lower, sometimes four or five (hope I'm remembering this right). This translates into _considerable_ increase in number of connections that can be handled simultaneously, not quite the equivalent number of orders of magnitude due to other aspects of the system becoming more of the bottle-necks, but still dramatic.

This was a data-driven observation, not conjecture or assertion. I listened in during a break while the nodejs guys argued about approaches to getting TLS working better - they were worrying about the 1MB overhead for a TLS connection because that's a significant percentage of a connection handler. Think about that for a minute in the context of an apache threaded instance. 1MB matters? wow!

I'm new to this area and this was very interesting stuff and seemed related to the various discussions below about a threaded system can do what an event based system can do.

rdtsc · on May 7, 2011

> I also feel obliged to talk a bit on nonblocking IO or asynchronous IO (AIO)

Wait, that is not correct. His IO is blocking. He gets a notification when data is ready, from the select/poll/epoll (the asynchronous part), but when IO is actually performed (read, write, recv, etc), the operation it is still happening in the main thread in user space, and it blocks it.

Currently only the file system (I think) has truly asynchronous and non-blocking IO. It is provided by the aio_* set of system calls and is has been sort of an exotic beast (it is not that popular).

Here is a good chart of the possible IO types and their combinations:

http://www.ibm.com/developerworks/linux/library/l-async/

And of course:

http://www.kegel.com/c10k.html

justincormack · on May 7, 2011

No the network sockets are set to non block and will return E_AGAIN once you have emptied the kernel buffer. Thats still non blocking.

I dont think he is using aio_* which Linux does not really implement usefully.

kingkilr · on May 7, 2011

Python doesn't provide an API to aio_* because they're basically useless (on Linux at least).

rdtsc · on May 7, 2011

No. They would return E_AGAIN after the data has been read or written in blocking mode. If select comes back and says "read is ready on socket 4" then user code does read on socket 4 and gets (say) 4K of data followed by E_AGAIN. Getting that 4K of data happens in a blocking mode. To make it non-blocking you would provide a pointer to your user space buffer to the kernel and it would determine when IO is ready _and_ proceed to put that data in the buffer.

haberman · on May 7, 2011

The way you are using the term "blocking" is non-standard and not that useful. You are calling the syscall "blocking" just because it performs its work inline (in this case, copying a kernel buffer into a user-space buffer). By this definition, every syscall is blocking, even aio_read() because even aio_read() performs some work inline (namely enqueing a read request).

In common usage, an I/O operation is considered "blocking" if the system call does not return until data is available. If on the other hand you set O_NONBLOCK on a fd, you will get EAGAIN when no data is available, so clearly that is the accepted meaning of "nonblocking."

rdtsc · on May 7, 2011

> You are calling the syscall "blocking" just because it performs its work inline.

Yes because there is _another_ mode of IO where it doesn't have to do that. So the reason that is called "blocking" is because there is another mode of operation (albeit it is exotic) where the process does not block while performing IO -- the kernel reads/writes the data on behalf of the process. If this second mode of performing IO did not ever exist you could just interchange arbitrarily asynchronous and non-blocking as synonyms.

Since you didn't bother reading the link I posted here is the basic matrix of IO operations:

                    Synchronous            Asynchronous

    Blocking         read/write           select/[e]poll

 
    Non-Blocking    O_NONBLOCK+E_AGAIN      aio_*

There has been talk and various attempt at implementing aio_* style IO for network sockets on Linux, but nothing good so far.

> Even aio_read() performs some work inline (namely enqueing a read request).

Enqueuing the read request is not the same as actually copying gigabytes or terabytes of data from disk where the actual work is performed.

tptacek · on May 7, 2011

I think the DeveloperWorks article you've cited is simply wrong, and that's sent you into a tailspin. Select and poll aren't "ways to implement asynchronous blocking I/O".

I think you'll find these links more helpful than the misleading one you've been using.

http://www.circlemud.org/~jelson/software/fusd/docs/node36.h...

http://people.freebsd.org/~hmp/stuff/docs/freebsd_kse.pdf

I could always be wrong about this stuff; maybe I'm the one with the broken semantics. But I'm pretty sure I'm not.

haberman · on May 7, 2011

> Since you didn't bother reading the link I posted here is the basic matrix of IO operations.

Actually I did. You are misinterpreting what it is saying. The reason it is calling the select() model "blocking" is because you block during the select(), not because the read() itself is blocking.

This is also non-standard usage: most people wouldn't refer to a select()-based loop as blocking I/O, because a program generally only calls select() when it has nothing else to do but service I/O, so the select() does not "block" the application.

tptacek · on May 7, 2011

A process blocks when it needs to be rescheduled, yielding back to the process scheduler pending I/O. That's what "block" means. It doesn't mean "all the time consumed by any operation performed by the process". For instance, a process doesn't "block" when a hash table lookup unexpectedly trends O(n) due to collisions.

On a nonblocking socket, the I/O operations you're talking about are simple u/k and k/u buffer copies.

I think between aio_+ and the C10k page, you may have gotten a bit scrambled. Whether you have an event loop or not, disk I/O is often transparently blocking, even when you try to set descriptors nonblocking. But I/O operations on a nonblocking socket don't wait the process. If there's data in the buffer, you get the data; if there isn't, you get the error.

rdtsc · on May 7, 2011

> On a nonblocking socket, the I/O operations you're talking about are simple u/k and k/u buffer copies.

But because there is an IO mode where these copies are done by the kernel when data arrives not by the user process (after a kernel notification), there is now a differentiation between blocking and asynchronous. Asynchronous refers to IO readiness notifications, "blocking" refers to the copying of data (the actual IO if you will).

> A process blocks when it needs to be rescheduled, yielding back to the process scheduler pending I/O. That's what "block" means.

However when your process is copying data it is in the running state and preventing other processes from running. It could be doing something else or it let other processes run in the meantime.

> The I/O operations you're talking about are simple u/k and k/u buffer copies.

That is true, however if the data comes in very small chunks very fast you are doing a lot of switching to user space and a lot of small context switches to, say copy 1K of data. When you could just request that the kernel fill up your 10MB buffer with data from a socket and tell you when it is ready. If there is anything I learned is to never just say "it is a simple copy". Today's memory is not very fast compared to CPU speeds and copying is not something to be taking lightly. It is one thing when looking at a toy example, another thing when dealing with realtime systems or large data sets.

barrkel · on May 7, 2011

A blocked process is a process which is not in a running or ready state in the kernel. The userland process entering the kernel on its stack isn't sufficient to change this state; nor is starting or finishing a memory copy operation, particularly one which is done on the CPU (things might be different if DMA were involved). Blocked, running, ready etc. are process scheduling concepts and are about sharing a limited resource, specifically the CPU. A process busy copying memory, but using the CPU, is not blocked, because another process may not be running on that CPU instead.

tptacek · on May 7, 2011

I don't believe that "blocking" refers to the copying of data. I don't believe that a process can be "blocking" and "running". Where are you getting this from?

rdtsc · on May 8, 2011

I guess "blocking" is overloaded here. To me it refers to the process of performing IO, but you think about process states from the point of view of the scheduler (as in RUNNING state vs BLOCKING (or IOWAIT)).

Yes, if aio_* system was never invented, we wouldn't be arguing, but because it exists it introduces a new possible way of doing IO. It is a general enough way of doing IO and at hardware level we have DMA but in the land of syscalls we have aio_.

Ok, just curious, what words would you use to describe what aio_ calls do?

Over the years many have thought of a way to bring that style of IO to networking. So far it works for disk IO very well. For example take a look at this benchmark from lighttpd for a sendfile:

http://blog.lighttpd.net/articles/2006/11/09/async-io-on-lin...

It looks like there is a consistent 50% improvement in speed when using aio with a 1MB block.

Wouldn't it be nice to have that kind of improvement for network sockets as well? I think it would be and there have been many attempts over the years but nothing good yet has happened.

tptacek · on May 8, 2011

It's not overloaded. Respectfully, you simply don't understand what it means.

FooBarWidget · on May 8, 2011

It would be good to have more I/O API calls available that don't require as much copying but so far there simply aren't any such APIs that are mature and stable. POSIX AIO is only supported in a very limited way; on Linux it only works for files, not sockets. splice() only works between files/pipes and TCP sockets but not Unix domain sockets. Etc etc.

paisawalla · on May 7, 2011

I think you are confusing the definitions of non-blocking IO and memory-mapped IO.

rdtsc · on May 7, 2011

I am not talking about in-kernel hardware mechanisms like DMA and such. Or about memory mapping a file. It is about the system call interface to perform IO. Yeah, I guess you can call it memory-mapped but I don't think that is quite accurate in this context.

tptacek · on May 7, 2011

The system call interface to perform I/O cannot be described as "memory mapped"; the read/write system calls are programmed I/O. The opposite of memory mapped. You're letting terminology get you into trouble. You should just let go of your definitions and make the broader point you're trying to make, which might still be salvageable.

rdtsc · on May 8, 2011

I agree, "memory mapped" is the wrong term. I was just letting the original poster who called it this way keep that as a concession or a mental model. That was my mistake as it just confuses everyone even more.

ZeSmith · on May 7, 2011

Reminds me of this: http://teddziuba.com/2009/09/twisted-vs-tornado-youre-both.h...

maurycy · on May 7, 2011

Personally, I would be more interested in understanding the code inside Tornado, a column of air. :-)