SEDA Paper Synopsis

Someone recommended that I read the SEDA architecture paper, which talks about building systems that degrade gracefully under extreme load. After doing so, I felt the claims were exaggerated and the ideas rather simplistic; however, it’s possible that at the time the paper was written, the ideas they laid out were novel.

The paper starts off by talking about the problem with one-thread-per-request, which should seem obvious to anyone who has dealt with multithreading before – if you get a flood of requests, you will spawn so many threads that they will spend much of their time competing for CPU share. As a result, the throughput of the system actually degrades under high load.
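To make the problem concrete, here is a minimal sketch of the thread-per-request model (the handler and names are illustrative, not from the paper):

```python
import threading

def handle_request(req, results):
    # Stand-in for real request processing (parse, work, respond).
    results.append(req)

def serve(requests):
    """Thread-per-request: one new thread per incoming request.
    Under a flood of requests this spawns an unbounded number of
    threads, which then contend for CPU share and memory, so overall
    throughput degrades precisely when load is highest."""
    results = []
    threads = [threading.Thread(target=handle_request, args=(r, results))
               for r in requests]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(len(serve(range(100))))  # 100 requests -> 100 threads
```

Nothing here bounds the number of threads, which is exactly the failure mode the paper describes.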

Next, the authors move on to the common solution to this problem: a threadpool. In this model, you have a fixed number of threads, and extra requests are queued until a thread becomes available. However, it has one disadvantage: if the queue grows long, requests at the back may wait a very long time for a response.
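In Python this model is a few lines (the handler is a stand-in; the pool size of 4 is arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

def handle(req):
    # Stand-in for real request processing.
    return req * 2

# A fixed pool of 4 worker threads; extra requests wait in the
# executor's internal queue until a worker frees up. A long queue
# means a long wait before a request is even started.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(10)))
print(results)
```

The thread count is now bounded, but latency is unbounded: the queue absorbs the overload instead of the scheduler.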

The next sections talk about event-driven concurrency, in which an event scheduler handles data arriving from the network, disk, etc., and dispatches it to finite state machines that run on a fixed number of threads. They claim this solves the problems of the threadpool, but it’s not clear to me how it does. I suppose the event scheduler could detect that different requests take different amounts of processing time and reorder them (or send back a failure message early). However, in the simple case where all messages are identical, it cannot perform any better than a threadpool. The paper does not say whether prioritization of different messages is a key point. They also mention that event-driven concurrency has its own set of issues, like the difficulty of programming the scheduler and ensuring that the finite state machines never perform blocking operations.
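A toy version of the idea, as I understand it (class and method names are mine, not the paper's): each request is a small state machine, and a single scheduler loop dispatches ready events to them, so no step may block.

```python
from collections import deque

class RequestFSM:
    """Finite state machine for one request. Every step must be
    non-blocking so the scheduler thread is never stalled."""
    def __init__(self, data):
        self.state = "READ"
        self.data = data
        self.result = None

    def step(self):
        if self.state == "READ":
            self.state = "PROCESS"      # e.g. request fully read
        elif self.state == "PROCESS":
            self.result = self.data * 2  # stand-in for real work
            self.state = "DONE"

def scheduler(machines):
    """Single-threaded event scheduler: repeatedly dispatches to
    FSMs until every request reaches DONE."""
    ready = deque(machines)
    while ready:
        fsm = ready.popleft()
        fsm.step()
        if fsm.state != "DONE":
            ready.append(fsm)
    return [m.result for m in machines]

print(scheduler([RequestFSM(i) for i in range(5)]))  # [0, 2, 4, 6, 8]
```

Note that if every FSM does identical work, this is just a round-robin over the same jobs a threadpool would run, which is the basis of my skepticism above.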

At this point, I should mention that the paper suffers from poor writing, like most academic publications.

The final background section discusses structured event queues, which appears to be event-driven concurrency with more queues in between the processing stages. A clear explanation is not given.

The paper proceeds with the SEDA architecture itself: a collection of threadpools connected by queues, each with a threadpool controller and a batch controller (for auto-tuning). The authors spill much more ink describing this system. Each threadpool is responsible for one stage of request processing, and after finishing its stage, it passes the data to the next threadpool via a queue. The SEDA architecture fixes all the issues previously mentioned (or so they claim), is (supposedly) very fast, and is generic enough to be used in a wide variety of situations. They also delve into the trade-offs of batching (e.g., how many queued requests a thread should consume when it wakes up) and how the system can auto-tune itself by watching each threadpool and its queue size. They perform some simple experiments in which they claim their Java implementation of SEDA beats Apache.
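A minimal sketch of the stages-connected-by-queues structure (the two stage names, "parse" and "render", and all handlers are invented for illustration):

```python
import queue
import threading

SENTINEL = object()  # shutdown marker for the sketch, not part of SEDA

def run_stage(workers, handler, inbox, outbox):
    """One SEDA-style stage: a small threadpool that drains its event
    queue and enqueues results for the next stage."""
    def worker():
        while True:
            item = inbox.get()
            if item is SENTINEL:
                inbox.put(SENTINEL)  # re-post so sibling workers stop too
                return
            outbox.put(handler(item))
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    return threads

# Hypothetical two-stage pipeline: "parse" then "render".
incoming, parsed, responses = queue.Queue(), queue.Queue(), queue.Queue()
parse_threads = run_stage(2, lambda r: ("parsed", r), incoming, parsed)
render_threads = run_stage(2, lambda p: ("rendered", p[1]), parsed, responses)

for req in range(8):
    incoming.put(req)
incoming.put(SENTINEL)
for t in parse_threads:
    t.join()
parsed.put(SENTINEL)  # first stage drained; tell the second to finish
for t in render_threads:
    t.join()

results = sorted(responses.get() for _ in range(8))
print(results)
```

Each stage's queue is exactly the kind of thing the controllers would watch: its length is a direct signal of whether that stage is keeping up.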

I am skeptical of these results for several reasons.

Let’s start off with the speed claims. First, the experiment was very simplistic (repeatedly reading a file and serving it). Second, it’s roughly 15 years later and I do not know of any modern web server that uses this architecture. Third, Apache is written in C, while their implementation was written in Java circa 2001, when Java was quite slow. I believe it’s more likely that they tested under a scenario that was quite favorable to them, and did not test under more realistic scenarios that would have pummeled their server.

Finally, their auto-tuning seems too good to be true. In the paper, they claim this feature makes their system simpler to use while still being very performant. However, the algorithms they describe seem too simple to be resilient. For example, the batch controller maintains a moving average of throughput for each threadpool; it decreases the batching factor until throughput starts to degrade, and after that, it increases the batching factor if throughput starts to degrade again. Sudden drops in load reset the batching factor to its maximum. Why is this a good strategy? Is there any theory or rationale behind it? If there was, I did not find it in the paper (other than that it performed well in their experiments). I can easily imagine it failing badly under situations they did not test.
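To show how little is going on, here is my rough reading of that heuristic as code. This is my reconstruction, not the paper's algorithm: the smoothing constant, the "sudden drop" threshold, and the step sizes are all made-up values.

```python
class BatchController:
    """Sketch of the batching heuristic as I read it: keep a moving
    average of a stage's throughput, shrink the batching factor until
    throughput degrades, then grow it again when it degrades further;
    a sudden drop resets the factor to its maximum.
    All thresholds and constants below are invented for illustration."""
    def __init__(self, max_batch=64, min_batch=1):
        self.max_batch = max_batch
        self.min_batch = min_batch
        self.batch = max_batch
        self.avg = None          # exponential moving average of throughput
        self.shrinking = True    # phase 1: reduce the batching factor

    def observe(self, throughput):
        prev = self.avg
        self.avg = (throughput if prev is None
                    else 0.7 * prev + 0.3 * throughput)
        if prev is None:
            return self.batch
        if throughput < 0.5 * prev:          # "sudden drop": reset
            self.batch = self.max_batch
            self.shrinking = True
        elif self.shrinking:
            if self.avg < prev:              # throughput degraded: stop
                self.shrinking = False
            else:
                self.batch = max(self.min_batch, self.batch - 1)
        elif self.avg < prev:                # degrading again: grow
            self.batch = min(self.max_batch, self.batch + 1)
        return self.batch
```

Every constant here is a knob with no theory behind it, which is exactly my complaint: the paper gives no guidance on how such a controller behaves outside the regimes they measured.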

Although I have been critical of the paper, I am glad that I read it. It made me think about different ways to implement performant systems, and brought up a few issues that are good to think about. However, I somehow doubt that I will be using the SEDA architecture the next time I need to design a robust system.