When you have a limited supply of some resource, and a demand for that resource that exceeds the supply, you have something economists call a shortage. In this article I explain how we deal with shortages of resources (called congestion) in software systems.
In economics we learn about the concept of Supply and Demand. Simply put, as something becomes more scarce, the price will adjust upward until an equilibrium is met between the supply and the demand. This holds true for situations where the pricing can be adjusted.
So, what happens when the pricing is fixed? You need some strategy to deal with the scarcity. A discipline known as queueing theory can describe this. Let’s think about a common example of a system where the demand (work) exceeds the supply (capacity). I will explain these in terms of system engineering, as they would pertain to a software system.
In the grocery store there are a finite number of checkout clerks. We will call the lines of waiting shoppers queues. If queue length increases, the system is experiencing congestion. When this happens the store manager calls in more workers (store employees) to help with checkouts. This capacity expansion process may continue until all checkstands are open, or until there are no more available workers to add. This condition is the maximum capacity of the system. If we are at maximum capacity, we hope that the congestion is eliminated. If expanding the capacity did not alleviate the congestion, then the shoppers must wait. New shoppers entering the store may see the long queues, and decide to leave. Some people standing in line may abandon the line, and leave the store without completing their intended purchase.
Having multiple shorter queues gives shoppers the illusion that the congestion is lower than it actually is. Some checkout queues move more quickly than others, so having one queue per checkstand is not a fair system. A shopper may wait longer than they deserve because of what queue they select. Sometimes shoppers will jump from one queue to another if they feel they won’t have to wait as long.
Lesson One: Use a combined queue instead of multiple separate queues
When all shoppers have equal priority, it’s better to use a common queue that’s serviced by multiple workers. This way every shopper in the queue waits a fair share of the congestion backlog. Nobody is forced to wait longer because of queue selection, since there is only one. This also eliminates the inefficiency of changing between queues when one is slow.
Lesson Two: When using a priority queue, limit the number of priority waiters
This leads to the temptation to have multiple classifications of shared queues. If you want to have a sense of a premium service for select priority shoppers, you can create two queues. The priority queue is for your select priority shoppers, and the main queue is for everyone else. This works when the population of priority shoppers is smaller than the general population and they arrive in the line infrequently. In this situation, the next available worker can select a shopper to service from the priority queue before they service someone from the main queue. If there are too many priority shoppers, the main queue does not get serviced, which leads to further congestion, and eventually the waiters will abandon the main queue.
Lesson Three: Use Admission Control
Now, you are familiar with the example of using the store, and simple queues, and the concept of consolidating multiple queues into a single queue, or the combination of a single queue and a priority queue. These are generally easy concepts to implement in software, and for most cases where congestion is rare, they work fine. However there are some conditions where they don’t work at all. Remember that in software systems, queued jobs may or may not be able to abandon the queue. Here are examples of when a simple queue with no admission control will fail:
- When you have the ability to create an unlimited number of workers, but they all rely on some shared pool or pools of limited resources. This is called a concurrency bottleneck. For example, when you have a finite amount of CPU resources, but the ability to create an effectively unlimited number of threads. There is an upper limit to the number of additional throughput gained by adding more threads when the CPU is congested.
- When there is an unlimited number of jobs (waiters) to service, that arrive suddenly in bursts or steadily at a rate that’s faster than the combined capacity of all your workers. This is called a work overflow.
The concept of admission control allows you to implement a policy controlling how much work you will accept into your queue(s) and at what rate. One simple policy is a maximum queue length. This length (limit) can be chosen by considering your desired maximum wait time (max) and your average service time (t) using the formula:
limit = ( max / t )
When limit jobs are queued, you must not queue new work, but instead refuse the work. Some protocols have a response code that can be used for this. For example, in HTTP applications, you can return a 502 Server Busy response with an optional Retry-After header that indicates to the client that they may retry the request at the specified later time when the congestion may be gone.
A more advanced policy may use a strategy where the source of work and the type of request, and apply a different policy. For example, you may limit the rate at which you accept requests from a given IP Address or geography. You may prioritize write requests over read requests. If an individual resource is congested, you may reject requests for that resource, but accept other requests.
There are a number of advantages to using an admission control policy:
- You don’t end up overloading your system to the point where service quality degrades for everyone due to extreme congestion conditions.
- You can temporarily smooth out your work pattern so that the work can be completed at some later time when a temporary surge in demand has subsided.
- It is very easy to monitor for congestion. A simple external check of your service will return an error during congestion conditions.
- If the system has elastic capability to add more worker resources, this can be triggered when queue lengths are consistently non-zero over a span of time, and reduced when queue lengths are consistently zero over a span of time.
Lesson Four: Beware of the Thundering Herd and the Convoy Effect
If you have a workload that experiences sudden bursts where new work is rapidly queued, and you have not properly limited the number of workers to a level of concurrency that you can manage with your available system resources, then you may experience a condition known as resource starvation. This can lead to a complete system failure, typically in a cascading series of related failures. When your system is in a state of resource starvation, it is too busy handling multitudes of concurrent work requests that it’s unable to make meaningful progress servicing the work. Nothing completes in a reasonable time because you are too busy switching between all the requests making tiny increments of progress on each.
A Convoy Effect is where you have numerous threads or processes that are blocked (suspended making no progress) on some resource (like an I/O operation, or a lock) and then cause additional congestion when the blocking condition clears. At this point all the runnable threads are synchronized together in a convoy, requesting other scarce resources (like Disk I/O bandwidth to write a lock file) and that the existence of the convoy is causing the delays to be further delayed than if there were simply a long queue of work. For example:
- A long queue of requests is present.
- All workers dequeue the requests rapidly, but all of them need to read and write to the same file.
- Concurrent reads are allowed with a shared read lock, and concurrent writes are protected by an exclusive write lock.
- The underlying lock implementation uses a spinlock.
- Multitudes of concurrent readers cause write access to the file to be substantially slowed down because the disk drive is chattering doing all the reads.
- All workers become synchronized on the exclusive write lock, slowing down the rate of progress, and essentially causing the system to be in a state of live lock.
A Thundering Herd is a situation where a multitude of serialized processes are blocked waiting on an event. When the event happens, all processes become runnable, but only one of them can be serviced at a time, so all the others must become blocked again waiting on a new event. This condition causes throughput of work to be suboptimal because of the wasted effort of the blocked processed getting woken up and going back to sleep. This is generally solved by only waking one of the blocked processes at a time.
So, a Thundering Herd will typically lead to a Convoy Effect, leaving your system in a critical state of congestion. Solve this by using sensible admission control, and more sophisticated queuing strategies.
Lesson Five: Multiplex de-queue of identical requests
If you have a queue of requests that are likely to have multiple identical queued entries, you can optimize your service of that queue by using a Multiplexing technique. Instead of each worker simply taking one request off the queue and servicing it, consider this approach instead:
- Read the first request on the queue.
- Scan remaining entries in the queue for other identical requests, de-queuing them together as a batch. You might de-queue up to some maximum number per batch, or de-queue all matches in a single batch. This depends on how large your responses will be, and how many items you have in queue.
- Process the request, sending the response to all clients in the batch at once.
Note: this is only appropriate in use cases where out-of-order responses are gracefully handled by the client.
You will need to be careful that your queue implementation will allow you to safely de-queue any entry in the queue without leading to a race condition where multiple workers are trying to de-queue the same entries simultaneously. You may decide that batching requests when they are queued up is more efficient than searching for matching requests when it’s time to de-queue. It would work in a similar way:
- Treat all queued entries as batches, with at least one request in each, up to some maximum.
- Hash the request as you receive it, and check to see if you have an existing entry or entries in the queue for the given request.
- If have a matching entry in the queue, then temporarily lock that entry, and add the new request’s client details (or connection handle) to the existing entry.
- If the batch is full, iterate to the next, creating a new batch when you reach the end of the list.
- Workers simply read the next batch from the queue, and send the response to all the associated clients.
In general, you want your workers de-queueing work faster than you can queue it up. If you do your batching upon receipt of the request, you avoid the risk that the efficiency cost of the batching does not lead to a Convoy Effect.
Feedback welcome.