What is preemption?
Preemption, otherwise known as preemptive scheduling, is an operating system concept that allows running tasks to be forcibly interrupted by the kernel so that other tasks can run. Preemption is essential for fairly scheduling tasks and guaranteeing that progress is made, because it prevents tasks from hogging the CPU either unwittingly or intentionally. And because it’s handled by the kernel, tasks don’t have to worry about voluntarily giving up the CPU.
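To make that last point concrete, here’s a minimal sketch (my own illustration, not from the kernel) of what cooperative scheduling asks of a task: a CPU-bound loop that periodically calls sched_yield(2) so other tasks get a chance to run. On a preemptive kernel the scheduler can interrupt the loop at any point, so the explicit yield stops being load-bearing.

    /* cooperative.c - a CPU-bound loop that voluntarily yields.
     * Without kernel preemption, the sched_yield() call is the only
     * point where other tasks of the same priority can get the CPU;
     * with preemption, the kernel can interrupt the loop anywhere. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        for (unsigned long i = 0; i < 100000000UL; i++) {
            /* ... some CPU-bound work ... */
            if (i % 1000000 == 0)
                sched_yield();  /* voluntarily give up the CPU */
        }
        printf("done\n");
        return 0;
    }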
It can be useful to think of preemption as a way to reduce scheduler latency. But reducing latency usually also affects throughput, so there’s a balance that needs to be maintained between getting a lot of work done (high throughput) and scheduling tasks as soon as they’re ready to run (low latency).
The Linux kernel supports multiple preemption models so that you can tune the preemption behaviour for your workload.
The three Linux kernel preemption models
Originally there were only two preemption options for the kernel: running with preemption on or off. That setting was controlled by the kernel config option CONFIG_PREEMPT. If you were running Linux on a desktop, you were supposed to enable preemption to improve interactivity, so that when you moved your mouse the cursor on the screen would respond almost immediately. If you were running Linux on a server, you ran with CONFIG_PREEMPT=n to maximise throughput.
Then in 2005, Ingo Molnar introduced a third option named CONFIG_PREEMPT_VOLUNTARY that was designed to offer a middle point on the latency-throughput spectrum: more responsive than disabling preemption and offering better throughput than running with full preemption enabled. Nowadays, CONFIG_PREEMPT_VOLUNTARY is the default setting for pretty much all Linux distributions, since openSUSE switched at the beginning of this year.
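If you want to check which model your own kernel was built with, the build config usually ships alongside the kernel image. Here’s a small sketch that assumes your distribution installs it as /boot/config-<release> (most do; the file name and location are a packaging convention, not guaranteed). It simply prints every config line that mentions PREEMPT.

    /* preempt-check.c - print the PREEMPT-related kernel config options,
     * assuming the build config lives at /boot/config-<release>. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/utsname.h>

    int main(void)
    {
        struct utsname un;
        char path[512], line[256];

        if (uname(&un) != 0) {
            perror("uname");
            return 1;
        }
        snprintf(path, sizeof(path), "/boot/config-%s", un.release);

        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "PREEMPT"))  /* e.g. CONFIG_PREEMPT_VOLUNTARY=y */
                fputs(line, stdout);
        fclose(f);
        return 0;
    }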
Unfortunately, choosing the best Linux kernel preemption model is not straightforward. Like with most performance topics, the best way to pick the right option is to run some tests and use cold hard numbers to make your decision.
What are the differences in practice?
If you’re interested in reproducing the tests on your own hardware, here’s how to do it.
cyclictest records the maximum latency between when a timer expires and when the thread that set the timer runs. It’s a fair indication of worst-case scheduler latency.
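To get a feel for what that measurement looks like, here’s a stripped-down sketch of cyclictest’s core idea (my own simplification, not the tool’s actual source): arm an absolute timer with clock_nanosleep(), then compare the requested wake-up time with when the thread actually ran. The 1 ms interval is an arbitrary choice, and the real tool adds real-time priorities, per-CPU threads, histograms, and much more.

    /* wakeup-latency.c - measure how late a thread wakes up after
     * its timer expires, the core of what cyclictest reports. */
    #include <stdio.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000L
    #define INTERVAL_NS  1000000L   /* wake up every 1 ms */

    int main(void)
    {
        struct timespec next, now;
        long max_lat = 0;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < 1000; i++) {
            next.tv_nsec += INTERVAL_NS;
            if (next.tv_nsec >= NSEC_PER_SEC) {
                next.tv_nsec -= NSEC_PER_SEC;
                next.tv_sec++;
            }
            /* Sleep until the absolute deadline, then see how far
             * past it the scheduler actually ran us. */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            long lat = (now.tv_sec - next.tv_sec) * NSEC_PER_SEC +
                       (now.tv_nsec - next.tv_nsec);
            if (lat > max_lat)
                max_lat = lat;
        }
        printf("max latency: %ld us\n", max_lat / 1000);
        return 0;
    }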
The above results show that the best (lowest) latency is achieved with CONFIG_PREEMPT. It’s not a universal win, as you can see from the first data point, but overall CONFIG_PREEMPT does a decent job of keeping those latencies down. CONFIG_PREEMPT_VOLUNTARY is a good middle ground and exhibits slightly worse latency, while CONFIG_PREEMPT_NONE shows the worst (highest) latencies of all.
Based on the descriptions of the kernel config options given in the preemption models section, I’m sure we can all agree these are roughly the results we expected to see.
Next, let’s look at sockperf’s TCP throughput results. sockperf is a network benchmark that measures throughput and latency over TCP and UDP. For this experiment, we’re only interested in the throughput scores.
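For intuition, here’s a rough sketch of what a TCP throughput test boils down to (a toy over loopback, not sockperf’s implementation): one process drains a socket while another writes fixed-size messages for a few seconds, and the score is bytes moved per second. The 1 KiB message size and 5-second duration are arbitrary choices.

    /* tcp-tput.c - a toy loopback TCP throughput measurement. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define MSG_SIZE 1024   /* bytes per write() */
    #define DURATION 5      /* seconds to run */

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        socklen_t len = sizeof(addr);

        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;  /* let the kernel pick a free port */
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        getsockname(lfd, (struct sockaddr *)&addr, &len);
        listen(lfd, 1);

        if (fork() == 0) {  /* child: accept and discard everything */
            int cfd = accept(lfd, NULL, NULL);
            char buf[MSG_SIZE];
            while (read(cfd, buf, sizeof(buf)) > 0)
                ;
            _exit(0);
        }

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));

        char msg[MSG_SIZE] = {0};
        long long total = 0;
        time_t start = time(NULL);
        while (time(NULL) - start < DURATION) {
            ssize_t n = write(fd, msg, sizeof(msg));
            if (n > 0)
                total += n;
        }

        printf("%.1f MB/s\n", total / (double)DURATION / (1024 * 1024));
        close(fd);          /* EOF makes the child exit */
        wait(NULL);
        return 0;
    }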
It’s a little hard to make out some of the results, but each of the different message sizes shows that CONFIG_PREEMPT_NONE achieves the best throughput, followed by CONFIG_PREEMPT_VOLUNTARY, with CONFIG_PREEMPT coming last. Again, this is the expected result.
Things get a little weirder with sockperf’s UDP throughput results. CONFIG_PREEMPT_VOLUNTARY consistently achieves the highest throughput. I haven’t dug into exactly why this might be the case, but my best guess is that preemption doesn’t matter as much for UDP workloads because UDP is stateless and, unlike TCP, doesn’t require messages such as acknowledgements to flow back from the receiver to the sender.
If you’ve got any ideas to explain the UDP throughput results please leave them in the comments!