Last week I gave a talk at the first virtual adhoc.community meetup on the history of task isolation on Linux (slides, video). It was a quick 15-minute presentation, and I think it went well, but I really wanted to include some details of how you actually configure a modern Linux machine to run a workload without interruption. That’s kinda difficult to do in 15 minutes.
So that’s what this post is about.
I’m not going to cover how to use the latest task isolation mode patches because they’re still under discussion on the linux-kernel mailing list. Instead, I’m just going to talk about how to reduce OS jitter by isolating tasks using Linux v4.17+.
First, as the below chart shows, you really do need a recent Linux kernel if you’re going to run an isolated workload because years of work have gone into making the kernel leave your tasks alone when you ask.
Each of these features is incremental and builds on top of the previous ones to quiesce a different part of the kernel. You need to use all of them.
Modern Linux does a pretty good job out of the box of allowing userspace tasks to run continuously once you set the right options. Here’s my kernel command-line for isolating CPU 47:
isolcpus=nohz,domain,47 nohz_full=47 tsc=reliable mce=off
The first option, isolcpus, removes every CPU in the list from the scheduler’s domains, meaning that the kernel will not do things like run the load balancer for them, and it also disables the scheduler tick (that’s what the nohz flag is for).
The second option, nohz_full, also disables the tick (yes, there’s some overlap of the in-kernel flags, which means you need both of these options) as well as offloading RCU callbacks and other miscellaneous items.
On my machine, I needed the last two options to disable some additional timers and prevent them from firing while my task was running.
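After rebooting, it's worth a quick sanity check that the options actually made it onto the kernel command line. Here's a minimal sketch of how that might look; `has_boot_param` is a hypothetical helper name of mine, not something the kernel or any tool provides.

```shell
# Succeed if the kernel command line in $1 contains the parameter named $2,
# either as "name=value" or as a bare "name".
has_boot_param() {
    case " $1 " in
        *" $2="* | *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which of the isolation options are present on a command line.
report_isolation_params() {
    for param in isolcpus nohz_full; do
        if has_boot_param "$1" "$param"; then
            echo "$param: present"
        else
            echo "$param: MISSING"
        fi
    done
}
```

On a live machine you would call it as `report_isolation_params "$(cat /proc/cmdline)"`. The kernel also exports the resulting CPU lists in `/sys/devices/system/cpu/isolated` and `/sys/devices/system/cpu/nohz_full`, which should both read 47 with the command line above.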
Once you’ve booted with these parameters (substitute your desired CPU list for 47), you’ll need to set up a cpuset cgroup to run your task in and make sure that no other tasks accidentally run on your dedicated CPUs. cset is definitely my favourite tool for doing this because it makes it so easy.
Now all you need to do is add the PID of your task to the new
cpuset and you’re good to go.
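The two cpuset steps above can be sketched with cset's `shield` subcommand. This is an illustration rather than the exact commands I ran: the `run`, `shield_cpus` and `shield_pid` helpers and the `DRY_RUN` variable are names I've made up here so that you can preview the commands without root.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Shield the CPU list in $1. cset puts those CPUs in a "user" cpuset and
# moves everything it can onto the remaining CPUs in a "system" cpuset;
# --kthread=on asks it to move the movable kernel threads as well.
shield_cpus() {
    run cset shield --kthread=on --cpu "$1"
}

# Move an already-running task (PID in $1) into the shield.
shield_pid() {
    run cset shield --shield --pid "$1"
}
```

For example, `DRY_RUN=1 shield_cpus 47` just prints the cset command, while running `shield_cpus 47` followed by `shield_pid <pid>` as root applies it for real.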
Verifying your workload is isolated
Of course, it’s all well and good me saying that these options isolate your tasks, but how can you know for sure? Fortunately, Linux’s tracing facilities make this super simple to verify and you can use ftrace to calculate when your workload is running in userspace by watching for when it’s not inside the kernel – in other words, by watching for when your workload returns from a system call, page fault, exception, or interrupt.
Say we want to run the following super-sophisticated workload without it entering the kernel:
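A minimal sketch of such a workload, assuming (going by the bash busy-loop mentioned at the end of this post) that it is nothing more than an empty shell loop. The `WORKLOAD` variable name is mine.

```shell
# An empty loop built only from shell builtins. Once it is spinning it
# should have no reason to make a system call, so any kernel entry we
# see in the trace is the kernel interrupting us, not the other way round.
WORKLOAD='while :; do :; done'
```

You would start it with `bash -c "$WORKLOAD"`, either launched inside the cpuset or moved there by PID afterwards.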
Here’s a sequence of steps (assuming you’ve already set up the cpuset with cset) that enables ftrace, runs the workload for 30 seconds, and then dumps the kernel trace to a file.
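As a rough sketch of the ftrace side of those steps, something like the following could work. The assumptions are mine: tracefs mounted at `/sys/kernel/tracing` (older setups use `/sys/kernel/debug/tracing`), only the `sched:sched_switch` event enabled, and an arbitrary 16 MiB per-CPU buffer. `trace_start` and `trace_stop` are hypothetical helper names.

```shell
# Location of tracefs; override TRACEFS if yours is mounted elsewhere.
TRACEFS=${TRACEFS:-/sys/kernel/tracing}

# Reset ftrace and start recording context switches.
# $1: per-CPU buffer size in KiB (defaults to 16384, an arbitrary choice).
trace_start() {
    echo 0 > "$TRACEFS/tracing_on"                   # pause while reconfiguring
    echo "${1:-16384}" > "$TRACEFS/buffer_size_kb"   # grow the ring buffer
    echo > "$TRACEFS/trace"                          # discard old events
    echo sched:sched_switch > "$TRACEFS/set_event"   # enable just this event
    echo 1 > "$TRACEFS/tracing_on"
}

# Stop recording and save the trace to $1 (defaults to trace.txt).
trace_stop() {
    echo 0 > "$TRACEFS/tracing_on"
    cat "$TRACEFS/trace" > "${1:-trace.txt}"
}
```

Run `trace_start` as root, let the workload spin inside the cpuset for 30 seconds, kill it, and then run `trace_stop`. If you also want to see interrupts, further events can be appended to `set_event` with `>>`, for example `irq:*`.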
The contents of your
trace.txt file should look something like this:
You want to make sure that you didn’t lose any events by checking that the entries-in-buffer/entries-written fields have the same values. If they’re not the same, events were dropped, and you can increase the buffer size further and run the test again.
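That check is easy to script. Here's a minimal sketch, assuming the usual ftrace header line of the form `# entries-in-buffer/entries-written: N/M   #P:48` near the top of the trace file; `trace_is_complete` is a hypothetical helper name.

```shell
# Succeed only if the trace file in $1 has a header whose
# entries-in-buffer and entries-written counts are equal,
# i.e. no events were overwritten in the ring buffer.
trace_is_complete() {
    awk '
    /entries-in-buffer\/entries-written:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\/[0-9]+$/) {
                split($i, n, "/")
                found = 1
                ok = (n[1] == n[2])
            }
        }
    }
    END { exit !(found && ok) }
    ' "$1"
}
```

For example, `trace_is_complete trace.txt || echo "events were lost"`. In ftrace the per-CPU buffer size is controlled by the `buffer_size_kb` file in tracefs, so that is the knob to turn if the check fails.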
The key part of the trace file is the tracepoint that tells you when a context switch completed. You can use this tracepoint to find when your bash process starts running and when it finishes (hopefully after 30 seconds have elapsed) with a bit of awk.
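A sketch of what that awk might look like, under a few assumptions of mine: the trace contains `sched_switch` events in ftrace's default text format (with `prev_comm=` and `next_comm=` fields), the workload's command name is `bash`, and it ran on the CPU you pass in. `bash_runtime` is a hypothetical helper name.

```shell
# Print the seconds between the first switch to bash and the last switch
# away from bash on one CPU.
# $1: trace file, $2: CPU number (e.g. 47)
bash_runtime() {
    awk -v cpu="$2" '
    BEGIN { cpu_field = sprintf("[%03d]", cpu) }
    {
        # Locate the event name; the timestamp is the field before it
        # and the CPU is three fields before it.
        for (i = 4; i <= NF; i++) if ($i == "sched_switch:") break
        if (i > NF || $(i - 3) != cpu_field) next
        ts = $(i - 1)
        sub(/:$/, "", ts)
        if ($0 ~ /next_comm=bash / && start == "") start = ts
        if ($0 ~ /prev_comm=bash /) end = ts
    }
    END {
        if (start == "" || end == "") exit 1
        printf "%.3f\n", end - start
    }
    ' "$1"
}
```

Running `bash_runtime trace.txt 47` should print something close to 30 if the workload ran for the whole test. Note that this only measures the span between the first and last switch, so it is worth also looking for any other events on that CPU in between.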
I’ve successfully used this technique to verify that I can run a bash busy-loop for an hour without entering the kernel.