Last week I gave a talk at the first virtual adhoc.community meetup on the history of task isolation on Linux (slides, video). It was a quick 15-minute presentation, and I think it went well, but I really wanted to include some details of how you actually configure a modern Linux machine to run a workload without interruption. That’s kinda difficult to do in 15 minutes.

So that’s what this post is about.

I’m not going to cover how to use the latest task isolation mode patches because they’re still under discussion on the linux-kernel mailing list. Instead, I’m just going to talk about how to reduce OS jitter by isolating tasks using Linux v4.17+.

First, as the chart below shows, you really do need a recent Linux kernel if you’re going to run an isolated workload because years of work have gone into making the kernel leave your tasks alone when you ask.

Linux task isolation features throughout the years

Each of these features is incremental and builds on top of the previous ones to quiesce a different part of the kernel. You need to use all of them.

Modern Linux does a pretty good job out of the box of allowing userspace tasks to run continuously once you set the right options. Here’s my kernel command-line for isolating CPU 47:

isolcpus=nohz,domain,47 nohz_full=47 tsc=reliable mce=off

The first option, isolcpus, removes every CPU in the list from the scheduler’s domains, meaning that the kernel will not do things like run the load balancer for them, and it also disables the scheduler tick (that’s what the nohz flag is for). nohz_full= disables the tick (yes, there’s some overlap of the in-kernel flags which means you need both of these options) and also offloads RCU callbacks and other miscellaneous items.

On my machine, I needed the last two options to disable some additional timers and prevent them from firing while my task was running.
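After rebooting, it's worth confirming that the kernel actually applied the options. The sketch below assumes the kernel exposes the isolated and nohz_full CPU lists under /sys/devices/system/cpu, which needs a kernel built with CONFIG_NO_HZ_FULL for the latter. The helper names cpu_in_list and check_isolation are made up for this example.

```shell
# Succeeds if CPU $1 appears in the kernel-style CPU list $2 (e.g. "1-3,47").
cpu_in_list() {
    echo "$2" | awk -F, -v cpu="$1" '{
        for (i = 1; i <= NF; i++) {
            # Skip anything that is not "N" or "N-M" (empty file, "(null)", ...).
            if ($i !~ /^[0-9]+(-[0-9]+)?$/) continue
            n = split($i, r, "-")
            lo = r[1] + 0
            hi = (n > 1 ? r[2] : r[1]) + 0
            if (cpu + 0 >= lo && cpu + 0 <= hi) found = 1
        }
    } END { exit !found }'
}

# Reports whether CPU $1 is in the isolated and nohz_full lists.
# $2 overrides the sysfs directory (assumed default shown below).
check_isolation() {
    cpu=$1
    sysfs=${2:-/sys/devices/system/cpu}
    for f in isolated nohz_full; do
        if cpu_in_list "$cpu" "$(cat "$sysfs/$f" 2>/dev/null)"; then
            echo "$f: CPU $cpu is listed"
        else
            echo "$f: CPU $cpu is NOT listed"
        fi
    done
}
```

Running `check_isolation 47` should report the CPU as listed in both files if the command line above took effect; `cat /proc/cmdline` shows what the kernel was actually booted with.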

Once you’ve booted with these parameters (substitute your desired CPU list for 47) you’ll need to set up a cpuset cgroup to run your task in and make sure that no other tasks accidentally run on your dedicated CPUs. cset is definitely my favourite tool for doing this because it makes it so easy:
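A minimal sketch of that setup, assuming the shield subcommand of cset is what's wanted here: it creates a system cpuset for everything else and a user cpuset that owns the dedicated CPUs. The wrapper names shield_create and shield_destroy are hypothetical, and the commands need root.

```shell
# Reserve the CPUs in $1 (e.g. "47") for the "user" cpuset and push
# everything else, including movable kernel threads, into "system".
shield_create() {
    cset shield --cpu "$1" --kthread=on
}

# Tear the shield down again and return all CPUs to general use.
shield_destroy() {
    cset shield --reset
}
```

With the command line from earlier this would be `shield_create 47`.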

Now all you need to do is add the PID of your task to the new user cpuset and you’re good to go.
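As a rough sketch, moving an already-running task could look like the following. It assumes the user cpuset was created with cset shield, and the helper names are made up for this example. Reading Cpus_allowed_list back from /proc is just one way to confirm the move worked.

```shell
# Move the task with PID $1 into the shielded "user" cpuset (needs root).
shield_add_pid() {
    cset shield --shield --pid "$1"
}

# Print the CPUs that the task with PID $1 is currently allowed to run on.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}
```

After `shield_add_pid "$pid"`, `allowed_cpus "$pid"` should print only the dedicated CPU list, 47 in this post's setup.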

Of course, it’s all well and good me saying that these options isolate your tasks, but how can you know for sure? Fortunately, Linux’s tracing facilities make this super simple to verify and you can use ftrace to calculate when your workload is running in userspace by watching for when it’s not inside the kernel – in other words, by watching for when your workload returns from a system call, page fault, exception, or interrupt.

Say we want to run the following super-sophisticated workload without it entering the kernel:
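A busy-loop built only from shell builtins is enough for this, since it never needs to make a system call once it's spinning. The sketch below wraps one in a function (the name busy_loop is hypothetical) so that it can be started in the background and killed later.

```shell
# Spin forever using only the ":" builtin; nothing in the loop body
# needs the kernel's help.
busy_loop() {
    while :; do :; done
}
```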

Here’s a sequence of steps – assuming you’ve already set up the user cpuset using cset – that enables ftrace, runs the workload for 30 seconds, and then dumps the kernel trace to a trace.txt file.
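One way those steps might be scripted is sketched below, with several assumptions: tracefs is mounted at /sys/kernel/tracing (older setups use /sys/kernel/debug/tracing), the function tracer is used, the isolated CPU is 47, and the buffer size is an arbitrary guess. The function names and the TRACEFS, CPU and BUFFER_KB variables are made up for this example, and everything needs root.

```shell
TRACEFS=${TRACEFS:-/sys/kernel/tracing}   # assumed tracefs mount point
CPU=${CPU:-47}                            # the isolated CPU

# Stop tracing, select the function tracer, grow the isolated CPU's
# buffer and clear any old trace data.
trace_prepare() {
    echo 0 > "$TRACEFS/tracing_on"
    echo function > "$TRACEFS/current_tracer"
    echo "${BUFFER_KB:-100000}" > "$TRACEFS/per_cpu/cpu$CPU/buffer_size_kb"
    : > "$TRACEFS/trace"
}

# Run a bash busy-loop in the "user" cpuset for $1 seconds with tracing
# enabled, then save the isolated CPU's trace to $2 (default trace.txt).
trace_workload() {
    secs=$1
    echo 1 > "$TRACEFS/tracing_on"
    bash -c 'while :; do :; done' &
    pid=$!
    cset shield --shield --pid "$pid"
    sleep "$secs"
    kill "$pid"
    wait "$pid" 2>/dev/null || true
    echo 0 > "$TRACEFS/tracing_on"
    cat "$TRACEFS/per_cpu/cpu$CPU/trace" > "${2:-trace.txt}"
}
```

Calling `trace_prepare` followed by `trace_workload 30` covers the whole sequence. Tracing is switched on before the task is moved so that its arrival on the isolated CPU shows up in the trace.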

The contents of your trace.txt file should look something like this:

You want to make sure that you didn’t lose any events by checking that the two values in the entries-in-buffer/entries-written field are the same. If they’re not the same you can further increase the buffer size by writing to tracing/per_cpu/<cpu>/buffer_size_kb.
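That check is easy to automate. The sketch below assumes the usual ftrace header line of the form `# entries-in-buffer/entries-written: N/M   #P:K`. The function name trace_is_complete is made up for this example.

```shell
# Succeeds if the trace file $1 reports as many entries in the buffer
# as were written, i.e. no events were dropped.
trace_is_complete() {
    awk '
        /entries-in-buffer\/entries-written:/ {
            # $3 looks like "205/205": in-buffer count, then written count.
            split($3, n, "/")
            print "in buffer: " n[1] ", written: " n[2]
            found = 1
            complete = (n[1] == n[2])
            exit
        }
        END { exit !(found && complete) }
    ' "$1"
}
```

For example, `trace_is_complete trace.txt || echo "events were lost, increase buffer_size_kb"`.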

The key part of the trace file is the finish_task_switch tracepoint which tells you when a context switch completed. You can use this tracepoint to find when your bash process starts running and when it finishes – hopefully after 30 seconds have elapsed – with a bit of awk magic:
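The sketch below is not the original script but one possible take on the same idea. It assumes function-tracer output where each line has the form `TASK-PID [CPU] flags TIMESTAMP: function`, and it reports the longest gap between a kernel function traced while bash was current and the next traced function on that CPU. With a clean run that gap should be close to the full 30 seconds. The function name longest_quiet_gap is made up for this example.

```shell
# Print the longest stretch (in seconds) during which task $2 (default
# "bash") was current and no kernel function was traced in file $1.
longest_quiet_gap() {
    awk -v comm="${2:-bash}" '
        /^#/ { next }                       # skip the trace header
        {
            # The timestamp is the first field that looks like "123.456789:".
            ts = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^[0-9]+\.[0-9]+:$/) { ts = $i; sub(/:$/, "", ts); break }
            if (ts == "") next
            # A gap counts only if the previous entry belonged to our task.
            if (have_prev && prev_ours && ts - prev > max) max = ts - prev
            prev = ts
            prev_ours = (index($1, comm "-") == 1)
            have_prev = 1
        }
        END { printf "%.6f\n", max }
    ' "$1"
}
```

Running `longest_quiet_gap trace.txt` prints a single number. Anything much shorter than the run time means the kernel interrupted the task somewhere along the way, and the trace entries around that point show what did it.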

I’ve successfully used this technique to verify that I can run a bash busy-loop for an hour without entering the kernel.