Linux v4.11 was released this week and it includes my patches that add new diagnostic checks to the scheduler. There was quite a bit of fallout when they were merged, and even Linus triggered one of the asserts.
Dramatic start or not, the new checks are useful for detecting a long-standing problem in the kernel; missing updates to the scheduler’s runqueue clocks.
Fundamentally the Linux kernel scheduler cares about three things: what to run, where to run and when. That is to say, it cares about tasks (what), CPUs (where) and time (when).
To know when to run tasks, the scheduler needs an accurate view of time. It needs to know:
- How long a task has been running and if it needs preempting
- Which task ran least recently and should be scheduled next
- If it’s time to do housekeeping work like load balancing tasks
- Average idle duration of CPUs for housekeeping cost estimation
Continuously tracking the current time would be prohibitively expensive, so every scheduler runqueue maintains its own software clock that is periodically synchronised with hardware. Basically, the runqueue clock holds a copy of the value read from a hardware clock. It’s the timestamp of when the runqueue clock was updated.
Exactly when the runqueue clock gets updated varies from code path to
code path. There are many calls to
update_rq_clock() (I count 35 in
Linus’ tree), and it’s extremely easy to forget to add one. And
therein lies the problem…
Missing clock updates
It turned out that there were a bunch of places inside the kernel that didn’t correctly update the runqueue clocks before doing important operations. This went unnoticed because nothing checked that the clocks had been updated.
One example was where updating a task’s load average when moving it between cgroups was using an incorrect view of time, another was how the deadline scheduler read the potential stale clock when calculating a task’s deadline after migrating to a new CPU.
Missing updates don’t necessarily cause crashes, and which is why not checking for updates is so dangerous; things just silently keep working in spite of the logic errors.
The efi.flags example
This isn’t the first example of implicit API agreements going
unverified inside the kernel. I accidentally wrote one. When
efi.flags code was
it was intended to be a way to query the features of a machine’s EFI
I audited all users of
efi.flags when I added the wrapper to ensure
that flags were being set correctly. But as new users were added to
the kernel bugs were introduced where flags were unset even though they
And this happened despite me reviewing all changes to the EFI subsystem. It’s the kind of issue that is best detected by the internals of the API at runtime. With assertions.
Assertions are contracts
As with any large open source project, patches get applied, code changes and while everything might have worked at one point in time, that is unlikely to be true forever.
Asserts, or something similar, should have been introduced at the start for both examples above. Asserts keep your code correct even as it evolves. Baking the expected state into your API checks makes this contract between API provider and user concrete.
And remember The Golden Rule of using assertions in the Linux kernel: Don’t use BUG_ON.