Why you should tell Qemu about your L3 cache

OK, so I was trying to compare the performance of a database workload on bare metal and on KVM when I noticed that the KVM guest didn't know about the processor's L3 cache. There are a few ways to figure that out; I discovered it while staring at the output of lscpu.
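
Besides reading lscpu output, you can check from inside the guest by looking at the cache levels the kernel exposes in sysfs. The helper below is only a sketch: the function name has_l3_cache and its optional directory argument are made up for illustration, and it assumes the usual /sys/devices/system/cpu/cpuN/cache/indexM/level layout.

```shell
# has_l3_cache [CACHE_DIR]
# Succeeds if any cache index under CACHE_DIR reports level 3.
# CACHE_DIR defaults to cpu0's sysfs cache directory (assumed layout).
has_l3_cache() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cache}
    for level in "$dir"/index*/level; do
        [ -r "$level" ] || continue
        if [ "$(cat "$level")" = "3" ]; then
            return 0
        fi
    done
    return 1
}

if has_l3_cache; then
    echo "L3 cache visible"
else
    echo "no L3 cache visible"
fi
```

Run it once on the host and once in the guest; if only the host reports an L3, the guest is missing that part of the topology.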

The machine I was running on was an IvyBridge NUMA machine with 4 sockets, each with 15 cores (hyperthreading enabled), for a total of 120 logical CPUs.

Not wanting to deal with NUMA issues straight away, I restricted the workload to the 30 CPUs on a single socket. Because all the CPUs are on the same socket, they all share an L3 cache.

I googled for ways to make the L3 cache visible in the guest and came across this post on serverfault.com, in which the author asks whether it matters if the guest can see the details of the L3 cache or not. Here's the reply:

I don't think it should matter.

The host makes this data available to the guest, via a virtual CPU/Core. I can imagine that the host can provide the guest with arbitrary values without really affecting performance that much, since it's the host that ultimately determines performance anyway.

On the other hand, if KVM does bare metal virtualisation, maybe the cache levels reported by the guest represents a direct correlation with the real CPU, since the guest has direct access to the hardware CPU. Thus installing a better CPU will give better performance in the guest.

Sorry internet, but you're wrong.

The CPU topology, which is the physical layout of CPUs, caches and interconnects, matters very, very much to the Linux kernel scheduler.

Internally, the scheduler builds a map of the topology using struct sched_domains (these are NOT directly related to NUMA domains) so that it knows which CPUs share caches, which are on the same socket, etc. You can see how they're laid out by looking in /proc/sys/kernel/sched_domain/cpu*/.
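
To get a quick overview of that directory, something like the following can print one line per domain. It's a sketch under a couple of assumptions: the function name list_sched_domains is hypothetical, each domain directory is assumed to contain "name" and "flags" files, and the base directory is a parameter because the location of this information may differ between kernel versions and configurations.

```shell
# list_sched_domains [BASE_DIR]
# Prints "cpuN/domainM NAME FLAGS" for every domain found under BASE_DIR.
list_sched_domains() {
    base=${1:-/proc/sys/kernel/sched_domain}
    for d in "$base"/cpu*/domain*; do
        [ -d "$d" ] || continue
        # Fall back to "?" if a file is missing or unreadable.
        name=$(cat "$d/name" 2>/dev/null || echo "?")
        flags=$(cat "$d/flags" 2>/dev/null || echo "?")
        echo "${d#"$base"/} $name $flags"
    done
}

list_sched_domains
```

If the directory doesn't exist on your kernel, the function simply prints nothing.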

Having this information is important for doing things like load balancing tasks, pulling tasks to the same CPU/socket when doing wakeups (known as wake affinity), and deciding on the most efficient way to enqueue tasks on CPU runqueues.

That last item is exactly why it matters whether the guest kernel knows that your CPUs share an L3 cache.

If task "foo" on CPU A tries to enqueue task "bar" on CPU D (say, because it's sending a signal) and CPU A and D share an L3, then the enqueue operation is fairly cheap - task "foo" accesses CPU D's runqueue data structure directly.

However, if CPUs A and D do not share a cache then an inter-processor interrupt (IPI) is sent to CPU D. Sending an IPI can be a pretty expensive operation in general, but more so when you're running in a guest since it requires the guest to exit into the host (via a VMEXIT) to handle the IPI.

That's a lot of work simply to put a task on a runqueue. Worse, it's unnecessary work if CPU A and D physically share an L3.
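
One rough way to see whether this is happening is to watch the IPI counters while the workload runs. The sketch below assumes that these wakeup IPIs are counted on the "RES:" (rescheduling interrupts) line of /proc/interrupts, which is how x86 Linux normally reports them; the function name resched_ipis is made up for illustration.

```shell
# resched_ipis [FILE]
# Sums the per-CPU counters on the "RES:" line of an interrupts file.
# FILE defaults to /proc/interrupts.
resched_ipis() {
    awk '$1 == "RES:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { printf "%.0f\n", sum }' "${1:-/proc/interrupts}"
}

# Sample the counter twice, one second apart, if the file is available.
if [ -r /proc/interrupts ]; then
    before=$(resched_ipis)
    sleep 1
    after=$(resched_ipis)
    echo "rescheduling IPIs in the last second: $((after - before))"
fi
```

Comparing the rate in the guest with and without a visible L3 gives a first hint, though the counter also includes rescheduling IPIs sent for other reasons.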

How do you tell Qemu about the L3?

There are two ways to fix this issue. You can either modify the sched_domain flags directly and set the SD_SHARE_PKG_RESOURCES flag, which is hacky and might not work across kernel versions, or you can use the "virtual L3-cache" option from newer versions of Qemu.

Modifying sched_domain flags

You need to set the SD_SHARE_PKG_RESOURCES flag for all CPUs that share the L3. This is as simple as looking at the existing flags value for the upper-most domain and setting the 0x200 bit, e.g.

    cat /proc/sys/kernel/sched_domain/cpu0/domain1/flags

Assuming it's something like 0x102f, you can do,

    # 0x122f is 0x102f with the 0x200 bit (SD_SHARE_PKG_RESOURCES) set; run as root
    for flags in /proc/sys/kernel/sched_domain/cpu*/domain1/flags; do
        echo 0x122f > "$flags"
    done
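
If the existing value differs between CPUs, or you'd rather not work out the hex by hand, the new value can be computed instead of hardcoded. This is a small sketch; the function name with_share_pkg_resources is hypothetical, and the 0x200 bit is the value used above, which may not match other kernel versions.

```shell
# with_share_pkg_resources VALUE
# Prints VALUE with the 0x200 bit set, in hex.
# VALUE may be decimal or 0x-prefixed hex.
with_share_pkg_resources() {
    printf '0x%x\n' "$(( $1 | 0x200 ))"
}

with_share_pkg_resources 0x102f

# Applying it (as root) would look something like:
# for flags in /proc/sys/kernel/sched_domain/cpu*/domain1/flags; do
#     cur=$(cat "$flags")
#     with_share_pkg_resources "$cur" > "$flags"
# done
```

Because the bit is OR-ed in rather than added, running it twice on the same value is harmless.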

Note, lying to the kernel and setting this flag when the virtual CPUs do not, in fact, share an L3, is unlikely to give you a performance improvement.

Using Qemu's l3-cache option

If you're using Qemu v2.8.0-rc0 or newer you can pass "l3-cache=on" as a property of the -cpu option (for example "-cpu host,l3-cache=on") to create a virtual L3 cache. The patch that added support for this actually contains a really nice description of the performance wins when turning the virtual L3 cache on.
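
For the single-socket setup described earlier (15 cores, 2 threads each), the relevant arguments might be put together like this. Treat it as a sketch: the helper name qemu_topology_args is made up, "-cpu host" is my choice of CPU model rather than anything the patch requires, and the -smp values simply mirror one socket of the machine above.

```shell
# qemu_topology_args CORES THREADS
# Prints -cpu/-smp arguments describing one socket with CORES cores and
# THREADS threads per core, with the virtual L3 cache enabled.
qemu_topology_args() {
    cores=$1
    threads=$2
    printf '%s\n' "-cpu host,l3-cache=on -smp $((cores * threads)),sockets=1,cores=$cores,threads=$threads"
}

qemu_topology_args 15 2
```

The printed string can then be spliced into your usual command line, e.g. after "qemu-system-x86_64 -enable-kvm", alongside whatever memory and disk options you already use.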

In general, you should try to make the guest completely aware of the physical topology of the machine it's running on, whenever possible - the Linux kernel contains lots of optimisations like the one for enqueueing tasks.