LMBench versus GCC Optimisations
A quality benchmark is authoritative and trustworthy, and when you’re using one it’s a bit like playing the card game Snap: the rules are easy, and when the game is over it’s obvious who won.
But a poor benchmark makes performance work more like trying to solve a twisted version of the Knights and Knaves riddle where you’re not sure if the answers you’re getting are truths or lies, no one ever wins, and you only stop playing because you’re exhausted.
LMBench definitely has that riddle vibe.
I just don’t trust the test results that it spits out because I’ve run into too many dead ends when investigating performance issues that turned out to be false positives. And if there’s one thing that you need to be sure of when measuring performance, it’s the accuracy of your results.
So I was less than convinced when I recently saw that the int64-mul subtest of the LMBench ops microbenchmark was taking between 10% and 20% longer to run with a new enterprise kernel.
With my suspicions suitably heightened, I started reading the source code to understand exactly what the test was doing.
The int64-mul subtest tests the CPU speed of 64-bit integer multiplication. Here’s an edited version:
Seeing the register keyword always sets alarm bells ringing for me. Not because it has no purpose – you can use it to disallow applying the unary address-of operator to a variable, which lets the compiler optimise accesses to that variable – but because it usually indicates that the benchmark has been written with a specific compiler implementation, or version, in mind. LMBench was released in 1996, which would have made GCC 2.7 the current version.
Using the register keyword may have helped old compilers optimise access to variables by allocating registers for them, but modern compilers ignore register when making register allocation decisions.
Before doing anything else, I wanted to verify that the compiler was actually emitting those 64-bit multiplication operations on lines 17-21 above, so I disassembled do_int64_mul():
```
00000000004004cb <do_int64_mul>:
  4004cb:  89 f2                  mov    %esi,%edx
  4004cd:  8d 46 06               lea    0x6(%rsi),%eax
  4004d0:  48 c1 e0 20            shl    $0x20,%rax
  4004d4:  48 8d 84 02 2c 92 00   lea    0x922c(%rdx,%rax,1),%rax
  4004db:  00
  4004dc:  83 ef 01               sub    $0x1,%edi
  4004df:  83 ff ff               cmp    $0xffffffff,%edi
  4004e2:  75 f8                  jne    4004dc <do_int64_mul+0x11>
  4004e4:  89 c7                  mov    %eax,%edi
  4004e6:  e8 cb ff ff ff         callq  4004b6 <use_int>
  4004eb:  f3 c3                  repz retq
```
Nope. There’s a complete lack of 64-bit multiplication anywhere in there. As far as the compiler is concerned, the following C code is equivalent to LMBench’s do_int64_mul():
That makes the test useless, because GCC optimised it away.
Why did GCC optimise out the test?
GCC could tell exactly how many times it needed to add all of those 64-bit constants together, and it used techniques like constant folding and constant propagation to calculate the end value at compile time instead of at runtime.
While investigating this issue, I discovered why GCC didn’t throw away the useless loop on lines 8-9: LMBench builds with the -O switch, which doesn’t enable the necessary optimisation flag. Here’s the full list of optimisations and the level at which each one is enabled.
This is the problem with microbenchmarks that assume a specific toolchain version or implementation: upgrading the toolchain can break them without you realising. Writing the inner loops in inline assembly would have prevented the compiler from eliminating them, but the authors wrote them in C because they wanted LMBench to be portable.
Tests like int64-mul are so low-level that I’ve heard them referred to as nanobenchmarks; they are notoriously easy to misuse and misunderstand. Here’s Aleksey Shipilëv, the renowned JVM performance expert, showing how to use them with JMH, a benchmark harness:
Meanwhile, this is yet another example how you should approach nanobenchmarks (and #JMH makes it convenient enough): http://t.co/vw0jVt8x0d
— Aleksey Shipilëv (@shipilev) July 17, 2015
Is it time to retire LMBench?
As much as I distrust LMBench, I actually plan to keep using it. Why?
Because it has some other subtests that are useful, like the fork() microbenchmark test, which detected the overhead of the vm_ops->map_pages() API when it was introduced.
But the CPU ops subtest? No, that nanobenchmark definitely needs to go in the trash.