Early x86 Linux boot debug tricks

Diagnosing problems in the early assembly code of the x86 Linux kernel is just about the most cumbersome debugging task the kernel has to offer. There are still no good methods for debugging very early boot issues. For problems that occur before the serial console and EFI framebuffer are initialized, the only option is to force your machine to reboot or hang at strategic locations in the kernel and home in on the root cause.

If your machine is misbehaving during boot, there are two symptoms you are likely to see: either the machine unexpectedly hangs, or it unexpectedly resets.

Below are some tricks that I rely on every time someone comes to me with an early EFI booting problem. I've also used them when writing the EFI boot stub and EFI mixed mode kernel patches. These techniques are not pretty, but they get the job done when you're out of other options.

Debugging a mysterious reset with an infinite loop

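The strategy to employ in this scenario is to force your machine to hang at a chosen point. If it hangs, execution got that far without resetting; if it still resets, the fault lies earlier. Debugging a reset is made slightly easier because the same trick works on every code path.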

The usual idiom is this,

1:
	hlt
	jmp 1b

which halts the CPU when execution reaches it; the jmp sends the CPU straight back to hlt if an interrupt wakes it up. For example, assume that there's a bug in the EFI boot stub such that hdr.code32_start isn't initialized correctly. The buggy code would look like this (modified from the original),

        call efi_main
        movq %rax,%rsi

        movl BP_code32_start(%rsi), %eax
        leaq preferred_addr(%rax), %rax
        jmp *%rax

preferred_addr:

Assuming we're jumping through an invalid pointer in %rax, executing the jmp instruction will cause a reset. But suppose we didn't know that already. Instead, we'd have to gradually modify the code as we got closer and closer to the root of the issue. The first time it might look like this,

        call efi_main
1:
        hlt
        jmp 1b

        movq %rax,%rsi

        movl BP_code32_start(%rsi), %eax
        leaq preferred_addr(%rax), %rax
        jmp *%rax

preferred_addr:

And the machine hangs. OK, good. We know everything up to and including efi_main() is working fine. We'll quickly realise that a modification like this brings us back to the resetting problem,

        call efi_main
        movq %rax,%rsi

        movl BP_code32_start(%rsi), %eax
        leaq preferred_addr(%rax), %rax
        jmp *%rax

preferred_addr:
1:
        hlt
        jmp 1b

Bingo. The machine resets before it ever reaches the loop at preferred_addr, so the jmp is going somewhere else: the problem is a bogus %rax value. Of course, things get substantially easier once you get to the C code in x86_64_start_kernel() and can use the old reliable idiom,

        while (1);
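
The narrowing process used above, moving the hang loop and watching whether the machine hangs or resets, is effectively a bisection. The following is only a sketch of that reasoning as a user-space simulation, not kernel code: the boot path is modelled as numbered steps, and boot_with_hang_at() and find_faulty_step() are hypothetical names invented for illustration.

```c
/*
 * Hypothetical model: boot is a sequence of nsteps steps, numbered from 0,
 * and the machine resets while executing step `faulty`. Inserting a hang
 * loop before step `pos` means steps 0..pos-1 run before the hang.
 */
enum outcome { HANGS, RESETS };

/* Simulate one boot with a hang loop inserted before step `pos`. */
static enum outcome boot_with_hang_at(int pos, int faulty)
{
	/* The hang loop is reached only if the faulty step has not run yet. */
	return (faulty < pos) ? RESETS : HANGS;
}

/*
 * Bisect the hang placement to find the step that triggers the reset.
 * Stores the number of simulated boots in *boots.
 */
static int find_faulty_step(int nsteps, int faulty, int *boots)
{
	int lo = 0;       /* a hang before step lo is always reached */
	int hi = nsteps;  /* a hang after the last step is never reached */

	*boots = 0;
	while (hi - lo > 1) {
		int mid = lo + (hi - lo) / 2;

		(*boots)++;
		if (boot_with_hang_at(mid, faulty) == HANGS)
			lo = mid;  /* everything before mid is fine */
		else
			hi = mid;  /* the fault is somewhere before mid */
	}
	return lo;
}
```

With 8 steps and the fault in step 5, find_faulty_step() pins it down in 3 simulated boots. On real hardware each "boot" is a rebuild and reboot of the kernel, which is why halving the search range each time is worth the bookkeeping.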

Debugging a hang by triggering a reset