A large chunk of my day is spent sifting through my inbox -- mail, and especially patches, drive every task I work on. I use mutt to read all my mail, and its mark-thread-read functionality is a godsend for mailing lists. It allows me to quickly process unread mail and ignore those threads that I don't have any interest in.

But I've got a confession: I don't actually open the mail folder containing linux-kernel mailing list emails anymore. I did years ago, but since LKML started receiving over 400 emails a day (my folder is 6.4GB), I've given up.

There are two reasons:

  1. Opening the folder takes over a minute
  2. Mutt locks up when the LKML folder is open and new mail arrives

Instead, over time, I've constructed a bunch of Gmail filters to siphon off those LKML mails that I actually want to read. And that works OK, but I still miss interesting threads for which I have no prior filters configured. That bugs me. I finally decided to do something about it and dig into why mutt was performing so poorly.

I'm only going to discuss item 1 above in this post. In part 2, I'll investigate the lockup. In addition, this post is restricted to tuning mutt's performance via config options, which is basically the lowest-hanging fruit.

It turns out that a lot of the tunings discussed below are described in mutt's documentation. I didn't read that documentation beforehand, and besides, the following techniques demonstrate how to tackle performance issues from first principles. The fact that they turned up items that were discussed in the official mutt documentation goes to show how well they work!

A reproduction case

First, we need a simple way to reproduce the issue. Handily, mutt provides a way to open a maildir folder on startup (you do need to hit the 'q' key once mutt is running),

    $ time -p mutt -f ~/Maildir/.lkml
    real 84.88
    user 31.56
    sys 19.07

Our baseline for opening the LKML folder is around 84 seconds. Armed with this, we can dive into mutt's performance.

Checking System-wide Health

It's a good idea to look at the state of the entire system before delving into application-specific issues. For example, if the system were swapping (and that would be bad), it would show up in the system-wide statistics.

Running `vmstat 1` while opening the LKML folder shows that cache is slowly increasing,

    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b swpd    free  buff   cache   si   so    bi    bo   in    cs us sy  id wa st
     1  0    0 4761080  2316 1994984    0    0 55392  3268 7571 30468  9  7  72 13  0
     1  0    0 4669444  2316 2064456    0    0 61876     0 8106 32759  9  8  72 12  0
     1  0    0 4561688  2316 2146792    0    0 73448     0 9334 38210 10  7  73 10  0
     0  1    0 4478980  2316 2213772    0    0 59488  3264 7925 32041  7  8  72 12  0
     0  1    0 4382508  2316 2285936    0    0 63856     0 8956 34934 10  7  72 11  0
     2  1    0 4297220  2316 2349256    0    0 56128 23504 9099 37290 14 10  63 13  0

And once we're done we see,

    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b swpd    free  buff   cache   si   so    bi    bo   in    cs us sy  id wa st
     0  0    0   60472   324 5035292    0    0     0     0   87   222  0  0 100  0  0
     0  0    0   60472   324 5035288    0    0     0     0   81   185  0  0 100  0  0
     0  0    0   60472   324 5035288    0    0     0     0   88   210  0  0 100  0  0
     0  0    0   60472   324 5035288    0    0     0     0   80   186  0  0 100  0  0

That's only ~60MB in the free column; the majority of memory is now sitting in the page cache. And while the folder is being opened, the blocks-in (bi) and blocks-out (bo) columns show the system reading from (and occasionally writing to) disk, which is exactly what you'd expect when opening a mail folder.

From a system health perspective, this looks fine. We're not swapping, we're not trying to run loads of tasks -- we're not seeing any kind of saturation of resources (CPU, memory, etc).

Breaking things down at the application level

Using `pidstat -C mutt 1` shows that roughly half of the time is spent on the CPU and the rest spent waiting for disk I/O (here's a sample from the middle of a trace):

    21:53:07      UID       PID    %usr %system  %guest    %CPU   CPU  Command
    21:53:08     1000      4475   31.00   22.00    0.00   53.00     0  mutt

Note that things are simplified enormously when parsing the %CPU column because mutt isn't a threaded application.

The mean CPU usage from start to finish is 52.7%, which means mutt spent roughly half of its time on-CPU and half off-CPU. The time spent off-CPU was probably waiting for I/O to complete, given the above vmstat output.
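We can compute that mean straight from the pidstat output. Here's a rough sketch, assuming the column layout shown above, where %CPU is the 7th field (newer sysstat releases add a %wait column, which shifts the field numbers):

    $ pidstat -C mutt 1 > pidstat.log
    $ awk '$NF == "mutt" && $7 ~ /^[0-9.]+$/ { sum += $7; n++ }
           END { if (n) printf "mean %%CPU: %.2f over %d samples\n", sum / n, n }' pidstat.log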

But we don't need to guess: we can confirm it by summing the delays reported by the sched_stat_iowait tracepoint. A typical entry looks like this,

    kworker/u8:5-7166  [001] d..4  78.852761: sched_stat_iowait: comm=mutt pid=7200 delay=804910 [ns]
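The trace file has to come from somewhere; one way to capture it (an assumption on my part, not necessarily how it was gathered here) is via the ftrace interface, as root, while reproducing the slow folder open:

    # Assumes tracefs is mounted at /sys/kernel/debug/tracing and CONFIG_SCHEDSTATS=y;
    # on newer kernels you may also need: echo 1 > /proc/sys/kernel/sched_schedstats
    echo 1 > /sys/kernel/debug/tracing/events/sched/sched_stat_iowait/enable
    echo 1 > /sys/kernel/debug/tracing/tracing_on

    mutt -f ~/Maildir/.lkml      # reproduce the slow open, then quit with 'q'

    echo 0 > /sys/kernel/debug/tracing/tracing_on
    cat /sys/kernel/debug/tracing/trace > trace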

Using a bit of sed, grep and awk we can sum these delays and print the total delay in seconds,

    grep -E "iowait.*mutt" trace  | sed -e 's/.*delay=//' | \
awk 'BEGIN{s=0}{s += $1} END{printf "%.2fs\n", s/1000000000}'

That comes to 32.62s, which is only 38% of the 84-second runtime, whereas the pidstat numbers imply mutt was off-CPU for roughly 47% of it (about 40 seconds). The remaining off-CPU time was probably spent sleeping on mutexes, waitqueues, and the like in the kernel.

Clearly, we need to figure out what mutt is doing when it's on-CPU executing instructions, since everything other than I/O wait accounts for the majority (62%) of the runtime, and most of that is on-CPU time.

Using perf record we can get that info. Here are all the functions that took more than 1% of the runtime,

    $ perf report -g fractal --no-children

    +    5.07%  mutt  libc-2.24.so       [.] __strchr_sse2
    +    3.23%  mutt  libc-2.24.so       [.] malloc_consolidate
    +    3.15%  mutt  [kernel.kallsyms]  [k] _xfs_buf_find
    +    3.09%  mutt  libc-2.24.so       [.] _int_malloc
    +    2.78%  mutt  libc-2.24.so       [.] __strspn_sse42
    +    1.46%  mutt  libc-2.24.so       [.] _int_free
    +    1.43%  mutt  libc-2.24.so       [.] strlen
    +    1.32%  mutt  [kernel.kallsyms]  [k] kmem_cache_alloc
    +    1.17%  mutt  [kernel.kallsyms]  [k] entry_SYSCALL_64
    +    1.10%  mutt  mutt               [.] strchr@plt
    +    1.06%  mutt  libc-2.24.so       [.] _IO_getline_info

Mutt is doing a lot of string operations and quite a bit of malloc(3) work, and expanding the _xfs_buf_find() call stack showed that it's hit mainly from the stat(2) syscall path.
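For reference, profiles like these can be gathered with something along the following lines. The perf invocations mirror the report command above; the configure line is an assumption about how mutt's autoconf build is driven (it's one way to do the frame-pointer rebuild mentioned below):

    # Rebuild mutt with frame pointers so perf can unwind mutt's own call stacks
    # (assumed autoconf-style build; adjust paths and options for your tree).
    $ CFLAGS="-g -O2 -fno-omit-frame-pointer" ./configure && make

    # Profile a single folder open, then inspect the hottest functions.
    $ perf record -g -- ./mutt -f ~/Maildir/.lkml
    $ perf report -g fractal --no-children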

Turning on the --children option and rebuilding mutt with frame pointers (-fno-omit-frame-pointer) gives us more mutt symbols in the profile,

    Children      Self  Command  Shared Object       Symbol
      36.85%     0.32%  mutt     [kernel.kallsyms]   [k] entry_SYSCALL_64_fastpath
      36.20%     0.00%  mutt     mutt                [.] main
      36.20%     0.00%  mutt     libc-2.24.so        [.] __libc_start_main
      36.20%     0.00%  mutt     [unknown]           [k] 0x26c6258d4c544155
      34.98%     0.00%  mutt     mutt                [.] mx_open_mailbox
      34.95%     0.00%  mutt     mutt                [.] mh_read_dir
      34.95%     0.00%  mutt     mutt                [.] maildir_read_dir
      33.72%     0.22%  mutt     mutt                [.] maildir_delayed_parsing
      29.60%     0.62%  mutt     mutt                [.] mutt_read_rfc822_header
      20.91%     0.68%  mutt     mutt                [.] mutt_parse_rfc822_line
      17.62%     0.03%  mutt     libc-2.24.so        [.] __GI___libc_open
      17.49%     0.01%  mutt     [kernel.kallsyms]   [k] sys_open

The majority of the time is spent in the following call path,

    -   36.20%     0.00%  mutt     mutt                [.] main
       - main
          - 96.64% mx_open_mailbox
             - 99.90% maildir_read_dir
                - mh_read_dir
                   - 96.47% maildir_delayed_parsing
                      + 88.32% mutt_read_rfc822_header

Reading the mutt source code, it looked like maildir_delayed_parsing() opens *every* file in a maildir and reads the header contents. Given that my LKML directory is 6.4GB, that's gotta be expensive. Plus, it seemed a strange kind of thing to do every time I open a mail folder, since the headers shouldn't be changing. It would make far more sense to cache those headers somewhere rather than reading and parsing them every time.

With that in mind, while reading the mutt source I noticed that there are references to an "hcache" or "header cache" in maildir_delayed_parsing(). That sounded exactly like something that would help improve the speed of opening the LKML folder. Sure enough, there's a header_cache config variable in my muttrc that was commented out and therefore disabled. Uncommenting it showed,

    Error in /home/matt/.muttrc, line 1606: header_cache: unknown variable
    source: errors in /home/matt/.muttrc
    Press any key to continue...

Rebuilding mutt with header cache support and turning on the header_cache config option speeds things up about 3.6x, from 79.48s to 22.04s!
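For reference, enabling the header cache involves both a build-time switch and a muttrc setting. The configure option below matches mutt 1.x's autoconf script but is worth double-checking against ./configure --help, and the cache path is just an example:

    # Build-time: compile in header cache support.
    $ ./configure --enable-hcache && make

    # Run-time, in ~/.muttrc: point the cache at a directory (or file) of your choosing.
    set header_cache="~/.cache/mutt"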

Mutt's header cache

At this point, I'm wondering whether we can do better by using a different database for the hcache. There are four available: tokyocabinet, qdbm, gdbm and Berkeley4 (the default).

I couldn't find a QDBM package for openSUSE, and since it's a SourceForge project that hasn't been updated since 2013, I didn't hold out much hope for it anyway.

Here are the choices of database and the size after initialisation, along with the time to open the LKML folder when the header cache is enabled. I'm ignoring the initialisation times because that's a one-time operation and frankly, I'd be willing to put up with waiting minutes to create the database if it shaved off a few seconds every time I opened the maildir folder.

    DB             Time (sec)   DB Size (MB)
    Berkeley4           22.04           1620
    tokyocabinet        24.04            239
    gdbm                22.03            855

Weighing speed against database size, gdbm comes out on top, though only marginally ahead of the default, Berkeley4, and at roughly half the size. tokyocabinet produces by far the smallest database, but there is clearly a speed cost for its compression.
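The backend is chosen at build time. The switches below are based on mutt 1.x's configure script and should be treated as assumptions; ./configure --help will tell you what your version actually supports:

    # Pick exactly one header cache backend when building mutt.
    $ ./configure --enable-hcache --with-bdb && make            # Berkeley DB (the default)
    $ ./configure --enable-hcache --with-gdbm && make           # gdbm
    $ ./configure --enable-hcache --with-tokyocabinet && make   # tokyocabinet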

Doing on-CPU vs off-CPU analysis again, we're now spending even *more* time doing CPU-intensive operations. The call stack from a perf record profile now looks quite different,

    -   23.05%     0.00%  mutt     mutt                [.] main
       - main
          - 80.51% mx_open_mailbox
             - 60.39% maildir_read_dir
                - mh_read_dir
                   - 77.50% maildir_delayed_parsing
                      - 52.07% mutt_hcache_restore
                         - 65.97% restore_address
                            + 32.93% _int_malloc
                            + 31.92% restore_char
                            + 15.55% __libc_calloc
                              11.21% safe_calloc
                               4.42% __memmove_avx_unaligned_erms
                               1.51% safe_malloc
                               1.16% memcpy@plt
                               1.14% calloc@plt
                         + 9.49% restore_char
                         + 5.65% restore_list
                         + 3.23% mutt_new_body
                           3.03% _int_malloc
                         + 2.69% mutt_free_header
                           1.72% __libc_calloc
                           1.67% safe_calloc
                           1.27% __memmove_avx_unaligned_erms
                           1.14% strlen
                         + 0.63% safe_strdup
                           0.63% safe_free

That's 80% of the time opening the maildir. Digging into maildir_delayed_parsing() shows that we're spending a lot of time just calling malloc(3).

Mutt's progress bar

Before going too far down the road of analysing malloc(3) and friends, a function called maildir_parse_dir() caught my eye.

    - 10.66% maildir_parse_dir
       + 49.14% mutt_progress_update
         9.14% _int_malloc
       + 8.81% maildir_parse_flags
         6.30% __libc_calloc
         5.18% strrchr
         4.38% safe_calloc
         2.83% __strncmp_sse42
       + 2.22% safe_strdup
         2.12% strlen
         1.22% _int_free
         1.22% calloc@plt
         1.11% __memmove_avx_unaligned_erms
         1.09% free
         0.73% free@plt
         0.72% mutt_strncmp
         0.60% mutt_clear_error.part.1
         0.51% safe_free

It looks as though nearly 50% of maildir_parse_dir()'s duration is spent updating the progress bar. Let's see if we can disable that in the config. Searching in the muttrc for the word "progress" eventually turns up the read_inc option, which controls how mutt displays progress updates when opening mailboxes,

    # set read_inc=10
    #
    # Name: read_inc
    # Type: number
    # Default: 10
    #
    #
    # If set to a value greater than 0, Mutt will display which message it
    # is currently on when reading a mailbox or when performing search actions
    # such as search and limit. The message is printed after
    # read_inc messages have been read or searched (e.g., if set to 25, Mutt will
    # print a message when it is at message 25, and then again when it gets
    # to message 50). This variable is meant to indicate progress when
    # reading or searching large mailboxes which may take some time.
    # When set to 0, only a single message will appear before the reading
    # the mailbox.
    #
    # Also see the ``$write_inc'' variable and the ``Tuning'' section of the
    # manual for performance considerations.

Disabling it by setting it to zero shaves about a second off the open time. We're now at 20.95s.
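In muttrc terms, that's a one-liner (the variable is the one documented in the excerpt above):

    set read_inc=0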

Well, that was an easy win.

Given that we've changed things, it's important to re-record the CPU profile. maildir_parse_dir() has moved further down the hottest function list.

    -   24.78%     0.00%  mutt     mutt                [.] main
       - main
          - 78.87% mx_open_mailbox
             - 54.76% maildir_read_dir
                - mh_read_dir
                   + 78.97% maildir_delayed_parsing
                   + 11.80% maildir_add_to_context
                   + 5.56% maildir_parse_dir
                   + 1.73% maildir_free_maildir
                   + 0.61% __xstat64
             + 45.24% mutt_sort_headers
          + 21.08% mutt_index_menu

I *still* don't want to delve into the malloc(3) calls yet, so let's look at maildir_add_to_context().

    - 11.80% maildir_add_to_context
       - 96.08% mx_update_context
          - 96.80% crypt_query
             - 92.48% mutt_is_application_pgp
                  60.52% ascii_strcasecmp
                - 39.48% mutt_get_parameter
                     ascii_strcasecmp
             + 4.30% mutt_is_multipart_signed
               1.62% ascii_strcasecmp
             + 1.07% mutt_is_multipart_encrypted
               0.54% mutt_get_parameter
            3.20% mutt_score_message
       + 2.42% mx_alloc_memory
         0.94% mutt_score_message

Hmm.. I don't actually use the crypto support. Let's see what difference it makes if I configure mutt without pgp or smime support. It turns out: not much. Opening the folder still takes about 20 seconds, so I might as well leave those config options enabled; we've clearly reached diminishing returns this far down the call path. The next choice is between maildir_read_dir() (54.76%) and mutt_sort_headers() (45.24%). The former is dominated by memory allocation and sorting the maildir files, the latter by sorting mutt's threads and hashing.

Let's investigate both before making a decision on which to drill down further.

We already know from our previous analysis that malloc(3) plays a large role in maildir_read_dir()'s call stack: mutt_hcache_restore() -> restore_envelope() -> restore_address() looks like where most of the CPU time is going.

restore_address() allocates memory to hold the rfc822 address from the mail file. Allocating these small objects for every file in the maildir incurs a huge overhead both in memory (every object has associated metadata) and CPU time. Most of the time, there's only one address to restore, so we could optimise that common case, and allocate more buffers if necessary. But the way that the rfc822 address code works means you need to read sizes from the db before allocating data. Not a huge amount of work to optimise, but more than I'd like to do right now.

Mutt's header cache verification

Going back up the call stack and re-reading maildir_delayed_parsing(), I noticed this snippet of code,

    #if USE_HCACHE
        if (option(OPTHCACHEVERIFY))
        {
            ret = stat(fn, &lastchanged);
        }
        else
        {
            lastchanged.st_mtime = 0;
            ret = 0;
        }

It turns out that a whole load of stat(2) calls were caused by having the maildir_header_cache_verify config option enabled. Disabling header cache verification brings the open time down to 8.9s!
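The fix is another muttrc one-liner:

    set maildir_header_cache_verify=no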

Mutt's strict threads

After this, nothing stood out as being obviously tunable. So I started looking at mutt_sort_headers() and I saw this,

    if (!option (OPTSTRICTTHREADS))
        pseudo_threads (ctx);

Which is described in the mutt config file as,

    # Name: strict_threads
    # Type: boolean
    # Default: no
    #
    #
    # If set, threading will only make use of the ``In-Reply-To'' and
    # ``References'' fields when you ``$sort'' by message threads. By
    # default, messages with the same subject are grouped together in
    # ``pseudo threads.''. This may not always be desirable, such as in a
    # personal mailbox where you might have several unrelated messages with
    # the subject ``hi'' which will get grouped together. See also
    # ``$sort_re'' for a less drastic way of controlling this
    # behaviour.

Calculating the hash of a pseudo thread's subject is quite expensive, especially if we don't really care about grouping poorly titled emails together. Disabling pseudo threads, by enabling strict_threads, reduces the open time to 8.49s.
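Again, a single muttrc line:

    set strict_threads=yes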

Tuning list

Here's the list of tunings that, together, got a roughly 10x improvement (84.88s down to 8.49s) in the time it takes mutt to open my LKML folder. The factor after each item is the incremental speedup from that change alone:

  1. Enable mutt's header cache, header_cache="path" (3.8x)
  2. Disable mutt's progress bar, read_inc=0 (1.05x)
  3. Disable header cache verification, maildir_header_cache_verify=no (2.4x)
  4. Only use strict threads, strict_threads=yes (1.05x)

I think that's a good place to stop (I'm out of useful config options anyway). I don't mind waiting 8 seconds or so for my LKML mail to appear, especially given that this was achievable with simple modifications to my mutt config, as summarised below.
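For completeness, here's roughly what the relevant chunk of my muttrc looks like after all of the above. The cache path is just an example, and header_cache only does anything if mutt was built with header cache support:

    # Cache parsed headers between runs instead of re-reading every message file.
    set header_cache="~/.cache/mutt"            # example path; pick your own

    # Trust the header cache; don't stat(2) every message file to verify it.
    set maildir_header_cache_verify=no

    # Don't keep updating the progress counter while reading the mailbox.
    set read_inc=0

    # Thread only on In-Reply-To/References; skip subject-based pseudo threads.
    set strict_threads=yes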

In the next part, I'm going to use the same techniques to investigate why mutt locks up when new LKML mail arrives.