Dear Btrfs, Where the fsck is my free space?
After booting my openSUSE Tumbleweed laptop this morning, I was greeted with the following error messages in the console log:
[ 459.834935] systemd-journald[706]: Failed to rotate /var/log/journal/dae9a2341321bbb6224dafc6561f8eb1/system.journal: No space left on device
[ 459.841371] systemd-journald[706]: Failed to rotate /var/log/journal/dae9a2341321bbb6224dafc6561f8eb1/user-1000.journal: No space left on device
[ 459.841551] systemd-journald[706]: Failed to write entry (24 items, 750 bytes), ignoring: Input/output error
It looked like the /var/log btrfs filesystem had run out of free space, and that systemd could not create new files for its journal.
Thinking this was a run-of-the-mill case of using up all my disk space – probably because of some huge log file – I reached for df(1) and printed the filesystem statistics.
What I saw was totally unexpected:
$ df -h /var/log
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/system-root 40G 36G 3.8G 91% /var/log
Say what? If only 91% of the filesystem was in use, then why was I hitting out of space errors?
Because, while there was plenty of available data space, I had exhausted the free metadata space.
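If you want to see that data/metadata split for yourself, btrfs can report the two pools separately; newer btrfs-progs releases even offer a combined view (shown here just as a pointer, output omitted since its layout varies by version):
$ sudo btrfs filesystem usage /var/log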
Reclaim Metadata Space with snapper
To fix this situation, so that I could create new files again, I had to delete most of the btrfs snapshots for the / filesystem.
The default installation of openSUSE Tumbleweed creates a single btrfs filesystem for the root filesystem (/), with btrfs subvolumes for various other directories. Running out of free metadata space in a subvolume (/var/log) means you've run out of space in the main volume (/).
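You can see which subvolumes share the main volume's data and metadata pools with btrfs subvolume list (the layout will differ per installation, so treat this as a sketch):
$ sudo btrfs subvolume list /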
You can query the snapshots of the root filesystem using the snapper command:
$ snapper -c root list
Type | # | Pre # | Date | User | Cleanup | Description | Userdata
-------+----+-------+--------------------------+------+---------+-----------------------+--------------
single | 0 | | | root | | current |
single | 1 | | Thu Oct 22 09:51:54 2015 | root | | first root filesystem |
pre | 11 | | Tue Apr 25 16:24:54 2017 | root | number | zypp(y2base) | important=yes
pre | 19 | | Tue Apr 25 16:35:21 2017 | root | number | zypp(zypper) | important=yes
post | 20 | 19 | Tue Apr 25 20:08:26 2017 | root | number | | important=yes
pre | 23 | | Wed Apr 26 10:46:19 2017 | root | number | zypp(zypper) | important=yes
post | 24 | 23 | Wed Apr 26 10:48:50 2017 | root | number | | important=yes
pre | 31 | | Wed Apr 26 11:04:17 2017 | root | number | zypp(zypper) | important=no
post | 32 | 31 | Wed Apr 26 11:04:24 2017 | root | number | | important=no
pre | 33 | | Wed Apr 26 11:05:52 2017 | root | number | zypp(zypper) | important=no
post | 34 | 33 | Wed Apr 26 11:05:59 2017 | root | number | | important=no
pre | 35 | | Fri Apr 28 11:13:25 2017 | root | number | zypp(zypper) | important=no
post | 36 | 35 | Fri Apr 28 11:15:23 2017 | root | number | | important=no
pre | 37 | | Wed May 3 11:48:58 2017 | root | number | zypp(zypper) | important=yes
post | 38 | 37 | Wed May 3 11:52:19 2017 | root | number | | important=yes
pre | 39 | | Fri May 5 08:47:15 2017 | root | number | zypp(zypper) | important=yes
post | 40 | 39 | Fri May 5 08:47:45 2017 | root | number | | important=yes
pre | 41 | | Thu May 11 09:21:08 2017 | root | number | zypp(zypper) | important=yes
The pre and post snapshots are generated before and after YaST runs, respectively. As you can see from the snapshot descriptions, YaST was executed on behalf of zypper, the openSUSE package manager.
Every time I update packages with zypper update, two btrfs snapshots are created. Because openSUSE Tumbleweed is a rolling release version of openSUSE, this happens a lot.
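Incidentally, you can create the same kind of pre/post pair around any command yourself; snapper's create subcommand accepts a --command option for this (a sketch, with a hypothetical description):
$ snapper -c root create --command "zypper update" --description "manual update"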
To free up some metadata space, I deleted every snapshot apart from the current snapshot (#0) and the initial snapshot (#1).
WARNING: The whole point of creating snapshots is to allow zypper to roll back to previous states if you encounter issues. That will become impossible if you follow the next step.
$ # delete every snapshot except #0 (current) and #1 (first root filesystem)
$ for snapshot in $(seq 41 -1 31) 24 23 20 19 11; do
    snapper -c root rm $snapshot
  done
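snapper can also take several snapshot numbers in a single invocation, so the loop above can be collapsed into one command with the same effect (assuming the same snapshot numbers):
$ snapper -c root rm 11 19 20 23 24 31 32 33 34 35 36 37 38 39 40 41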
After that, it was possible to create files in /var/log again, and systemd was much happier.
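As a quick sanity check that writes were possible again (any throwaway file will do):
$ sudo touch /var/log/write-test && sudo rm /var/log/write-test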
If you’re curious to know how I debugged this issue, and how you can do the same for your btrfs filesystems, read on.
Getting Serious with Btrfs Utilities
One of the most surprising things about this whole ordeal was that btrfs made zero attempts to tell me that I was hitting a metadata ENOSPC issue; there was no error message of any kind in the console log.
I think it's a natural instinct to pull out df(1) when you hit out-of-space issues. But when it showed that I had plenty of free space on the /var/log partition, I was stumped.
It turns out that btrfs has its own utilities for inspecting its filesystems; all of them are described in the man page btrfs(8). The one I needed was btrfs-filesystem(8), or btrfs fi for short.
$ btrfs fi df /var
Data, single: total=36.44GiB, used=32.65GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.75GiB, used=1.31GiB
GlobalReserve, single: total=448.00MiB, used=0.00B
Looking at the Metadata line shows that there’s roughly 0.44GiB of metadata space available. So why the error?
On every btrfs filesystem, some space is reserved so that the kernel can always perform critical operations, like deleting files, even when the filesystem is full. The GlobalReserve line above gives the size of this emergency space.
Crucially, the GlobalReserve space is taken from the Metadata pool, so you need to add the used Metadata size and the total GlobalReserve size to calculate how much free metadata space you actually have.
The btrfs-filesystem(8) man page even has this very helpful equation:
In case the filesystem metadata are exhausted, \(GlobalReserve_{total} + Metadata_{used} = Metadata_{total}\)
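Plugging in the numbers from the btrfs fi df output above confirms this was exactly my situation: \(448\,\text{MiB} + 1.31\,\text{GiB} \approx 0.44\,\text{GiB} + 1.31\,\text{GiB} = 1.75\,\text{GiB}\), which is \(Metadata_{total}\). The metadata pool really was exhausted.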
After I deleted the snapshots, the filesystem statistics painted a much better picture:
$ btrfs fi df /var
Data, single: total=36.44GiB, used=20.37GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.75GiB, used=661.02MiB
GlobalReserve, single: total=224.00MiB, used=0.00B
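Running the same check against the new numbers shows real headroom: \(224\,\text{MiB} + 661.02\,\text{MiB} \approx 0.86\,\text{GiB}\), comfortably below the \(1.75\,\text{GiB}\) \(Metadata_{total}\), leaving roughly 0.9GiB of usable metadata space.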
A Permanent Fix
The SUSE btrfs developers are aware of this issue, but the bug is assigned to the kernel developers, who are thinking about how to make this scenario less painful at the kernel level.
Which raises the question: Shouldn't snapper be deleting old snapshots so that I don't run out of disk space and hit this issue again?
While snapper can delete snapshots once a certain number have been created, it does not provide a way to reduce that number if metadata space starts to run out on the filesystem.
My only hope for a permanent fix today is to limit snapper to keeping a single pre and a single post snapshot for important and unimportant updates. That, and pray they don't include too many files and directories.
Again, note that the following suggestion is going to limit how far you can roll back your system, should it stop working properly.
You can make the fix permanent by modifying /etc/snapper/configs/root with the following settings:
# limit for number cleanup
NUMBER_MIN_AGE="1800"        # seconds a snapshot must age before cleanup may remove it
NUMBER_LIMIT="2"             # keep at most 2 regular snapshots (one pre/post pair)
NUMBER_LIMIT_IMPORTANT="2"   # keep at most 2 snapshots marked important=yes
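snapper normally applies these limits from its periodic cleanup job, but you can also trigger the number-based cleanup by hand (see the cleanup subcommand in snapper(8)):
$ snapper -c root cleanup number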
Fingers crossed.