hur.st's bl.aagh

BSD, Ruby, Rust, Rambling

Excessive Swapping with FreeBSD NUMA

Poor VM behaviour with ZFS ARC and unbalanced NUMA domains

[freebsd]

Early last year I finally diagnosed a long-standing issue with my FreeBSD machine, which always had gobs of memory free but frequently encountered swap pressure for no discernible reason.

This is a slightly edited copy of an unpublished report I wrote back then, prior to the hardware being retired. Maybe it’ll be useful to someone.


When migrating data from one large ZFS pool to another, I found my swap device filling up completely, despite enormous amounts of free memory (>50GB) and a relatively severe arc_max limit (64GB for a 160GB machine).

Further investigation showed that NUMA domain 0 (96GB) held the majority of the system’s free memory (90%+), while domain 1 (64GB) kept dipping below its free_target, and rather than spilling allocations over to the other domain, the system was choosing to swap out pages.
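If you want to watch this happen, the per-domain page counters are exposed under the vm.domain sysctl tree on FreeBSD 12 and later. The exact leaf names may differ slightly between releases, so dumping the whole stats node is the safest bet; counts are in 4KiB pages:

# How many memory domains the VM system is managing
sysctl vm.ndomains

# Per-domain paging statistics, including free_count and free_target
sysctl vm.domain.0.stats
sysctl vm.domain.1.stats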

Dropping arc_max to 32GB has improved the situation, but is obviously far from ideal when so much memory is available.
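For reference, the knob in question is the ARC size cap, vfs.zfs.arc_max (newer OpenZFS-based releases also accept vfs.zfs.arc.max). It takes a value in bytes in /boot/loader.conf, and can also be changed on a running system:

# /boot/loader.conf: cap the ARC at 32GB (value in bytes)
vfs.zfs.arc_max="34359738368"

# Or adjust it live
sysctl vfs.zfs.arc_max=34359738368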

This PDF of slides from MeetBSD 2018 contains a notable quote under “Low memory handling”: “Does not work well if most of a domain is wired (e.g., by ARC)”.

And similar behaviour is mentioned in the Phabricator review from when NUMA was enabled by default.

I’ve seen many complaints about FreeBSD swapping with ZFS seemingly without reason – and as mentioned I’ve run into this behaviour myself repeatedly – but I’ve never seen unbalanced NUMA domains suggested as a potential cause.

To give some numbers, here’s a small sample of NUMA page activity from numastat during a large file copy, after increasing arc_max from 32GB back to 64GB.

DOMAIN      ACTIVE  INACTIVE   LAUNDRY      FREE
--------  --------  --------  --------  --------
0            1.91G    18.93G     4.91G    42.92G
1            1.76G   521.39M     5.24G     9.90G

After a few seconds it’s obvious ARC is completely favouring domain 1, eventually pushing it down to its free page limits:

0            1.90G    18.93G     4.91G    42.91G
1            1.76G   521.50M     5.24G     2.20G
--------  --------  --------  --------  --------
0            1.90G    18.93G     4.91G    42.91G
1            1.76G   521.50M     5.24G     1.66G
--------  --------  --------  --------  --------
0            1.90G    18.93G     4.91G    45.05G
1            1.76G   521.50M     5.24G     2.96G
--------  --------  --------  --------  --------
0            1.90G    18.93G     4.91G    46.24G
1            1.76G   521.50M     5.24G     3.97G

After perhaps a dozen more seconds it’s dipped low enough to spill several gigabytes into swap:

0            1.91G    18.93G     4.91G    45.54G
1            1.72G    37.23M     5.48G     1.36G
--------  --------  --------  --------  --------
0            1.91G    18.93G     4.91G    45.39G
1            2.40G    12.38M     4.75G     1.06G
--------  --------  --------  --------  --------
0            1.91G    18.93G     4.91G    45.00G
1            2.48G    51.25M     4.49G   958.32M
--------  --------  --------  --------  --------
0            1.91G    18.94G     4.90G    44.90G
1            1.98G    11.13M     4.83G     1.10G
--------  --------  --------  --------  --------
0            1.91G    18.97G     4.87G    44.87G
1            2.24G     8.42M     4.42G     1.24G
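For anyone wanting to collect the same figures: numastat isn’t a base system tool, but a rough equivalent can be cobbled together from the same vm.domain sysctls mentioned earlier. A sketch, assuming 4KiB pages and the stats leaf names as found on recent releases:

#!/bin/sh
# Sample per-domain page queue sizes once a second.
# 262144 4KiB pages per GiB.
while :; do
    for dom in $(jot $(sysctl -n vm.ndomains) 0); do
        act=$(sysctl -n vm.domain.${dom}.stats.active)
        inact=$(sysctl -n vm.domain.${dom}.stats.inactive)
        laun=$(sysctl -n vm.domain.${dom}.stats.laundry)
        free=$(sysctl -n vm.domain.${dom}.stats.free_count)
        echo "${dom} ${act} ${inact} ${laun} ${free}" |
            awk '{ printf "%-8s %9.2fG %9.2fG %9.2fG %9.2fG\n",
                   $1, $2/262144, $3/262144, $4/262144, $5/262144 }'
    done
    echo "--------  --------  --------  --------  --------"
    sleep 1
done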

As a workaround, I ended up disabling NUMA support in /boot/loader.conf:

vm.numa.disabled=1

This combines both domains into a single unit, at some performance cost. I considered this a fair trade for not constantly running out of swap space!
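After a reboot it’s easy to confirm the tunable took effect and keep an eye on swap with stock tools; with NUMA disabled, vm.ndomains should drop back to 1:

# Should now report 1
sysctl vm.ndomains

# Swap devices and usage, human-readable
swapinfo -h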