Excessive Swapping with FreeBSD NUMA
Poor VM behaviour with ZFS ARC and unbalanced NUMA domains
Early last year I finally diagnosed a long-standing issue with my FreeBSD machine, which always had gobs of memory to spare but frequently encountered swap pressure for no discernible reason.
This is a slightly edited copy of an unpublished report I wrote back then, prior to the hardware being retired. Maybe it’ll be useful to someone.
When migrating data from one large ZFS pool to another, I found my swap device filling up completely, despite enormous amounts of free memory (>50GB) and a relatively severe arc_max limit (64GB for a 160GB machine).
Further investigation showed that NUMA domain 0 (96GB) held the majority of the system’s free memory (90%+), while domain 1 (64GB) kept dipping below its free_target, and rather than spilling over to the other domain, the system was choosing to swap out pages.
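If you want to watch this happen yourself, the per-domain paging counters are exported as sysctls on NUMA-aware releases (FreeBSD 12 and later); comparing free_count against free_target for each domain is the quickest check. The exact OID layout may differ slightly between releases, and the counts are in pages rather than bytes:

# per-domain page counters; compare free_count against free_target
sysctl vm.domain.0.stats
sysctl vm.domain.1.stats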
Dropping arc_max to 32GB has improved the situation, but is obviously far from ideal when so much memory is available.
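For completeness, the ARC cap here is the usual vfs.zfs.arc_max knob. Something like the following sets it to 32GB, either live or persistently; on older releases with the legacy ZFS code it may only take effect as a boot-time tunable:

# one-off, at runtime (value in bytes; 32GB shown)
sysctl vfs.zfs.arc_max=34359738368
# persistent, in /boot/loader.conf
vfs.zfs.arc_max="34359738368"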
This PDF of slides from MeetBSD 2018 contains a notable quote under “Low memory handling”: “Does not work well if most of a domain is wired (e.g., by ARC)”
Similar behaviour is mentioned in the Phabricator review that enabled NUMA by default.
I’ve seen many complaints about FreeBSD swapping with ZFS seemingly without reason – and as mentioned I’ve run into this behaviour myself repeatedly – but I’ve never seen this identified as a potential cause.
To give some numbers, here’s a small sample of NUMA page activity from numastat during a large file copy, taken after increasing arc_max from 32GB back to 64GB.
DOMAIN ACTIVE INACTIVE LAUNDRY FREE
-------- -------- -------- -------- --------
0 1.91G 18.93G 4.91G 42.92G
1 1.76G 521.39M 5.24G 9.90G
After a few seconds it’s obvious ARC is completely favouring domain 1, eventually pushing it into its free page limits:
0 1.90G 18.93G 4.91G 42.91G
1 1.76G 521.50M 5.24G 2.20G
-------- -------- -------- -------- --------
0 1.90G 18.93G 4.91G 42.91G
1 1.76G 521.50M 5.24G 1.66G
-------- -------- -------- -------- --------
0 1.90G 18.93G 4.91G 45.05G
1 1.76G 521.50M 5.24G 2.96G
-------- -------- -------- -------- --------
0 1.90G 18.93G 4.91G 46.24G
1 1.76G 521.50M 5.24G 3.97G
After perhaps a dozen more seconds it’s dipped low enough to spill several gigabytes into swap:
0 1.91G 18.93G 4.91G 45.54G
1 1.72G 37.23M 5.48G 1.36G
-------- -------- -------- -------- --------
0 1.91G 18.93G 4.91G 45.39G
1 2.40G 12.38M 4.75G 1.06G
-------- -------- -------- -------- --------
0 1.91G 18.93G 4.91G 45.00G
1 2.48G 51.25M 4.49G 958.32M
-------- -------- -------- -------- --------
0 1.91G 18.94G 4.90G 44.90G
1 1.98G 11.13M 4.83G 1.10G
-------- -------- -------- -------- --------
0 1.91G 18.97G 4.87G 44.87G
1 2.24G 8.42M 4.42G 1.24G
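The swap growth itself is easy to confirm from another terminal while this is going on, for example:

# current swap usage, human-readable
swapinfo -h
# cumulative swap pager activity since boot
vmstat -s | grep -i swap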
As a workaround, I ended up disabling NUMA support in /boot/loader.conf:
vm.numa.disabled=1
This combines both domains into a single unit, at some performance cost. I considered this a fair trade for not constantly running out of swap space!
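After a reboot with that tunable set, vm.ndomains should report a single domain, which is a quick way to confirm the setting took effect:

sysctl vm.ndomains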