Identify huge pages accessibility using madvise

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Identify huge pages accessibility using madvise
Date: 2024-04-13 16:22:55
Message-ID: 20240413162255.56xzlbhoolw2vyqv@ddolgov.remote.csb
Lists: pgsql-hackers

Hi,

I would like to propose a small patch to address an annoying issue with
the way PostgreSQL falls back when "huge_pages = try" is set. Here is
what the problem looks like:

* PostgreSQL is starting on a machine with some huge pages available

* It tries to identify that fact and does mmap with MAP_HUGETLB, which
succeeds

* But it has the pleasure of running inside a cgroup with a hugetlb
controller and limits set to 0 (or anything less than PostgreSQL
needs)

* Under these circumstances PostgreSQL will proceed with allocating huge
pages, but the first page fault will trigger SIGBUS

I've sketched out how to reproduce it with cgroup v1 and v2 in the
attached scripts.

This sounds like quite a rare combination of factors, but apparently
it's fairly easy to run into on K8s/OpenShift. A bug about this
behaviour was reported some time ago [1], and back then I was under the
impression that it was a solved matter with nothing left to do. Yet I
still observe this type of issue, the latest one no more than a week ago.

After some research I found what looks to me like a relatively simple
way to address the problem. Linux kernel 5.14 introduced a new madvise
flag that might be just what we need here. It's called
MADV_POPULATE_READ [2], and it tells the kernel to populate page tables
by triggering read faults if required. One by-design feature of this
flag is that the madvise call fails in situations like the one above,
giving us an opportunity to avoid SIGBUS.
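
To illustrate the idea, here is a minimal standalone sketch. It is not the
attached patch: the helper name map_try_huge_pages, the 2 MB huge page size
and the choice to keep the huge page mapping when the kernel does not know
the flag (EINVAL) are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Older libc headers may lack the flag; 22 is its value in the Linux UAPI. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif

/* Assumption for this sketch: 2 MB huge pages (the usual x86-64 default). */
#define SKETCH_HUGE_PAGE_SIZE ((size_t) 2 * 1024 * 1024)

/*
 * Hypothetical helper, not taken from the patch. Maps `size` bytes of
 * anonymous shared memory, preferring huge pages, and probes the mapping
 * with MADV_POPULATE_READ so that an unusable huge page mapping shows up
 * as a failed madvise() call rather than as SIGBUS on first access.
 * Returns NULL if even the regular mapping fails.
 */
void *
map_try_huge_pages(size_t size, bool *used_huge)
{
    void *ptr;

    /* MAP_HUGETLB mappings must be a multiple of the huge page size. */
    assert(size > 0 && size % SKETCH_HUGE_PAGE_SIZE == 0);
    *used_huge = false;

#ifdef MAP_HUGETLB
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr != MAP_FAILED)
    {
        /*
         * Success means the pages are really there. EINVAL is what a
         * kernel without this flag returns; in that case we cannot probe
         * and (as a policy choice of this sketch) keep the mapping.
         */
        if (madvise(ptr, size, MADV_POPULATE_READ) == 0 || errno == EINVAL)
        {
            *used_huge = true;
            return ptr;
        }

        /* Huge pages were mapped but cannot be faulted in: give them up. */
        munmap(ptr, size);
    }
#endif

    /* Fallback: the same mapping with regular pages. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}
```

A caller would pass a size rounded up to the huge page size and check
used_huge afterwards to see which kind of mapping it got.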

I've outlined a patch to implement this approach and tested it on a
newish Linux kernel I've got lying around (6.9.0-rc1): no SIGBUS, and
PostgreSQL falls back to not using huge pages. The change seems small
enough to justify addressing this small but annoying issue. Any
thoughts or comments on the proposal?

[1]: https://www.postgresql.org/message-id/flat/HE1PR0701MB256920EEAA3B2A9C06249F339E110%40HE1PR0701MB2569.eurprd07.prod.outlook.com
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4ca9b3859dac14bbef0c27d00667bb5b10917adb

Attachment                                                    Content-Type      Size
v1-0001-Identify-huge-pages-accesibility-using-madvise.patch  text/plain        3.3 KB
sigbus.sh                                                     application/x-sh  523 bytes
cgroup-v1.sh                                                  application/x-sh  415 bytes
cgroup-v2.sh                                                  application/x-sh  599 bytes