| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
| Cc: | Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2025-12-08 20:02:27 |
| Message-ID: | 403c813e-7d63-4ee8-a95f-4e5c1e310f4d@vondra.me |
| Lists: | pgsql-hackers |
Hi,
I've spent the last couple days considering what to do about this patch
series in this thread. The approach assumes partitioning shared memory
in a NUMA-aware way has enough benefits to justify the extra complexity.
I'm getting somewhat skeptical about that being a good trade off.
We've been able to demonstrate some significant benefits of the patches.
See for example the results from about a month ago [1], showing that in
some cases the throughput almost doubles.
But I'm skeptical about this for two reasons:
* Most of the benefit comes from patches unrelated to NUMA. The initial
patches partition clocksweep in a NUMA-oblivious way. In fact, applying
the NUMA patches often *reduces* the throughput. So if we're concerned
about contention on the clocksweep hand, we could apply just these first
patches. That way we wouldn't have to deal with huge pages.
* Furthermore, I'm not quite sure clocksweep really is a bottleneck in
realistic cases. The benchmark used in this thread does many concurrent
sequential scans, on data that exceeds shared buffers but fits into RAM.
Perhaps that happens, but I doubt it's all that common.
I've been unable to demonstrate any benefits on other workloads, even
when there are a lot of buffer misses / reads into shared buffers. As
soon as the query starts doing something else, the clocksweep contention
becomes a non-issue. Consider for example read-only pgbench with a
database much larger than shared buffers (but still within RAM). The
cost of the index scans (and other nodes) seems to reduce the pressure
on clocksweep.
So I'm skeptical about clocksweep pressure being a serious issue, except
for some very narrow benchmarks (like the concurrent seqscan test). And
even if this happened for some realistic cases, partitioning the buffers
in a NUMA-oblivious way seems to do the trick.
When discussing this stuff off list, it was suggested this might help
with the scenario Andres presented in [3], where the throughput improves
a lot with multiple databases. I've not observed that in practice, and I
don't think these patches really can help with that. That scenario is
about buffer lock contention, not clocksweep contention.
With a single database there's nothing to do - there's one contended
page, located on a single node. There'll be contention no matter which
node it ends up on. With multiple databases (or multiple root pages),
it either happens to work by chance (if the buffers happen to be from
different nodes), or it would require figuring out which buffers are
busy and placing them on different nodes. But the patches did not even
try to do anything like that. So it was still a matter of chance.
That does not mean we don't need to worry about NUMA. There's still the
issue of imbalanced allocation - with memory coming from just one node,
etc. That's a problem because it "overloads" the memory subsystem on
that particular node. But it can be solved simply by interleaving the
shared memory segment.
That sidesteps most of the complexity - it does not need to figure out
how to partition buffers, does not need to worry about huge pages, etc.
This is not a new idea - it's more or less what Jakub Wartak initially
proposed in [2], before I hijacked the thread into the more complex
approach.
Attached is a tiny patch doing mostly what Jakub did, except that it
does two things. First, it allows interleaving the shared memory on all
relevant NUMA nodes (per numa_get_mems_allowed). Second, it allows
populating all memory by setting MAP_POPULATE in mmap(). There's a new
GUC to enable each of these.
I think we should try this (much simpler) approach first, or something
close to it. Sorry for dragging everyone into a much more complex
approach, which now seems to be a dead end.
regards
[1]
https://www.postgresql.org/message-id/e4d7e6fc-b5c5-4288-991c-56219db2edd5@vondra.me
[3] https://youtu.be/V75KpACdl6E?t=2120
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| v20251208-0001-numa-Simple-interleaving-and-MAP_POPULATE.patch | text/x-patch | 8.9 KB |