| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-01-14 23:26:47 |
| Message-ID: | 2e1be441-5608-4bc8-9ac4-6fad9a060db4@vondra.me |
| Lists: | pgsql-hackers |
On 1/13/26 15:14, Andres Freund wrote:
> Hi,
>
> On 2026-01-13 02:13:40 +0100, Tomas Vondra wrote:
>> On the azure VM (scale 200, 32GB sb), there's still no difference:
>
> One possibility is that the host is configured with memory interleaving. That
> configures the memory so that physical memory addresses interleave between the
> different NUMA nodes, instead of really being node local. That can help avoid
> bad performance characteristics for NUMA naive applications.
>
> I don't quite know how to figure that out though, particularly from within a
> VM :(. Even something like https://github.com/nviennot/core-to-core-latency
> or intel's mlc will not necessarily be helpful, because it depends on which
> node the measured cacheline ends up on.
>
> But I'd probably still test it, just to see whether you're observing very
> different latencies between the systems.
>
I did this on the two Azure instances I've been using for testing (D96
and HB176), and I got this:
D96 (v6), latency in ns:

                Numa node
Numa node           0       1
        0       129.9   129.9
        1       128.3   128.1

HB176 (v4), latency in ns:

                Numa node
Numa node           0       1       2       3
        0       107.3   116.8   207.3   207.0
        1       120.5   110.6   207.5   207.1
        2       207.0   207.2   107.8   116.8
        3       204.4   204.7   117.7   107.9
I guess this confirms that D96 is mostly useless for evaluating the
NUMA patches. It's a single-socket machine, with one NUMA node per
chiplet (I assume), and there's essentially no difference in latency
between the nodes.
For HB176 there clearly is a difference of ~90ns between the sockets,
i.e. the latency roughly doubles for cross-socket access. Each socket
has two chiplets, and within a socket the story is about the same as
on D96 - the cross-chiplet penalty is only ~10ns.
I did this on my old-ish Xeon too, and it's somewhere in between: there
clearly is a difference between the sockets, but it's smaller than on
HB176. That matches your observation that the latency really is
increasing over time.
I doubt memory interleaving is enabled. It clearly is not enabled on
the HB176 machine (otherwise we wouldn't see the difference, I think),
and the flat numbers on the smaller instance can be explained by it
having a single socket.
The complete mlc results are attached.
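(As a quick sanity check on the topology, independent of mlc, the
distance matrix the kernel/firmware reports can be dumped with a
trivial libnuma program - a minimal sketch of that is below, assuming
the libnuma headers are installed and the binary is linked with
-lnuma. The values are unitless ACPI SLIT distances, so they only show
the shape of the topology, not nanoseconds.)

/*
 * numa_distances.c - print the NUMA distance matrix reported by the
 * kernel/firmware (ACPI SLIT). These are unitless relative distances,
 * not measured latencies.
 *
 * Build: gcc numa_distances.c -o numa_distances -lnuma
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int
main(void)
{
    int     nodes, i, j;

    if (numa_available() < 0)
    {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    nodes = numa_num_configured_nodes();

    /* header row with node numbers */
    printf("node ");
    for (j = 0; j < nodes; j++)
        printf("%6d", j);
    printf("\n");

    /* one row per node, distance to every other node */
    for (i = 0; i < nodes; i++)
    {
        printf("%4d ", i);
        for (j = 0; j < nodes; j++)
            printf("%6d", numa_distance(i, j));
        printf("\n");
    }

    return EXIT_SUCCESS;
}

If the firmware exposes sane SLIT data, the shape should roughly match
the mlc matrices above (10 on the diagonal, larger values
cross-socket), but it says nothing about the absolute latencies.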
I've also done a bigger SQL test, pinning the memory/backends to
different nodes, for a range of scales and the two queries (agg and
offset). I'm attaching results for scales 100 and 10000 from the D96
and HB176 instances.
The numbers are timings per query (average latency reported by
pgbench). I think this mostly aligns with the mlc results - D96 shows
no difference, while HB176 shows clear differences when memory/CPU get
pinned to different sockets (but not to different chiplets within the
same socket).
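(The cross-node penalty can also be reproduced outside Postgres with a
standalone probe - the following is just a rough sketch for that kind
of check, not what produced the attached numbers. It assumes libnuma:
pin execution to one node, allocate a buffer on another node, and do a
dependent pointer chase so every load waits for the previous one.)

/*
 * numa_chase.c - crude cross-node memory latency probe (illustration
 * only). Build: gcc -O2 numa_chase.c -o numa_chase -lnuma
 * Usage: ./numa_chase <cpu-node> <mem-node>
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>

/* 2GB of pointers - needs to comfortably exceed the CPU caches */
#define NPTRS   (2UL * 1024 * 1024 * 1024 / sizeof(void *))
#define NLOADS  (16UL * 1024 * 1024)

int
main(int argc, char **argv)
{
    int         cpu_node, mem_node;
    void      **buf;
    size_t     *idx;
    size_t      i;
    void       *p;
    struct timespec ts, te;
    double      ns;

    if (argc != 3 || numa_available() < 0)
    {
        fprintf(stderr, "usage: %s <cpu-node> <mem-node>\n", argv[0]);
        return EXIT_FAILURE;
    }

    cpu_node = atoi(argv[1]);
    mem_node = atoi(argv[2]);

    /* run on the CPUs of one node, put the buffer on another */
    if (numa_run_on_node(cpu_node) != 0)
    {
        perror("numa_run_on_node");
        return EXIT_FAILURE;
    }

    buf = numa_alloc_onnode(NPTRS * sizeof(void *), mem_node);
    idx = malloc(NPTRS * sizeof(size_t));
    if (buf == NULL || idx == NULL)
    {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    /* build a random permutation and link the buffer into one cycle */
    for (i = 0; i < NPTRS; i++)
        idx[i] = i;
    srandom(42);
    for (i = NPTRS - 1; i > 0; i--)
    {
        size_t  j = random() % (i + 1);
        size_t  tmp = idx[i];

        idx[i] = idx[j];
        idx[j] = tmp;
    }
    for (i = 0; i < NPTRS - 1; i++)
        buf[idx[i]] = (void *) &buf[idx[i + 1]];
    buf[idx[NPTRS - 1]] = (void *) &buf[idx[0]];
    free(idx);

    /* chase the pointers and report the average time per load */
    p = (void *) buf;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    for (i = 0; i < NLOADS; i++)
        p = *(void **) p;
    clock_gettime(CLOCK_MONOTONIC, &te);

    ns = (te.tv_sec - ts.tv_sec) * 1e9 + (te.tv_nsec - ts.tv_nsec);
    printf("cpu node %d, mem node %d: %.1f ns/load (last %p)\n",
           cpu_node, mem_node, ns / NLOADS, p);

    numa_free(buf, NPTRS * sizeof(void *));
    return EXIT_SUCCESS;
}

Running it with matching vs. non-matching node numbers (say 0/0 vs.
0/2 on HB176) should show roughly the same local/remote gap as the mlc
matrix, assuming the VM exposes the nodes faithfully.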
But there are some interesting details too, particularly when it comes
to the behavior of the two queries. The "offset" query is affected by
latency even with no parallelism (max_parallel_workers_per_gather=0),
showing a ~30% hit for cross-socket runs. For "agg" there's no
difference in that case, and the hit only becomes visible with 4 or 8
workers. That's interesting.
Anyway, my plan at this point is to revive the old patch (from before
changing direction to the simple patch), and see if we can observe a
difference on the "right" hardware. Maybe some of the earlier results
showing no improvement were simply due to running on the "wrong"
hardware. This workload seems much more realistic.
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| numa-xeon.txt | text/plain | 2.0 KB |
| numa-hb176-epyc-9v33x.txt | text/plain | 2.6 KB |
| numa-d96-epyc-9v74.txt | text/plain | 1.8 KB |
| d96-100.pdf | application/pdf | 36.1 KB |
| d96-10000.pdf | application/pdf | 36.1 KB |
| hb176-10000.pdf | application/pdf | 38.8 KB |
| hb176-100.pdf | application/pdf | 38.5 KB |