| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-01-15 00:01:38 |
| Message-ID: | w2fqzrcwo6ofjy56e5pd7hsjdnlhc5tckgpsio77sqtgcylbvx@eeknrzc7o7ov |
| Lists: | pgsql-hackers |
Hi,
On 2026-01-15 00:26:47 +0100, Tomas Vondra wrote:
> D96 (v6):
>
>                Numa node
> Numa node     0       1
>        0   129.9   129.9
>        1   128.3   128.1
I wonder if D96 has turned on memory interleaving... These numbers are so
close to each other that they're a tad hard to believe.
>
> HB176 (v4):
>
>                Numa node
> Numa node     0       1       2       3
>        0   107.3   116.8   207.3   207.0
>        1   120.5   110.6   207.5   207.1
>        2   207.0   207.2   107.8   116.8
>        3   204.4   204.7   117.7   107.9
>
> I guess this confirms that D96 is mostly useless for evaluation of the
> NUMA patches. This is a single-socket machine, with one NUMA node per
> chiplet (I assume), and there's almost no difference in latency.
> For HB176 there clearly seems to be a difference of ~90ns between the
> sockets, i.e. the latency about doubles in some cases. Each socket has
> two chiplets - and there the story is about the same as on D96.
It looks to me like there is a latency difference of about 10ns within a
socket? Only when going between sockets is there no difference in which of
the remote nodes is accessed - which makes sense to me.
For a newer single-socket EPYC,
https://chipsandcheese.com/p/amds-epyc-9355p-inside-a-32-core
has some numbers for within-socket latencies. They also see a difference of
about 10ns between inside-socket-local and inside-socket-remote accesses.
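FWIW, the kind of matrix mlc produces can be cross-checked with a dependent
pointer chase. A minimal sketch using libnuma - the 64MB buffer, the loop
count and taking the node pair from argv are choices I made up, not anything
from the benchmarks above:

/*
 * Pin to one node, allocate the buffer on another, and walk a random
 * cyclic permutation so the prefetchers cannot hide the load latency.
 * Build with: cc -O2 chase.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NELEM	(64 * 1024 * 1024 / sizeof(size_t))	/* 64MB, well past L3 */
#define LOOPS	(16 * 1024 * 1024)

int
main(int argc, char **argv)
{
	size_t	   *buf;
	size_t		i,
				pos = 0;
	struct timespec start,
				end;

	if (argc < 3 || numa_available() < 0)
		return 1;
	if (numa_run_on_node(atoi(argv[1])) < 0)
		return 1;
	buf = numa_alloc_onnode(NELEM * sizeof(size_t), atoi(argv[2]));
	if (buf == NULL)
		return 1;

	/* single-cycle random permutation (Sattolo's algorithm) */
	for (i = 0; i < NELEM; i++)
		buf[i] = i;
	for (i = NELEM - 1; i > 0; i--)
	{
		size_t		j = random() % i;
		size_t		tmp = buf[i];

		buf[i] = buf[j];
		buf[j] = tmp;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < LOOPS; i++)
		pos = buf[pos];
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* print pos so the chase can't be optimized away */
	printf("cpu node %s, mem node %s: %.1f ns/load (%zu)\n",
		   argv[1], argv[2],
		   ((end.tv_sec - start.tv_sec) * 1e9 +
			(end.tv_nsec - start.tv_nsec)) / LOOPS, pos);
	return 0;
}

Running that for each (cpu, mem) node pair should reproduce a matrix of the
shape quoted above.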
> I did this on my old-ish Xeon too, and it's somewhere in between. There
> clearly is a difference between the sockets, but it's smaller than on
> HB176. Which matches your observation that the latency really is
> increasing over time.
FWIW https://chipsandcheese.com/p/a-look-into-intel-xeon-6s-memory has some
numbers in the "NUMA/Chiplet Characteristics" too.
One aspect of it caught my eye:
> Thus accesses to a remote NUMA node are only cached by the remote die’s
> L3. Accessing the L3 on an adjacent die increases latency by about 24
> ns. Crossing two die boundaries adds a similar penalty, increasing latency
> to nearly 80 ns for a L3 hit
Afaict that translates to an L3 hit consistently taking nearly 80ns when
accessing remote memory - that's quite something.
> I doubt the interleaving mode is enabled. It clearly is not enabled on
> the HB176 machine (otherwise we wouldn't see the difference, I think),
> and the smaller instance can be explained by having a single socket.
As you say, there obviously is no interleaving on the HB176. I do wonder
about the D96, though...
Perhaps the configuration is somehow visible in MSRs...
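Short of digging through vendor-specific MSRs, the ACPI SLIT distances the
firmware reports would at least show whether the D96's nodes are presented
as equidistant. A sketch reading the Linux sysfs view (this assumes the
usual /sys/devices/system/node layout and contiguously numbered nodes):

#include <stdio.h>

int
main(void)
{
	int			node;

	/* walk nodes until sysfs runs out; assumes no holes in numbering */
	for (node = 0;; node++)
	{
		char		path[64];
		char		line[256];
		FILE	   *f;

		snprintf(path, sizeof(path),
				 "/sys/devices/system/node/node%d/distance", node);
		if ((f = fopen(path, "r")) == NULL)
			break;
		if (fgets(line, sizeof(line), f))
			printf("node %d: %s", node, line);	/* line includes '\n' */
		fclose(f);
	}
	return 0;
}

If all the distances come back identical, the firmware is at least
presenting the nodes as symmetric, which would fit the flat D96 matrix.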
> The numbers are timings per query (avg latency reported by pgbench). I
> think this mostly aligns with the mlc results - the D96 shows no
> difference, while HB176 shows clear differences when memory/cpu get
> pinned to different sockets (but not chiplets in the same socket).
Yea, that makes sense.
> But there are some interesting details too, particularly when it comes
> to the behavior of the two queries. The "offset" query is affected by
> latency even with no parallelism (max_parallel_workers_per_gather=0),
> and it shows a ~30% hit for cross-socket runs. But for "agg" there's no
> difference in that case, and the hit is visible only with 4 or 8
> workers. That's interesting.
Huh, that *is* interesting. I guess the hardware prefetchers are good enough
to prefetch the tuple headers in this case, possibly because the tuples are
small and regular enough that everything gets prefetched in time?
E.g. https://docs.amd.com/api/khub/documents/goX~9ubv8i5r60A_Qrp3Rw/content
documents "L1 Stride Prefetcher" as
> The prefetcher uses the L1 cache memory access history of individual
> instructions to fetch additional lines when each access is a constant
> distance from the previous.
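To illustrate what that hypothesis predicts, here's a toy contrast between a
constant-stride scan (exactly the pattern that prefetcher covers) and a
dependent random chase over the same buffer, reusing the permutation trick
from the sketch above - sizes are again made up:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NELEM	(64 * 1024 * 1024 / sizeof(size_t))
#define STRIDE	(64 / sizeof(size_t))	/* one cache line */

static double
ns_between(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e9 + (b->tv_nsec - a->tv_nsec);
}

int
main(void)
{
	size_t	   *buf = malloc(NELEM * sizeof(size_t));
	size_t		i,
				sum = 0,
				pos = 0;
	struct timespec t0,
				t1,
				t2;

	if (buf == NULL)
		return 1;

	/* single-cycle random permutation, as before */
	for (i = 0; i < NELEM; i++)
		buf[i] = i;
	for (i = NELEM - 1; i > 0; i--)
	{
		size_t		j = random() % i;
		size_t		tmp = buf[i];

		buf[i] = buf[j];
		buf[j] = tmp;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NELEM; i += STRIDE)
		sum += buf[i];			/* constant stride: prefetchable */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < NELEM / STRIDE; i++)
		pos = buf[pos];			/* dependent loads: not prefetchable */
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("stride %.1f ns/line, chase %.1f ns/load (%zu %zu)\n",
		   ns_between(&t0, &t1) / (NELEM / STRIDE),
		   ns_between(&t1, &t2) / (NELEM / STRIDE),
		   sum, pos);
	return 0;
}

Run under numactl --cpunodebind/--membind to compare local vs remote memory.
If the prefetcher story is right, the stride number should barely move
cross-socket while the chase number roughly doubles - the same split as
between "agg" and "offset".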
Greetings,
Andres Freund