| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-01-13 01:13:40 |
| Message-ID: | 2db78610-b480-4aa0-a1b6-57f1c2dcb708@vondra.me |
| Lists: | pgsql-hackers |
On 1/13/26 01:24, Andres Freund wrote:
> Hi,
>
> On 2026-01-12 19:10:00 -0500, Andres Freund wrote:
>> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
>>> On 1/10/26 02:42, Andres Freund wrote:
>>>> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
>>
>>> And then I initialized pgbench with a scale that is much larger than
>>> shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
>>> then I ran
>>>
>>> select * from pgbench_accounts offset 1000000000;
>>>
>>> which does a sequential scan with the circular buffer you mention above
>>
>> Did you try it with the query I suggested? One plausible reason why you did
>> not see an effect with your query is that with a huge offset you actually
>> never deform the tuple, which is an important and rather latency sensitive
>> path.
>
> Btw, this doesn't need anywhere close to as much data, it should be visible as
> soon as you're >> L3.
>
> To show why
> SELECT * FROM pgbench_accounts OFFSET 100000000
> doesn't show an effect but
> SELECT sum(abalance) FROM pgbench_accounts;
>
> does, just look at the difference using the perf command I posted. Here on a
> scale 200.
>
OK, I tried with a smaller scale (and larger shared buffers, to make
the data set smaller than NBuffers/4).
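
For reference, a sketch of the setup (settings as above, paths and
details illustrative):

    # scale 200 is roughly 3GB of data, well below 32GB/4, so the
    # seqscan ring buffer does not kick in
    pgbench -i -s 200

    # larger shared buffers, needs a restart to take effect
    psql -Xq -c "ALTER SYSTEM SET shared_buffers = '32GB';"
    pg_ctl restart -D "$PGDATA"

    # warm the cache, so the scan is served from shared buffers
    psql -Xq -c 'SELECT count(*) FROM pgbench_accounts;'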
On the Azure VM (scale 200, 32GB shared buffers), there's still no
difference:

    numactl --membind 0 --cpunodebind 0    297.770 ms
    numactl --membind 0 --cpunodebind 1    297.924 ms

and on the xeon (scale 100, 8GB shared buffers), there's a clear
difference:

    numactl --membind 0 --cpunodebind 0    236.451 ms
    numactl --membind 0 --cpunodebind 1    298.418 ms

So the node-local run is roughly 20% faster. There's also a clear
difference in the perf memory-bandwidth numbers, about 5944.3 MB/s
vs. 5202.3 MB/s.
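
Each run looked roughly like this (a sketch, not the exact script; the
bindings are applied at server start so the backends inherit the NUMA
policy, and the timings come from \timing in psql):

    for cpunode in 0 1; do
        pg_ctl stop -D "$PGDATA"

        # memory always on node 0, CPUs either local (0) or remote (1)
        numactl --membind 0 --cpunodebind $cpunode \
            pg_ctl start -D "$PGDATA" -l logfile

        # drop and re-warm shared buffers, which all live on node 0
        psql -Xq -c 'SELECT pg_buffercache_evict_all();'
        psql -Xq -c 'SELECT count(*) FROM pgbench_accounts;'

        perf stat --per-socket \
            -M memory_bandwidth_read,memory_bandwidth_write -a \
            psql -Xq -c 'SELECT sum(abalance) FROM pgbench_accounts;'
    done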
>
> Interestingly I do see a performance difference, albeit a smaller one, even
> with OFFSET. I see similar numbers on two different 2 socket machines.
>
I wonder how significant the number of sockets is. The Azure VM is a
single socket with 2 NUMA nodes, so maybe the latency differences
between the nodes are not large enough to affect this kind of test.

The xeon is a 2-socket machine, but it's also older (~10 years).
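
One way to sanity-check that is the distance matrix reported by
numactl; NUMA nodes within a single socket typically report a much
smaller distance than nodes on different sockets. Illustrative output
(not from either of these machines):

    $ numactl --hardware | grep -A 3 'node distances'
    node distances:
    node   0   1
      0:  10  12
      1:  12  10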
regards
--
Tomas Vondra