Re: Adding basic NUMA awareness

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2026-01-13 00:24:34
Message-ID: rsjxzhnxxfq5i5yxv66mhinb42o3vzmqpkbfpexpkk5prreh2l@jyp73gsyyfzn
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2026-01-12 19:10:00 -0500, Andres Freund wrote:
> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
> > On 1/10/26 02:42, Andres Freund wrote:
> > > psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
>
> > And then I initialized pgbench with scale that is much larger than
> > shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
> > then I ran
> >
> > select * from pgbench_accounts offset 1000000000;
> >
> > which does a sequential scan with the circular buffer you mention abobe
>
> Did you try it with the query I suggested? One plausible reason why you did
> not see an effect with your query is that with a huge offset you actually
> never deform the tuple, which is an important and rather latency sensitive
> path.

Btw, this doesn't need anywhere close to as much data, it should be visible as
soon as you're >> L3.

To show why
SELECT * FROM pgbench_accounts OFFSET 100000000
doesn't show an effect but
SELECT sum(abalance) FROM pgbench_accounts;

does, just look at the difference using the perf command I posted. Here on a
scale 200.

numactl --membind 0 --cpunodebind 0

offset:

S0 6 47,138,135 UNC_M_CAS_COUNT.WR # 3884.1 MB/s memory_bandwidth_write
S0 20 780,343,577 duration_time
S0 6 61,685,331 UNC_M_CAS_COUNT.RD # 5082.8 MB/s memory_bandwidth_read
S0 20 780,353,818 duration_time
S1 6 1,238,568 UNC_M_CAS_COUNT.WR # 102.1 MB/s memory_bandwidth_write
S1 6 1,475,224 UNC_M_CAS_COUNT.RD # 121.6 MB/s memory_bandwidth_read

0.776715450 seconds time elapsed

agg:

S0 6 53,145,706 UNC_M_CAS_COUNT.WR # 2000.8 MB/s memory_bandwidth_write
S0 20 1,706,046,493 duration_time
S0 6 111,390,488 UNC_M_CAS_COUNT.RD # 4193.5 MB/s memory_bandwidth_read
S0 20 1,706,057,341 duration_time
S1 6 3,968,454 UNC_M_CAS_COUNT.WR # 149.4 MB/s memory_bandwidth_write
S1 6 4,026,212 UNC_M_CAS_COUNT.RD # 151.6 MB/s memory_bandwidth_read

numactl --membind 0 --cpunodebind 1

offset:

S0 6 91,982,003 UNC_M_CAS_COUNT.WR # 7036.4 MB/s memory_bandwidth_write
S0 20 842,785,290 duration_time
S0 6 113,076,316 UNC_M_CAS_COUNT.RD # 8650.1 MB/s memory_bandwidth_read
S0 20 842,797,430 duration_time
S1 6 1,545,612 UNC_M_CAS_COUNT.WR # 118.2 MB/s memory_bandwidth_write
S1 6 2,354,087 UNC_M_CAS_COUNT.RD # 180.1 MB/s memory_bandwidth_read

0.836623794 seconds time elapsed

agg:

S0 6 133,267,754 UNC_M_CAS_COUNT.WR # 3980.9 MB/s memory_bandwidth_write
S0 20 2,146,221,284 duration_time
S0 6 159,951,549 UNC_M_CAS_COUNT.RD # 4777.9 MB/s memory_bandwidth_read
S0 20 2,146,233,675 duration_time
S1 6 71,543,708 UNC_M_CAS_COUNT.WR # 2137.1 MB/s memory_bandwidth_write
S1 6 49,584,957 UNC_M_CAS_COUNT.RD # 1481.2 MB/s memory_bandwidth_read

2.142535432 seconds time elapsed

Note how much bigger the absolute numbers of reads and writes are for the
aggregate compared to the offset.

Interestingly I do see a performance difference, albeit a smaller one, even
with OFFSET. I see similar numbers on two different 2 socket machines.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-01-13 00:33:56 Re: Buffer locking is special (hints, checksums, AIO writes)
Previous Message David G. Johnston 2026-01-13 00:20:28 Re: docs: clarify ALTER TABLE behavior on partitioned tables