| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Adding basic NUMA awareness |
| Date: | 2026-01-13 01:08:31 |
| Message-ID: | 4jqqz454it4ims43k5ftzjfci5l4klhw5rw54xqx4pyirm7pqr@kdyywhl2zro6 |
| Lists: | pgsql-hackers |
Hi,
On 2026-01-13 01:51:09 +0100, Tomas Vondra wrote:
> On 1/13/26 01:10, Andres Freund wrote:
> > Hi,
> >
> > On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
> >> On 1/10/26 02:42, Andres Freund wrote:
> >>> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
> >
> >> And then I initialized pgbench with scale that is much larger than
> >> shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
> >> then I ran
> >>
> >> select * from pgbench_accounts offset 1000000000;
> >>
> >> which does a sequential scan with the circular buffer you mention above
> >
> > Did you try it with the query I suggested? One plausible reason why you did
> > not see an effect with your query is that with a huge offset you actually
> > never deform the tuple, which is an important and rather latency sensitive
> > path.
> >
>
> I did try with the agg query too, and there's still no difference on
> either machine.
Could you provide `numactl --hardware` output for both machines? There may be
more than two NUMA nodes on a system with 2 sockets, due to one socket being
split into two - in which case the latency between nodes 0 and 1 might be a
lot lower than between, say, 0 and 3.
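For reference, a sketch of what to look for in that output - the node count and the distance matrix are the relevant bits. The concrete values below are illustrative (a hypothetical 2-socket machine with sub-NUMA clustering), not from either of the machines discussed:

```shell
# Show NUMA topology: node count, per-node memory, and the distance matrix.
numactl --hardware

# Illustrative output for a 2-socket box split into 4 NUMA nodes:
#
#   available: 4 nodes (0-3)
#   ...
#   node distances:
#   node   0   1   2   3
#     0:  10  12  21  21
#     1:  12  10  21  21
#     2:  21  21  10  12
#     3:  21  21  12  10
#
# Here 0<->1 (distance 12) are on the same socket, so cross-node traffic
# between them is much cheaper than 0<->3 (distance 21) across sockets.
```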
> I can't do the perf on the Azure VM, because the Ubuntu image is
> borked and does not allow installing the package. But on my xeon I can
> do the perf, and that gives me this:
>
> numactl --membind=0 --cpunodebind=0 ~/builds/master-test/bin/pg_ctl
> -----------------------------------------------------------------------
> S0 1 24,677,226 UNC_M_CAS_COUNT.WR # 79.0 MB/s ... idth_write
> S0 1 20,001,829,522 ns duration_time ...
> S0 1 972,631,426 UNC_M_CAS_COUNT.RD # 3112.2 MB/s ... idth_read
> S0 1 20,001,822,807 ns duration_time ...
> S1 1 15,602,233 UNC_M_CAS_COUNT.WR # 49.9 MB/s ... idth_write
> S1 1 712,431,146 UNC_M_CAS_COUNT.RD # 2279.6 MB/s ... idth_read
>
>
> numactl --membind=0 --cpunodebind=1 ~/builds/master-test/bin/pg_ctl
> -----------------------------------------------------------------------
> S0 1 47,931,019 UNC_M_CAS_COUNT.WR # 153.4 MB/s ... idth_write
> S0 1 20,002,933,380 ns duration_time ...
> S0 1 1,007,386,994 UNC_M_CAS_COUNT.RD # 3223.2 MB/s ... idth_read
> S0 1 20,002,927,341 ns duration_time ...
> S1 1 10,310,201 UNC_M_CAS_COUNT.WR # 33.0 MB/s ... idth_write
> S1 1 714,826,668 UNC_M_CAS_COUNT.RD # 2287.2 MB/s ... idth_read
>
> so there is a little bit of a difference for some stats, but not much.
>
>
> FWIW this is from
>
> perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write
> -a -- sleep 20
>
> while the agg query runs in a loop.
FWIW doing one perf stat per execution is preferable for comparison,
because otherwise you can hide large differences in the total number of memory
accesses if the runtimes for the queries in the two "numa configurations" are
different.
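Concretely, something along the lines of the command I posted upthread, run once per execution instead of sampling a fixed window (paths and eviction step as quoted above; adjust the `--membind`/`--cpunodebind` nodes per configuration):

```shell
# One perf stat per query execution, so the measured memory traffic is
# attributable to exactly one run instead of being averaged over a
# fixed 20s window that may cover a varying number of executions.
for i in 1 2 3; do
    # Reset buffer state and confirm where shared memory landed.
    psql -Xq -c 'SELECT pg_buffercache_evict_all();' \
         -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;'
    # Measure exactly one execution of the aggregate query.
    perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a -- \
        psql -Xq -c 'SELECT sum(abalance) FROM pgbench_accounts;'
done
```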
Greetings,
Andres Freund