Re: Adding basic NUMA awareness

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-09 16:35:13
Message-ID: uaqt63a2kz2g3qk2ujr55gdvmymnak3zsplcgrlh63ekbrytbc@a2wy22p3slge
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote:
> On 7/2/25 13:37, Ashutosh Bapat wrote:
> > On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >>
> >> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
> >>
> >> Minor optimization. Andres noticed we're tracking the tail of buffer
> >> freelist, without using it. So the patch removes that.
> >>
> >
> > The patches for resizing buffers use the lastFreeBuffer to add new
> > buffers to the end of free list when expanding it. But we could as
> > well add it at the beginning of the free list.

Yea, I don't see any point in adding buffers to the tail instead of to the
front. We probably want more recently used buffers at the front, since they
(and the associated BufferDesc) are more likely to be in a CPU cache.

> > This patch seems almost independent of the rest of the patches. Do you
> > need it in the rest of the patches? I understand that those patches
> > don't need to worry about maintaining lastFreeBuffer after this patch.
> > Is there any other effect?
> >
> > If we are going to do this, let's do it earlier so that buffer
> > resizing patches can be adjusted.
> >
>
> My patches don't particularly rely on this bit, it would work even with
> lastFreeBuffer. I believe Andres simply noticed the current code does
> not use lastFreeBuffer, it just maintains is, so he removed that as an
> optimization.

Optimiziation / simplification. When building multiple freelists it was
harder to maintain the tail pointer, and since it was never used...

+1 to just applying that part.

> I don't know how significant is the improvement, but if it's measurable we
> could just do that independently of our patches.

I doubt it's really an improvement in any realistic scenario, but it's also
not a regression in any way, since it's never used...

FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive. I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need have both a NUMA
aware freelist and a NUMA aware clock sweep.

Greetings,

Andres Freund

[1]

A single pg_prewarm of a large relation shows a difference between using the
freelist and not that's around the noise level, whereas 40 parallel
pg_prewarms of seperate relations is over 5x faster when disabling the
freelist.

For the test:
- I modified pg_buffercache_evict_* to put buffers onto the freelist

- Ensured all of shared buffers is allocated by querying
pg_shmem_allocations_numa, as otherwise the workload is dominated by the
kernel zeroing out buffers

- used shared_buffers bigger than the data

- data for single threaded is 9.7GB, data for the parallel case is 40
relations of 610MB each.

- in the single threaded case I pinned postgres to a single core, to make sure
core-to-core variation doesn't play a role

- single threaded case

c=1 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_large');") -t1

concurrent case:

c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mircea Cadariu 2025-07-09 16:37:20 Re: analyze-in-stages post upgrade questions
Previous Message Tom Lane 2025-07-09 16:14:59 Re: 回复:[Internet]Re: [PATCH] Prevent replacement of a function if it's used in an index expression and is not IMMUTABLE