Re: Adding basic NUMA awareness

From: "Burd, Greg" <greg(at)burd(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Adding basic NUMA awareness
Date: 2025-07-10 12:13:43
Message-ID: 628EE169-6901-466E-9191-B33DBAB05B26@burd.me
Lists: pgsql-hackers

> On Jul 9, 2025, at 1:23 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2025-07-09 12:55:51 -0400, Greg Burd wrote:
>> On Jul 9 2025, at 12:35 pm, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>>> FWIW, I've started to wonder if we shouldn't just get rid of the freelist
>>> entirely. While clocksweep is perhaps minutely slower in a single
>>> thread than
>>> the freelist, clock sweep scales *considerably* better [1]. As it's rather
>>> rare to be bottlenecked on clock sweep speed for a single thread
>>> (rather than
>>> IO or memory copy overhead), I think it's worth favoring clock sweep.
>>
>> Hey Andres, thanks for spending time on this. I've worked before on
>> freelist implementations (last one in LMDB) and I think you're onto
>> something. I think it's an innovative idea and that the speed
>> difference will either be lost in the noise or potentially entirely
>> mitigated by avoiding duplicate work.
>
> Agreed. FWIW, just using clock sweep actually makes things like DROP TABLE
> perform better because it doesn't need to maintain the freelist anymore...
>
>
>>> Also needing to switch between getting buffers from the freelist and
>>> the sweep
>>> makes the code more expensive. I think just having the buffer in the sweep,
>>> with a refcount / usagecount of zero would suffice.
>>
>> If you're not already coding this, I'll jump in. :)
>
> My experimental patch is literally a four character addition ;), namely adding
> "0 &&" to the relevant code in StrategyGetBuffer().
>
> Obviously a real patch would need to do some more work than that. Feel free
> to take on that project, I am not planning on tackling that in near term.
>

I started on this last night and am making good progress. Thanks for the inspiration. I'll create a new thread to track the work and cross-reference it here when I have something reasonable to show (hopefully later today).

> There's other things around this that could use some attention. It's not hard
> to see clock sweep be a bottleneck in concurrent workloads - partially due to
> the shared maintenance of the clock hand. A NUMAed clock sweep would address
> that.

Working on it. Other than NUMA-fying clocksweep, there is a function, have_free_buffer(), that might be a tad tricky to re-implement efficiently and/or make NUMA aware. Or maybe I can remove that too? It is used in autoprewarm.c and possibly in other extensions, but nowhere else in core.
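
To make the idea concrete, here is a rough sketch of what a partitioned clock sweep could look like: each partition (say, one per NUMA node) gets its own clock hand on its own cache line, so backends on different nodes never touch the same hand. Everything in it (SketchBuffer, sweep_partition(), the sizes) is a made-up name for illustration, not PostgreSQL code, and it leaves out pinning the victim and taking the buffer header lock.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NBUFFERS     16                 /* hypothetical pool size */
#define NPARTITIONS  2                  /* hypothetical: one per NUMA node */
#define PART_SIZE    (NBUFFERS / NPARTITIONS)
#define MAX_USAGE    5                  /* assumed cap on usage_count */

typedef struct SketchBuffer
{
    _Atomic uint32_t refcount;          /* > 0 means pinned */
    _Atomic uint32_t usage_count;       /* decremented by the sweep */
} SketchBuffer;

/* One clock hand per partition, each aligned to its own cache line. */
typedef struct SketchHand
{
    _Alignas(64) _Atomic uint32_t next_victim;
} SketchHand;

static SketchBuffer buffers[NBUFFERS];
static SketchHand   hands[NPARTITIONS];

/*
 * Sweep one partition and return the index of an unpinned buffer whose
 * usage count is zero, or -1 if none turns up after enough passes.
 * The caller would still have to pin the buffer and recheck it.
 */
static int
sweep_partition(unsigned part)
{
    part %= NPARTITIONS;
    unsigned base = part * PART_SIZE;

    /* MAX_USAGE passes can zero every unpinned buffer; one more finds it. */
    for (int tries = 0; tries < PART_SIZE * (MAX_USAGE + 1); tries++)
    {
        uint32_t tick = atomic_fetch_add_explicit(&hands[part].next_victim,
                                                  1, memory_order_relaxed);
        unsigned idx = base + tick % PART_SIZE;
        SketchBuffer *buf = &buffers[idx];

        if (atomic_load(&buf->refcount) != 0)
            continue;                   /* pinned, skip it */

        uint32_t usage = atomic_load(&buf->usage_count);
        if (usage == 0)
            return (int) idx;           /* candidate victim */

        /* Decrement only if nobody changed it meanwhile (no underflow). */
        atomic_compare_exchange_strong(&buf->usage_count, &usage, usage - 1);
    }
    return -1;
}
```

A backend would call sweep_partition() with its own node's partition first and fall back to other partitions only when that returns -1. Whether that fallback order is the right policy is exactly the kind of thing that needs measuring.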

> However, we also maintain StrategyControl->numBufferAllocs, which is a
> significant contention point and would not necessarily be removed by a
> NUMAification of the clock sweep.

Yep, I noted this counter and its potential for contention too. Fortunately, it seems to be used only so that "bgwriter can estimate the rate of buffer consumption", which to me opens the door to a less accurate partitioned counter, perhaps something lock-free (no mutex/CAS) that is bucketed and then combined when read.

A quick look at bufmgr.c indicates that recent_allocs (which is StrategyControl->numBufferAllocs) is used to track a "moving average" and other voodoo there I've yet to fully grok. Any thoughts on this approximate count approach?
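
A minimal sketch of the bucketed counter I have in mind, in plain C11 atomics rather than PostgreSQL's pg_atomic_* wrappers. The names (AllocBucket, count_buffer_alloc(), read_and_reset_buffer_allocs()), the bucket count, and the 64-byte cache line are all assumptions for illustration. Writers do a relaxed fetch-add on their own bucket; the reader sums and resets the buckets, so the total is only approximate while allocations are in flight.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_ALLOC_BUCKETS 8     /* hypothetical, e.g. one per NUMA node */
#define CACHE_LINE_SIZE   64    /* assumed cache line size */

/* One counter per bucket, aligned so buckets never share a cache line. */
typedef struct AllocBucket
{
    _Alignas(CACHE_LINE_SIZE) _Atomic uint32_t count;
} AllocBucket;

static AllocBucket alloc_buckets[NUM_ALLOC_BUCKETS];

/* Hot path: bump the caller's bucket; no lock and no CAS loop. */
static inline void
count_buffer_alloc(unsigned bucket)
{
    atomic_fetch_add_explicit(&alloc_buckets[bucket % NUM_ALLOC_BUCKETS].count,
                              1, memory_order_relaxed);
}

/*
 * Cold path (the bgwriter side): drain every bucket and return the sum.
 * Buckets are drained one at a time, so allocations racing with the
 * read land in either this total or the next one, never in both.
 */
static inline uint32_t
read_and_reset_buffer_allocs(void)
{
    uint32_t total = 0;

    for (int i = 0; i < NUM_ALLOC_BUCKETS; i++)
        total += atomic_exchange_explicit(&alloc_buckets[i].count,
                                          0, memory_order_relaxed);
    return total;
}
```

The bucket index could be the backend's NUMA node or simply its proc number modulo the bucket count; either way, the point is that the hot path stops bouncing a single cache line between sockets.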

Also, what are your thoughts on updating the algorithm to CLOCK-Pro [1] while I'm there? I guess I'd have to try it out, measure it a lot and see if there are any material benefits. Maybe I'll keep that for a future patch, or at least layer it... back to work!

> Greetings,
>
> Andres Freund

best.

-greg

[1] https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_papers/jiang/jiang_html/html.html
