From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-08-13 16:36:17
Message-ID: 874435bd-5e25-4a7c-a5d5-8ef0c262d788@vondra.me
Lists: pgsql-hackers
On 8/13/25 17:16, Andres Freund wrote:
> Hi,
>
> On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
>> The patch does a much simpler thing - treat the weight as a "budget",
>> i.e. number of buffers to allocate before proceeding to the "next"
>> partition. So it allocates 55 buffers from P1, then 45 buffers from P2,
>> and then goes back to P1 in a round-robin way. The advantage is that it
>> can do without a PRNG.
>
> I think that's a good plan.
>
>
> A few comments about the clock sweep patch:
>
> - It'd be easier to review if BgBufferSync() weren't basically re-indented
> wholesale. Maybe you could instead move the relevant code to a helper
> function that's called by BgBufferSync() for each clock?
>
True, I'll rework it like that.
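Roughly along these lines - the helper name, the ClockSweeps array and
NUM_CLOCK_PARTITIONS are placeholders for this sketch, not identifiers
from the patch:

    /* one clock-sweep partition (placeholder layout) */
    typedef struct ClockSweep
    {
        pg_atomic_uint32 nextVictimBuffer;
        uint32           numBuffers;
        /* ... per-partition bgwriter bookkeeping ... */
    } ClockSweep;

    /* the current body of BgBufferSync(), restricted to a single clock */
    static bool
    BgBufferSyncPartition(ClockSweep *sweep, WritebackContext *wb_context)
    {
        /* scan ahead of this clock's hand, write out dirty buffers, ... */
        return true;    /* whether this partition would allow hibernation */
    }

    bool
    BgBufferSync(WritebackContext *wb_context)
    {
        bool    hibernate = true;

        for (int i = 0; i < NUM_CLOCK_PARTITIONS; i++)
            hibernate &= BgBufferSyncPartition(&ClockSweeps[i], wb_context);

        return hibernate;
    }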
> - I think choosing a clock sweep partition in every tick would likely show up
> in workloads that do a lot of buffer replacement, particularly if buffers
> in the workload often have a high usagecount (and thus more ticks are used).
> Given that your balancing approach "sticks" with a partition for a while,
> could we perhaps only choose the partition after exhausting that budget?
>
That should be possible, yes. By "exhausting that budget" you mean going
through all the partitions, right?
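If so, a minimal sketch of what I imagine (all names here are invented
for illustration) - each backend sticks with one partition until its
budget is used up, and only then picks the next one:

    static ClockSweep *cur_sweep = NULL;    /* partition we're ticking */
    static uint32      cur_budget = 0;      /* ticks left before switching */

    static inline uint32
    ClockSweepTick(void)
    {
        if (cur_budget == 0)
        {
            /* move to the next partition (round-robin), refill budget */
            cur_sweep = ChooseNextPartition();
            cur_budget = cur_sweep->budget; /* e.g. 55 for P1, 45 for P2 */
        }
        cur_budget--;

        /* advance only this partition's clock hand */
        return ClockSweepTickOne(cur_sweep);
    }

That way the partition is chosen once per budget, not once per tick.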
> - I don't really understand what
>
>> + /*
>> + * Buffers that should have been allocated in this partition (but might
>> + * have been redirected to keep allocations balanced).
>> + */
>> + pg_atomic_uint32 numRequestedAllocs;
>> +
>
> is intended for.
>
> Adding yet another atomic increment for every clock sweep tick seems rather
> expensive...
>
For the balancing (to calculate the budgets), we need to know the number
of allocation requests for each partition, before some of the requests
got redirected to other partitions. We can't use the number of "actual"
allocations. But it seems useful to have both - one to calculate the
budgets, the other to monitor how balanced the result is.
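To illustrate (only the numRequestedAllocs field is from the patch, the
rest of the names are simplified for this sketch):

    /* count the request against the backend's "home" partition ... */
    pg_atomic_fetch_add_u32(&home->numRequestedAllocs, 1);

    /* ... even if balancing redirects the allocation elsewhere */
    part = ChoosePartition(home);
    buf = ClockSweepTickOne(part);

    /* actual allocations, to monitor how balanced the result is */
    pg_atomic_fetch_add_u32(&part->numBufferAllocs, 1);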
I haven't seen the extra atomic in profiles, even on workloads that do a
lot of buffer allocations (e.g. seqscan with datasets > shared buffers).
But if that happens, I think there are ways to mitigate that.
>
> - I wonder if the balancing budgets being relatively low will be good
> enough. It's not too hard to imagine that this frequent "partition choosing"
> will be bad in buffer access heavy workloads. But it's probably the right
> approach until we've measured it being a problem.
>
I don't follow. How would making the budgets higher change any of this?
Anyway, I think choosing the partitions less frequently - e.g. only
after consuming the budget for the current partition, or after going
"full cycle" - would make this a non-issue.
>
> - It'd be interesting to do some very simple evaluation like a single
> pg_prewarm() of a relation that's close to the size of shared buffers and
> verify that we don't end up evicting newly read in buffers. I think your
> approach should work, but verifying that...
>
Will try.
> I wonder if we could make some of this into tests somehow. It's pretty easy
> to break this kind of thing and not notice, as everything just continues to
> work, just a tad slower.
>
Do you mean a test that'd be a part of make check, or a standalone test?
AFAICS any meaningful test would need to be fairly expensive, so
probably not a good fit for make check.
regards
--
Tomas Vondra