From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding basic NUMA awareness
Date: 2025-08-13 16:36:17
Message-ID: 874435bd-5e25-4a7c-a5d5-8ef0c262d788@vondra.me
Lists: pgsql-hackers
On 8/13/25 17:16, Andres Freund wrote:
> Hi,
>
> On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
>> The patch does a much simpler thing - treat the weight as a "budget",
>> i.e. number of buffers to allocate before proceeding to the "next"
>> partition. So it allocates 55 buffers from P1, then 45 buffers from P2,
>> and then goes back to P1 in a round-robin way. The advantage is that it
>> can do without a PRNG.
>
> I think that's a good plan.
>
>
> A few comments about the clock sweep patch:
>
> - It'd be easier to review if BgBufferSync() weren't basically re-indented
> wholesale. Maybe you could instead move the relevant code to a helper
> function that's called by BgBufferSync() for each clock?
>
True, I'll rework it like that.
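Roughly along these lines - the helper name, the ClockSweeps array and
NUM_CLOCK_PARTITIONS are placeholders for this sketch, not identifiers
from the patch:

    /* one clock-sweep partition (placeholder layout) */
    typedef struct ClockSweep
    {
        pg_atomic_uint32 nextVictimBuffer;
        uint32           numBuffers;
        /* ... per-partition bgwriter bookkeeping ... */
    } ClockSweep;

    /* the current body of BgBufferSync(), restricted to a single clock */
    static bool
    BgBufferSyncPartition(ClockSweep *sweep, WritebackContext *wb_context)
    {
        /* scan ahead of this clock's hand, write out dirty buffers, ... */
        return true;    /* whether this partition would allow hibernation */
    }

    bool
    BgBufferSync(WritebackContext *wb_context)
    {
        bool    hibernate = true;

        for (int i = 0; i < NUM_CLOCK_PARTITIONS; i++)
            hibernate &= BgBufferSyncPartition(&ClockSweeps[i], wb_context);

        return hibernate;
    }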
> - I think choosing a clock sweep partition in every tick would likely show up
> in workloads that do a lot of buffer replacement, particularly if buffers
> in the workload often have a high usagecount (and thus more ticks are used).
> Given that your balancing approach "sticks" with a partition for a while,
> could we perhaps only choose the partition after exhausting that budget?
>
That should be possible, yes. By "exhausting that budget" you mean going
through all the partitions, right?
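If so, a minimal sketch of what I imagine (all names here are invented
for illustration) - each backend sticks with one partition until its
budget is used up, and only then picks the next one:

    static ClockSweep *cur_sweep = NULL;    /* partition we're ticking */
    static uint32      cur_budget = 0;      /* ticks left before switching */

    static inline uint32
    ClockSweepTick(void)
    {
        if (cur_budget == 0)
        {
            /* move to the next partition (round-robin), refill budget */
            cur_sweep = ChooseNextPartition();
            cur_budget = cur_sweep->budget; /* e.g. 55 for P1, 45 for P2 */
        }
        cur_budget--;

        /* advance only this partition's clock hand */
        return ClockSweepTickOne(cur_sweep);
    }

That way the partition is chosen once per budget, not once per tick.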
> - I don't really understand what
>
>> + /*
>> + * Buffers that should have been allocated in this partition (but might
>> + * have been redirected to keep allocations balanced).
>> + */
>> + pg_atomic_uint32 numRequestedAllocs;
>> +
>
> is intended for.
>
> Adding yet another atomic increment for every clock sweep tick seems rather
> expensive...
>
For the balancing (to calculate the budgets), we need to know the number
of allocation requests for each partition, before some of the requests
got redirected to other partitions. We can't use the number of "actual"
allocations. But it seems useful to have both - one to calculate the
budgets, the other to monitor how balanced the result is.
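To illustrate (only the numRequestedAllocs field is from the patch, the
rest of the names are simplified for this sketch):

    /* count the request against the backend's "home" partition ... */
    pg_atomic_fetch_add_u32(&home->numRequestedAllocs, 1);

    /* ... even if balancing redirects the allocation elsewhere */
    part = ChoosePartition(home);
    buf = ClockSweepTickOne(part);

    /* actual allocations, to monitor how balanced the result is */
    pg_atomic_fetch_add_u32(&part->numBufferAllocs, 1);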
I haven't seen the extra atomic in profiles, even on workloads that do a
lot of buffer allocations (e.g. seqscan with datasets > shared buffers).
But if that happens, I think there are ways to mitigate that.
>
> - I wonder if the balancing budgets being relatively low will be good
> enough. It's not too hard to imagine that this frequent "partition choosing"
> will be bad in buffer access heavy workloads. But it's probably the right
> approach until we've measured it being a problem.
>
I don't follow. How would making the budgets higher change any of this?
Anyway, I think choosing the partitions less frequently - e.g. only
after consuming the budget for the current partition, or after going
"full cycle" - would make this a non-issue.
>
> - It'd be interesting to do some very simple evaluation like a single
> pg_prewarm() of a relation that's close to the size of shared buffers and
> verify that we don't end up evicting newly read in buffers. I think your
> approach should work, but verifying that...
>
Will try.
> I wonder if we could make some of this into tests somehow. It's pretty easy
> to break this kind of thing and not notice, as everything just continues to
> work, just a tad slower.
>
Do you mean a test that'd be a part of make check, or a standalone test?
AFAICS any meaningful test would need to be fairly expensive, so
probably not a good fit for make check.
regards
--
Tomas Vondra