Re: WIP: WAL prefetch (another approach)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-03-17 16:12:17
Message-ID: 17b3f450-048c-f3a6-7086-259523f33cbc@enterprisedb.com
Lists: pgsql-hackers

On 2/15/21 12:18 AM, Stephen Frost wrote:
> Greetings,
>
> ...
>
>>>> I think there's potential for some significant optimization going
>>>> forward, but I think it's basically optimization over what we're doing
>>>> today. As this is already a nontrivial patch, I'd argue for doing so
>>>> separately.
>>>
>>> This seems like a great optimization, albeit a fair bit of code, for a
>>> relatively uncommon use-case, specifically where full page writes are
>>> disabled or checkpoints are very large. That being the case, I would
>>> think it's reasonable to ask that it go out of its way to avoid slowing
>>> down the more common configurations, particularly since it's proposed to
>>> have it on by default (which I agree with, provided it ends up improving
>>> the common cases, which I think the suggestions above would certainly
>>> make it more likely to do).
>>
>> I'm OK to do some benchmarking, but it's not quite clear to me why it
>> matters if the checkpoints are smaller than shared buffers. IMO what matters
>> is how "localized" the updates are, i.e. how likely it is to hit the same
>> page repeatedly (in a short amount of time). Regular pgbench is not very
>> suitable for that, but some non-uniform distribution should do the trick, I
>> think.
>
> I suppose strictly speaking it'd be
> Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that
> it's more about the wal_decode_buffer_size than the checkpoint's size.
> Apologies for the confusion. As suggested above, one way to benchmark
> this to really see if there's any issue would be to increase
> wal_decode_buffer_size to some pretty big size and then compare the
> performance vs. unpatched. I'd think that could even be done with
> pgbench, so you're not having to arrange for the same pages to get
> updated over and over.
>

What exactly would be the point of such a benchmark? I don't think the
patch does prefetching based on wal_decode_buffer_size - that just says
how far ahead we decode. The prefetch distance is defined by
maintenance_io_concurrency.
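
To illustrate the distinction, here's a toy model (the two GUC names are
real, everything else is made up - this is not the actual patch code):

    #include <stdio.h>

    int
    main(void)
    {
        int     wal_decode_buffer_size = 512 * 1024;    /* bytes */
        int     maintenance_io_concurrency = 10;
        int     record_size = 56;       /* assumed average WAL record */
        int     decoded = 0;
        int     in_flight = 0;

        /* we may decode up to wal_decode_buffer_size ahead of replay ... */
        while (decoded + record_size <= wal_decode_buffer_size)
        {
            decoded += record_size;

            /*
             * ... but only maintenance_io_concurrency prefetches are in
             * flight at any given time, and that is what bounds the I/O
             */
            if (in_flight < maintenance_io_concurrency)
                in_flight++;
        }

        printf("decoded %d bytes ahead, %d prefetches in flight\n",
               decoded, in_flight);
        return 0;
    }

So making the decode buffer huge mostly stresses the decoding, not the
prefetching itself.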

But it's not clear to me what exactly the result would say about the
necessity of the optimization at hand (skipping prefetches for blocks
with a recent FPI). If maintenance_io_concurrency is very high, the
probability that a block is evicted prematurely grows, making the
prefetch useless in general. But how does that say anything about the
problem at hand? Sure, we'll do unnecessary I/O, causing issues, but
that's a bit like complaining that the engine gets very hot when driving
on a highway in reverse.

AFAICS to measure the worst case, you'd need a workload with a lot of
FPIs and very little actual I/O. That means a data set that fits into
memory (either shared buffers or RAM), and short checkpoints. But that's
exactly the case where you don't need prefetching ...
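
If someone wants to measure that worst case anyway, I'd start from
something like this (illustrative only - the numbers are made up and
would need tuning for the hardware):

    # postgresql.conf - maximize FPIs, minimize real I/O
    shared_buffers = 16GB          # comfortably larger than the data set
    checkpoint_timeout = 30s       # the allowed minimum, so lots of FPIs
    max_wal_size = 1GB             # keep checkpoints frequent

    # data set small enough to stay fully cached (~1.5GB at scale 100)
    pgbench -i -s 100 bench
    pgbench -c 16 -j 4 -T 600 bench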

>>> Perhaps this already improves the common cases and is worth the extra
>>> code on that basis, but I don't recall seeing much in the way of
>>> benchmarking in this thread for that case- that is, where FPIs are
>>> enabled and checkpoints are smaller than shared buffers. Jakub's
>>> testing was done with FPWs disabled and Tomas's testing used checkpoints
>>> which were much larger than the size of shared buffers on the system
>>> doing the replay. While it's certainly good that this patch improves
>>> those cases, we should also be looking out for the worst case and make
>>> sure that the patch doesn't degrade performance in that case.
>>
>> I'm with Andres on this. It's fine to leave some possible optimizations on
>> the table for the future. And even if some workloads are affected
>> negatively, it's still possible to disable the prefetching.
>
> While I'm generally in favor of this argument, that a feature is
> particularly important and that it's worth slowing down the common cases
> to enable it, I dislike that it's applied inconsistently. I'd certainly

If you have a workload where this happens to cause issues, you can just
disable it. IMHO that's a perfectly reasonable engineering approach - we
get something that significantly improves 80% of the cases, allow
disabling it for the cases where it might cause issues, and then improve
it in the next version.

> feel better about it if we had actual performance numbers to consider.
> I don't doubt the possibility that the extra prefetches just don't
> amount to enough to matter but I have a hard time seeing them as not
> having some cost and without actually measuring it, it's hard to say
> what that cost is.
>
> Without looking farther back than the last record, we could end up
> repeatedly asking for the same blocks to be prefetched too-
>
> FPI for block 1
> FPI for block 2
> WAL record for block 1
> WAL record for block 2
> WAL record for block 1
> WAL record for block 2
> WAL record for block 1
> WAL record for block 2
>
> ... etc.
>
> Entirely possible my math is off, but seems like the worst case
> situation right now might end up with some 4500 unnecessary prefetch
> syscalls even with the proposed default wal_decode_buffer_size of
> 512k and 56-byte WAL records ((524,288 - 16,384) / 56 / 2 = ~4534).
>

Well, that's a rather extreme workload, I guess. If you really have such
long streaks of WAL records touching the same small set of blocks, you
don't need WAL prefetching at all and you can just disable it. Easy.
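
FWIW the arithmetic in your example seems about right - a quick sanity
check, assuming 8kB pages and your 56-byte records:

    #include <stdio.h>

    int
    main(void)
    {
        long    buffer = 512 * 1024;    /* wal_decode_buffer_size = 512kB */
        long    fpis = 2 * 8192;        /* two full-page images */
        long    record = 56;            /* assumed average WAL record */

        /*
         * The remaining records alternate between the two blocks, so
         * roughly half of them would re-prefetch an already-hinted block.
         */
        printf("%ld\n", (buffer - fpis) / record / 2);  /* prints 4534 */

        return 0;
    }

So yes, ~4500 unnecessary prefetch syscalls in the worst case - but only
for this rather pathological WAL pattern.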

If you have a workload with a small active set, frequent checkpoints
etc., then just don't enable WAL prefetching. What's wrong with that?

> Issuing unnecessary prefetches for blocks we've already sent a prefetch
> for is arguably a concern even if FPWs are off but the benefit of doing
> the prefetching almost certainly will outweigh that and mean that
> finding a way to address it is something we could certainly do later as
> a future improvement. I wouldn't have any issue with that. Just
> doesn't seem as clear-cut to me when thinking about the FPW-enabled
> case. Ultimately, if you, Andres and Munro are all not concerned about
> it and no one else speaks up then I'm not going to pitch a fuss over it
> being committed, but, as you said above, it seemed like a good point to
> raise for everyone to consider.
>

Right, I was just going to point out that the FPIs are not necessary -
what matters is the presence of long streaks of WAL records touching the
same set of blocks. But people with workloads where this is common likely
don't need WAL prefetching at all - the replica can keep up just fine,
because it doesn't need to do much I/O anyway (and if it can't keep up,
prefetching won't help much either). So just don't enable the
prefetching, and there'll be no overhead.
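
And if we do decide to address it later, I don't think it needs anything
fancy - a small cache of recently prefetched blocks would already kill
most of the duplicates. A sketch (all names made up; the real thing would
have to key on the full relfilenode):

    #include <stdio.h>
    #include <stdbool.h>

    #define RECENT_SLOTS 64     /* arbitrary, power of two */

    typedef struct
    {
        unsigned    rnode;      /* simplified stand-in for the relfilenode */
        unsigned    blkno;
    } BlockKey;

    static BlockKey recent[RECENT_SLOTS];

    /* return true if the block was prefetched recently (and remember it) */
    static bool
    seen_recently(unsigned rnode, unsigned blkno)
    {
        unsigned    slot = (rnode ^ blkno) & (RECENT_SLOTS - 1);

        if (recent[slot].rnode == rnode && recent[slot].blkno == blkno)
            return true;

        /*
         * Overwrite whatever was in the slot - a stale entry only costs
         * us one harmless extra prefetch hint.
         */
        recent[slot].rnode = rnode;
        recent[slot].blkno = blkno;
        return false;
    }

    int
    main(void)
    {
        int     issued = 0;
        int     i;

        /* 1000 records alternating between two blocks of one relation */
        for (i = 0; i < 1000; i++)
        {
            if (!seen_recently(1, i % 2))
                issued++;       /* the actual prefetch would go here */
        }

        printf("%d prefetches issued instead of 1000\n", issued);
        return 0;
    }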

If it were up to me, I'd just get the patch committed as is. Delaying the
feature because of concerns that it might have some negative effect in
some cases, when that can simply be mitigated by disabling the feature,
is not really beneficial for our users.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
