Re: WIP: WAL prefetch (another approach)

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-02-14 23:18:15
Message-ID: 20210214231815.GO27507@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Tomas Vondra (tomas(dot)vondra(at)enterprisedb(dot)com) wrote:
> On 2/13/21 10:39 PM, Stephen Frost wrote:
> >* Andres Freund (andres(at)anarazel(dot)de) wrote:
> >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> >>>Yeah, that's a good point. I think it'd make sense to keep track of recent
> >>>FPIs and skip prefetching such blocks. But how exactly should we implement
> >>>that, how many blocks do we need to track? If you get an FPI, how long
> >>>should we skip prefetching of that block?
> >>>
> >>>I don't think the history needs to be very long, for two reasons. Firstly,
> >>>the usual pattern is that we have FPI + several changes for that block
> >>>shortly after it. Secondly, maintenance_io_concurrency limits this naturally
> >>>- after crossing that, redo should place the FPI into shared buffers,
> >>>allowing us to skip the prefetch.
> >>>
> >>>So I think using maintenance_io_concurrency is sufficient. We might track
> >>>more buffers to allow skipping prefetches of blocks that were evicted from
> >>>shared buffers, but that seems like overkill.
> >>>
> >>>However, maintenance_io_concurrency can be quite high, so just a simple
> >>>queue is not very suitable - searching it linearly for each block would be
> >>>too expensive. But I think we can use a simple hash table, tracking
> >>>(relfilenode, block, LSN), over-sized to minimize collisions.
> >>>
> >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
> >>>and whenever we prefetch a block or find an FPI, we simply add the block to
> >>>the array as determined by hash(relfilenode, block)
> >>>
> >>> hashtable[hash(...)] = {relfilenode, block, LSN}
> >>>
> >>>and then when deciding whether to prefetch a block, we look at that one
> >>>position. If the (relfilenode, block) match, we check the LSN and skip the
> >>>prefetch if it's sufficiently recent. Otherwise we prefetch.
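For concreteness, the slot-per-hash scheme described above might look roughly like this in C; the names, table size, and hash function here are all invented for illustration and are not from the actual patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_TABLE_SIZE 256 /* ~2 * maintenance_io_concurrency, oversized */

typedef struct RecentBlockEntry
{
    uint32_t    relfilenode;    /* simplified; real code would use RelFileNode */
    uint32_t    block;
    uint64_t    lsn;            /* LSN at which we prefetched or saw the FPI */
    bool        valid;
} RecentBlockEntry;

static RecentBlockEntry recent_blocks[PREFETCH_TABLE_SIZE];

static uint32_t
recent_block_slot(uint32_t relfilenode, uint32_t block)
{
    /* any cheap mixing function will do for a sketch */
    return (relfilenode * 2654435761u ^ block) % PREFETCH_TABLE_SIZE;
}

/* Record that we prefetched, or saw an FPI for, this block. */
static void
remember_block(uint32_t relfilenode, uint32_t block, uint64_t lsn)
{
    RecentBlockEntry *e = &recent_blocks[recent_block_slot(relfilenode, block)];

    e->relfilenode = relfilenode;
    e->block = block;
    e->lsn = lsn;
    e->valid = true;            /* a collision simply overwrites: a false miss
                                 * only costs one extra prefetch */
}

/* Decide whether to prefetch, by probing the single slot for this block. */
static bool
should_prefetch(uint32_t relfilenode, uint32_t block,
                uint64_t current_lsn, uint64_t recency_horizon)
{
    RecentBlockEntry *e = &recent_blocks[recent_block_slot(relfilenode, block)];

    if (e->valid && e->relfilenode == relfilenode && e->block == block &&
        current_lsn - e->lsn <= recency_horizon)
        return false;           /* seen recently enough: skip the prefetch */
    return true;
}
```

Collisions only ever produce a false miss (an extra prefetch), never a wrongly skipped one, which is why a single-probe over-sized array suffices here.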
> >>
> >>I'm a bit doubtful this is really needed at this point. Yes, the
> >>prefetching will do a buffer table lookup - but it's a lookup that
> >>already happens today. And the patch already avoids doing a second
> >>lookup after prefetching (by optimistically caching the last Buffer id,
> >>and re-checking).
> >
> >I agree that when a page is looked up, and found, in the buffer table
> >that the subsequent caching of the buffer id in the WAL records does a
> >good job of avoiding having to re-do that lookup. However, that isn't
> >the case which was being discussed here or what Tomas's suggestion was
> >intended to address.
> >
> >What I pointed out up-thread and what's being discussed here is what
> >happens when the WAL contains a few FPIs and a few regular WAL records
> >which are mixed up and not in ideal order. When that happens, with this
> >patch, the FPIs will be ignored, the regular WAL records will reference
> >blocks which aren't found in shared buffers (yet) and then we'll both
> >issue pre-fetches for those and end up having spent effort doing a
> >buffer lookup that we'll later re-do.
>
> The question is how common this pattern actually is - I don't know. As
> noted, the non-FPI would have to be fairly close to the FPI, i.e. within the
> wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is. Increasing
that would certainly help to show if there ends up being a degradation
with this patch due to the extra prefetching being done.

> >To address the unnecessary syscalls we really just need to keep track of
> >any FPIs that we've seen between where the point where the prefetching
> >is happening and the point where the replay is being done- once replay
> >has replayed an FPI, our buffer lookup will succeed and we'll cache the
> >buffer that the FPI is at- in other words, only wal_decode_buffer_size
> >amount of WAL needs to be considered.
>
> Yeah, that's essentially what I proposed.

Glad I captured it correctly.

> >We could further leverage this tracking of FPIs, to skip the prefetch
> >syscalls, by caching what later records address the blocks that have
> >FPIs earlier in the queue with the FPI record and then when replay hits
> >the FPI and loads it into shared_buffers, it could update the other WAL
> >records in the queue with the buffer id of the page, allowing us to very
> >likely avoid having to do another lookup later on.
>
> This seems like over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the
already proposed changes to stash the buffer id in the WAL records.
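As a sketch of how that might hang together, using entirely invented names and a toy queue in place of the real decode machinery (the actual patch only caches a buffer id per record; the back-patching step below is the hypothetical part):

```c
#include <assert.h>
#include <stdint.h>

#define NO_CACHED_BUFFER (-1)

/*
 * Toy model of the decode queue: one entry per not-yet-replayed WAL
 * record, with the buffer id the patch already stashes per record.
 */
typedef struct QueuedRecord
{
    uint32_t    relfilenode;
    uint32_t    block;
    int         cached_buffer;  /* NO_CACHED_BUFFER until known */
} QueuedRecord;

/*
 * When replay of an FPI has just placed (relfilenode, block) into
 * shared buffers as buffer_id, push that id into every queued record
 * touching the same block, so their later buffer-table lookups can
 * likely be skipped.
 */
static void
fpi_replayed(QueuedRecord *queue, int nqueued,
             uint32_t relfilenode, uint32_t block, int buffer_id)
{
    for (int i = 0; i < nqueued; i++)
    {
        if (queue[i].relfilenode == relfilenode && queue[i].block == block)
            queue[i].cached_buffer = buffer_id;
    }
}
```

The linear walk is only for illustration; in practice the queue is bounded by wal_decode_buffer_size, so the scan cost is capped.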

> >>I think there's potential for some significant optimization going
> >>forward, but I think it's basically optimization over what we're doing
> >>today. As this is already a nontrivial patch, I'd argue for doing so
> >>separately.
> >
> >This seems like a great optimization, albeit a fair bit of code, for a
> >relatively uncommon use-case, specifically where full page writes are
> >disabled or checkpoints are very large. As that's the case though, I would
> >think it's reasonable to ask that it go out of its way to avoid slowing
> >down the more common configurations, particularly since it's proposed to
> >have it on by default (which I agree with, provided it ends up improving
> >the common cases, which I think the suggestions above would certainly
> >make it more likely to do).
>
> I'm OK to do some benchmarking, but it's not quite clear to me why it
> matters if the checkpoints are smaller than shared buffers. IMO what matters
> is how "localized" the updates are, i.e. how likely it is to hit the same
> page repeatedly (in a short amount of time). Regular pgbench is not very
> suitable for that, but some non-uniform distribution should do the trick, I
> think.

I suppose strictly speaking it'd be
Min(wal_decode_buffer_size, checkpoint_size), but yes, you're right that
it's more about the wal_decode_buffer_size than the checkpoint's size.
Apologies for the confusion. As suggested above, one way to benchmark
this to really see if there's any issue would be to increase
wal_decode_buffer_size to some pretty big size and then compare the
performance vs. unpatched. I'd think that could even be done with
pgbench, so you're not having to arrange for the same pages to get
updated over and over.

> >Perhaps this already improves the common cases and is worth the extra
> >code on that basis, but I don't recall seeing much in the way of
> >benchmarking in this thread for that case- that is, where FPIs are
> >enabled and checkpoints are smaller than shared buffers. Jakub's
> >testing was done with FPWs disabled and Tomas's testing used checkpoints
> >which were much larger than the size of shared buffers on the system
> >doing the replay. While it's certainly good that this patch improves
> >those cases, we should also be looking out for the worst case and make
> >sure that the patch doesn't degrade performance in that case.
>
> I'm with Andres on this. It's fine to leave some possible optimizations on
> the table for the future. And even if some workloads are affected
> negatively, it's still possible to disable the prefetching.

While I'm generally in favor of this argument, that a feature is
particularly important and that it's worth slowing down the common cases
to enable it, I dislike that it's applied inconsistently. I'd certainly
feel better about it if we had actual performance numbers to consider.
I don't doubt the possibility that the extra prefetches just don't
amount to enough to matter, but I have a hard time seeing them as not
having some cost, and without actually measuring it, it's hard to say
what that cost is.

Without looking farther back than the last record, we could end up
repeatedly asking for the same blocks to be prefetched, too:

FPI for block 1
FPI for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2

... etc.

Entirely possible my math is off, but seems like the worst case
situation right now might end up with some 4500 unnecessary prefetch
syscalls even with the proposed default wal_decode_buffer_size of
512k and 56-byte WAL records ((524,288 - 16,384) / 56 / 2 = ~4534).
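A quick sanity check of that arithmetic, using the figures from the text (a 512kB wal_decode_buffer_size minus two 8kB FPIs, filled with 56-byte records alternating between the two blocks):

```c
#include <assert.h>

/*
 * Back-of-the-envelope estimate: how many unnecessary prefetch syscalls
 * could be issued for blocks whose FPIs are earlier in the decode
 * buffer but not yet replayed.  All figures are from the example above,
 * not measured values.
 */
static long
worst_case_prefetches(long buffer_bytes, long fpi_bytes,
                      long record_bytes, long distinct_blocks)
{
    return (buffer_bytes - fpi_bytes) / record_bytes / distinct_blocks;
}
```

Plugging in 524288, 16384, 56 and 2 reproduces the ~4534 figure above.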

Issuing unnecessary prefetches for blocks we've already sent a prefetch
for is arguably a concern even if FPWs are off, but the benefit of doing
the prefetching almost certainly will outweigh that, meaning that
finding a way to address it is something we could certainly do later as
a future improvement. I wouldn't have any issue with that. Just
doesn't seem as clear-cut to me when thinking about the FPW-enabled
case. Ultimately, if you, Andres and Munro are all not concerned about
it and no one else speaks up then I'm not going to pitch a fuss over it
being committed, but, as you said above, it seemed like a good point to
raise for everyone to consider.

Thanks,

Stephen
