Re: WIP: WAL prefetch (another approach)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-02-14 22:38:01
Message-ID: eed83eaa-91e1-91fe-32d0-ece36c0fa810@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/13/21 10:39 PM, Stephen Frost wrote:
> Greetings,
>
> * Andres Freund (andres(at)anarazel(dot)de) wrote:
>> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
>>> Yeah, that's a good point. I think it'd make sense to keep track of recent
>>> FPIs and skip prefetching such blocks. But how exactly should we implement
>>> that, how many blocks do we need to track? If you get an FPI, how long
>>> should we skip prefetching of that block?
>>>
>>> I don't think the history needs to be very long, for two reasons. Firstly,
>>> the usual pattern is that we have FPI + several changes for that block
>>> shortly after it. Secondly, maintenance_io_concurrency limits this naturally
>>> - after crossing that, redo should place the FPI into shared buffers,
>>> allowing us to skip the prefetch.
>>>
>>> So I think using maintenance_io_concurrency is sufficient. We might track
>>> more buffers to allow skipping prefetches of blocks that were evicted from
>>> shared buffers, but that seems like an overkill.
>>>
>>> However, maintenance_io_concurrency can be quite high, so just a simple
>>> queue is not very suitable - searching it linearly for each block would be
>>> too expensive. But I think we can use a simple hash table, tracking
>>> (relfilenode, block, LSN), over-sized to minimize collisions.
>>>
>>> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
>>> and whenever we prefetch a block or find an FPI, we simply add the block to
>>> the array as determined by hash(relfilenode, block)
>>>
>>> hashtable[hash(...)] = {relfilenode, block, LSN}
>>>
>>> and then when deciding whether to prefetch a block, we look at that one
>>> position. If the (relfilenode, block) match, we check the LSN and skip the
>>> prefetch if it's sufficiently recent. Otherwise we prefetch.
>>
>> I'm a bit doubtful this is really needed at this point. Yes, the
>> prefetching will do a buffer table lookup - but it's a lookup that
>> already happens today. And the patch already avoids doing a second
>> lookup after prefetching (by optimistically caching the last Buffer id,
>> and re-checking).
>
> I agree that when a page is looked up, and found, in the buffer table
> that the subsequent cacheing of the buffer id in the WAL records does a
> good job of avoiding having to re-do that lookup. However, that isn't
> the case which was being discussed here or what Tomas's suggestion was
> intended to address.
>
> What I pointed out up-thread and what's being discussed here is what
> happens when the WAL contains a few FPIs and a few regular WAL records
> which are mixed up and not in ideal order. When that happens, with this
> patch, the FPIs will be ignored, the regular WAL records will reference
> blocks which aren't found in shared buffers (yet) and then we'll both
> issue pre-fetches for those and end up having spent effort doing a
> buffer lookup that we'll later re-do.
>

The question is how common this pattern actually is - I don't know. As
noted, the non-FPI would have to be fairly close to the FPI, i.e. within
the wal_decode_buffer_size, to actually cause measurable harm.

> To address the unnecessary syscalls we really just need to keep track of
> any FPIs that we've seen between where the point where the prefetching
> is happening and the point where the replay is being done- once replay
> has replayed an FPI, our buffer lookup will succeed and we'll cache the
> buffer that the FPI is at- in other words, only wal_decode_buffer_size
> amount of WAL needs to be considered.
>

Yeah, that's essentially what I proposed.

> We could further leverage this tracking of FPIs, to skip the prefetch
> syscalls, by cacheing what later records address the blocks that have
> FPIs earlier in the queue with the FPI record and then when replay hits
> the FPI and loads it into shared_buffers, it could update the other WAL
> records in the queue with the buffer id of the page, allowing us to very
> likely avoid having to do another lookup later on.
>

This seems like an over-engineering, at least for v1.

>> I think there's potential for some significant optimization going
>> forward, but I think it's basically optimization over what we're doing
>> today. As this is already a nontrivial patch, I'd argue for doing so
>> separately.
>
> This seems like a great optimization, albeit a fair bit of code, for a
> relatively uncommon use-case, specifically where full page writes are
> disabled or very large checkpoints. As that's the case though, I would
> think it's reasonable to ask that it go out of its way to avoid slowing
> down the more common configurations, particularly since it's proposed to
> have it on by default (which I agree with, provided it ends up improving
> the common cases, which I think the suggestions above would certainly
> make it more likely to do).
>

I'm OK to do some benchmarking, but it's not quite clear to me why does
it matter if the checkpoints are smaller than shared buffers? IMO what
matters is how "localized" the updates are, i.e. how likely it is to hit
the same page repeatedly (in a short amount of time). Regular pgbench is
not very suitable for that, but some non-uniform distribution should do
the trick, I think.

> Perhaps this already improves the common cases and is worth the extra
> code on that basis, but I don't recall seeing much in the way of
> benchmarking in this thread for that case- that is, where FPIs are
> enabled and checkpoints are smaller than shared buffers. Jakub's
> testing was done with FPWs disabled and Tomas's testing used checkpoints
> which were much larger than the size of shared buffers on the system
> doing the replay. While it's certainly good that this patch improves
> those cases, we should also be looking out for the worst case and make
> sure that the patch doesn't degrade performance in that case.
>

I'm with Andres on this. It's fine to leave some possible optimizations
on the table for the future. And even if some workloads are affected
negatively, it's still possible to disable the prefetching.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2021-02-14 23:18:15 Re: WIP: WAL prefetch (another approach)
Previous Message Peter Geoghegan 2021-02-14 22:33:07 Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()