Re: WIP: WAL prefetch (another approach)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-02-11 23:42:04
Message-ID: 103ec76d-8bbc-09e1-460c-8785751431b5@enterprisedb.com
Lists: pgsql-hackers

On 2/10/21 10:50 PM, Stephen Frost wrote:
>
> ...
>
>> +/*
>> + * Scan the current record for block references, and consider prefetching.
>> + *
>> + * Return true if we processed the current record to completion and still have
>> + * queue space to process a new record, and false if we saturated the I/O
>> + * queue and need to wait for recovery to advance before we continue.
>> + */
>> +static bool
>> +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
>> +{
>> + DecodedXLogRecord *record = prefetcher->record;
>> +
>> + Assert(!XLogPrefetcherSaturated(prefetcher));
>> +
>> + /*
>> + * We might already have been partway through processing this record when
>> + * our queue became saturated, so we need to start where we left off.
>> + */
>> + for (int block_id = prefetcher->next_block_id;
>> + block_id <= record->max_block_id;
>> + ++block_id)
>> + {
>> + DecodedBkpBlock *block = &record->blocks[block_id];
>> + PrefetchBufferResult prefetch;
>> + SMgrRelation reln;
>> +
>> + /* Ignore everything but the main fork for now. */
>> + if (block->forknum != MAIN_FORKNUM)
>> + continue;
>> +
>> + /*
>> + * If there is a full page image attached, we won't be reading the
>> + * page, so you might think we should skip it. However, if the
>> + * underlying filesystem uses larger logical blocks than us, it
>> + * might still need to perform a read-before-write some time later.
>> + * Therefore, only prefetch if configured to do so.
>> + */
>> + if (block->has_image && !recovery_prefetch_fpw)
>> + {
>> + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
>> + continue;
>> + }
>
> FPIs in the stream aren't going to just avoid reads when the
> filesystem's block size matches PG's- they're also going to avoid
> subsequent modifications to the block, provided we don't end up pushing
> that block out of shared buffers, right?
>
> That is, if you have an empty shared buffers and see:
>
> Block 5 FPI
> Block 6 FPI
> Block 5 Update
> Block 6 Update
>
> it seems like, with this patch, we're going to Prefetch Block 5 & 6,
> even though we almost certainly won't actually need them.
>

Yeah, that's a good point. I think it'd make sense to keep track of
recent FPIs and skip prefetching such blocks. But how exactly should we
implement that? How many blocks do we need to track, and for how long
after seeing an FPI should we skip prefetching that block?

I don't think the history needs to be very long, for two reasons.
Firstly, the usual pattern is an FPI followed shortly by several changes
to the same block. Secondly, maintenance_io_concurrency limits this
naturally - once the prefetch distance exceeds it, redo should already
have placed the FPI into shared buffers, allowing us to skip the
prefetch.

So I think using maintenance_io_concurrency is sufficient. We might
track more buffers to allow skipping prefetches of blocks that were
evicted from shared buffers, but that seems like overkill.

However, maintenance_io_concurrency can be quite high, so a simple
queue is not very suitable - searching it linearly for each block would
be too expensive. But I think we can use a simple hash table tracking
(relfilenode, block, LSN), oversized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency)
elements, and whenever we prefetch a block or find an FPI, we simply add
the block to the array at the position determined by hash(relfilenode,
block):

hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip
the prefetch if it's sufficiently recent. Otherwise we prefetch.

We may issue some extra prefetches due to collisions, but I think
that's fine. There should not be very many of them, thanks to the hash
table being oversized.

The good thing is that this is a simple, fixed-size data structure -
there's no need for allocations etc.
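
To make this a bit more concrete, here's a minimal sketch of what I
have in mind. The names and constants (fpi_filter, FPI_FILTER_SIZE,
FPI_FILTER_LSN_WINDOW) are made up, and in practice the array would be
sized from maintenance_io_concurrency at startup rather than fixed at
compile time. It assumes the usual backend types from
storage/relfilenode.h, storage/block.h and access/xlogdefs.h.

/*
 * Sketch only - names and constants are invented for illustration,
 * not taken from the patch.
 */
typedef struct FpiFilterEntry
{
	RelFileNode	rnode;		/* relation containing the block */
	BlockNumber	blkno;		/* block number in the main fork */
	XLogRecPtr	lsn;		/* LSN at which we saw the FPI */
} FpiFilterEntry;

#define FPI_FILTER_SIZE			64				/* ~ 2 * maintenance_io_concurrency */
#define FPI_FILTER_LSN_WINDOW	(1024 * 1024)	/* "sufficiently recent" */

static FpiFilterEntry fpi_filter[FPI_FILTER_SIZE];

static inline uint32
fpi_filter_hash(RelFileNode rnode, BlockNumber blkno)
{
	/* any cheap mix of the fields will do for a sketch */
	return (rnode.relNode ^ (rnode.dbNode << 1) ^ blkno) % FPI_FILTER_SIZE;
}

/* Remember that this block had an FPI at the given LSN. */
static void
fpi_filter_insert(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
{
	FpiFilterEntry *entry = &fpi_filter[fpi_filter_hash(rnode, blkno)];

	/* simply overwrite whatever was there - collisions are acceptable */
	entry->rnode = rnode;
	entry->blkno = blkno;
	entry->lsn = lsn;
}

/* Should we skip prefetching this block because of a recent FPI? */
static bool
fpi_filter_check(RelFileNode rnode, BlockNumber blkno, XLogRecPtr current_lsn)
{
	FpiFilterEntry *entry = &fpi_filter[fpi_filter_hash(rnode, blkno)];

	return RelFileNodeEquals(entry->rnode, rnode) &&
		   entry->blkno == blkno &&
		   current_lsn - entry->lsn < FPI_FILTER_LSN_WINDOW;
}

The prefetcher would then call fpi_filter_insert() whenever it sees an
FPI, and fpi_filter_check() before issuing a prefetch.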

>> + /* Fast path for repeated references to the same relation. */
>> + if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
>> + {
>> + /*
>> + * If this is a repeat access to the same block, then skip it.
>> + *
>> + * XXX We could also check for last_blkno + 1 too, and also update
>> + * last_blkno; it's not clear if the kernel would do a better job
>> + * of sequential prefetching.
>> + */
>> + if (block->blkno == prefetcher->last_blkno)
>> + {
>> + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
>> + continue;
>> + }
>
> I'm sure this will help with some cases, but it wouldn't help with the
> case that I mention above, as I understand it.
>

It won't, but it's a pretty effective check. I've done some experiments
recently, and with random pgbench this eliminates ~15% of prefetches.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
