Re: WIP: WAL prefetch (another approach)

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2021-02-10 21:50:33
Message-ID: 20210210215033.GA27507@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Thomas Munro (thomas(dot)munro(at)gmail(dot)com) wrote:
> Rebase attached.

> Subject: [PATCH v15 4/6] Prefetch referenced blocks during recovery.
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index 4b60382778..ac27392053 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -3366,6 +3366,64 @@ include_dir 'conf.d'
[...]
> + <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw">
> + <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>)
> + <indexterm>
> + <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary>
> + </indexterm>
> + </term>
> + <listitem>
> + <para>
> + Whether to prefetch blocks that were logged with full page images,
> + during recovery. Often this doesn't help, since such blocks will not
> + be read the first time they are needed and might remain in the buffer

The "might" above seems slightly confusing- such blocks will remain in
shared buffers until/unless they're forced out, right?

> + pool after that. However, on file systems with a block size larger
> + than
> + <productname>PostgreSQL</productname>'s, prefetching can avoid a
> + costly read-before-write when a blocks are later written.
> + The default is off.

"when a blocks" above doesn't sound quite right, maybe reword this as:

"prefetching can avoid a costly read-before-write when WAL replay
reaches the block that needs to be written."

> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
> index d1c3893b14..c51c431398 100644
> --- a/doc/src/sgml/wal.sgml
> +++ b/doc/src/sgml/wal.sgml
> @@ -720,6 +720,23 @@
> <acronym>WAL</acronym> call being logged to the server log. This
> option might be replaced by a more general mechanism in the future.
> </para>
> +
> + <para>
> + The <xref linkend="guc-recovery-prefetch"/> parameter can
> + be used to improve I/O performance during recovery by instructing
> + <productname>PostgreSQL</productname> to initiate reads
> + of disk blocks that will soon be needed but are not currently in
> + <productname>PostgreSQL</productname>'s buffer pool.
> + The <xref linkend="guc-maintenance-io-concurrency"/> and
> + <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
> + concurrency and distance, respectively. The
> + prefetching mechanism is most likely to be effective on systems
> + with <varname>full_page_writes</varname> set to
> + <varname>off</varname> (where that is safe), and where the working
> + set is larger than RAM. By default, prefetching in recovery is enabled
> + on operating systems that have <function>posix_fadvise</function>
> + support.
> + </para>
> </sect1>

> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c

> @@ -3697,7 +3699,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
> snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
> xlogfname);
> set_ps_display(activitymsg);
> -
> restoredFromArchive = RestoreArchivedFile(path, xlogfname,
> "RECOVERYXLOG",
> wal_segment_size,

> @@ -12566,6 +12585,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
> else
> havedata = false;
> }
> +
> if (havedata)
> {
> /*

Random whitespace change hunks..?

> diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c

> + * The size of the queue is based on the maintenance_io_concurrency
> + * setting. In theory we might have a separate queue for each tablespace,
> + * but it's not clear how that should work, so for now we'll just use the
> + * general GUC to rate-limit all prefetching. The queue has space for up
> + * the highest possible value of the GUC + 1, because our circular buffer
> + * has a gap between head and tail when full.

Seems like "to" is missing- "The queue has space for up *to* the highest
possible value of the GUC + 1" ? Maybe also "between the head and the
tail when full".

> +/*
> + * Scan the current record for block references, and consider prefetching.
> + *
> + * Return true if we processed the current record to completion and still have
> + * queue space to process a new record, and false if we saturated the I/O
> + * queue and need to wait for recovery to advance before we continue.
> + */
> +static bool
> +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher)
> +{
> + DecodedXLogRecord *record = prefetcher->record;
> +
> + Assert(!XLogPrefetcherSaturated(prefetcher));
> +
> + /*
> + * We might already have been partway through processing this record when
> + * our queue became saturated, so we need to start where we left off.
> + */
> + for (int block_id = prefetcher->next_block_id;
> + block_id <= record->max_block_id;
> + ++block_id)
> + {
> + DecodedBkpBlock *block = &record->blocks[block_id];
> + PrefetchBufferResult prefetch;
> + SMgrRelation reln;
> +
> + /* Ignore everything but the main fork for now. */
> + if (block->forknum != MAIN_FORKNUM)
> + continue;
> +
> + /*
> + * If there is a full page image attached, we won't be reading the
> + * page, so you might think we should skip it. However, if the
> + * underlying filesystem uses larger logical blocks than us, it
> + * might still need to perform a read-before-write some time later.
> + * Therefore, only prefetch if configured to do so.
> + */
> + if (block->has_image && !recovery_prefetch_fpw)
> + {
> + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1);
> + continue;
> + }

FPIs in the stream aren't going to just avoid reads when the
filesystem's block size matches PG's- they're also going to avoid
subsequent modifications to the block, provided we don't end up pushing
that block out of shared buffers, rights?

That is, if you have an empty shared buffers and see:

Block 5 FPI
Block 6 FPI
Block 5 Update
Block 6 Update

it seems like, with this patch, we're going to Prefetch Block 5 & 6,
even though we almost certainly won't actually need them.

> + /* Fast path for repeated references to the same relation. */
> + if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode))
> + {
> + /*
> + * If this is a repeat access to the same block, then skip it.
> + *
> + * XXX We could also check for last_blkno + 1 too, and also update
> + * last_blkno; it's not clear if the kernel would do a better job
> + * of sequential prefetching.
> + */
> + if (block->blkno == prefetcher->last_blkno)
> + {
> + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1);
> + continue;
> + }

I'm sure this will help with some cases, but it wouldn't help with the
case that I mention above, as I understand it.

> + {"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS,
> + gettext_noop("Prefetch referenced blocks during recovery"),
> + gettext_noop("Read ahead of the currenty replay position to find uncached blocks.")

extra 'y' at the end of 'current', and "find uncached blocks" might be
misleading, maybe:

"Read out ahead of the current replay position and prefetch blocks."

> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
> index b7fb2ec1fe..4288f2f37f 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -234,6 +234,12 @@
> #checkpoint_flush_after = 0 # measured in pages, 0 disables
> #checkpoint_warning = 30s # 0 disables
>
> +# - Prefetching during recovery -
> +
> +#wal_decode_buffer_size = 512kB # lookahead window used for prefetching
> +#recovery_prefetch = on # whether to prefetch pages logged with FPW
> +#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW

Think this was already mentioned, but the above comments shouldn't be
the same. :)

> From 2f6d690cefc0cad8cbd8b88dbed4d688399c6916 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> Date: Mon, 14 Sep 2020 23:20:55 +1200
> Subject: [PATCH v15 5/6] WIP: Avoid extra buffer lookup when prefetching WAL
> blocks.
>
> Provide a some workspace in decoded WAL records, so that we can remember
> which buffer recently contained we found a block cached in, for later
> use when replaying the record. Provide a new way to look up a
> recently-known buffer and check if it's still valid and has the right
> tag.

"Provide a place in decoded WAL records to remember which buffer we
found a block cached in, to hopefully avoid having to look it up again
when we replay the record. Provide a way to look up a recently-known
buffer and check if it's still valid and has the right tag."

> XXX Needs review to figure out if it's safe or steamrolling over subtleties

... that's a great question. :) Not sure that I can really answer it
conclusively, but I can't think of any reason, given the buffer tag
check that's included, that it would be an issue. I'm glad to see this
though since it addresses some of the concern about this patch slowing
down replay in cases where there are FPIs and checkpoints are less than
the size of shared buffers, which seems much more common than cases
where FPIs have been disabled and/or checkpoints are larger than SB.
Further effort to avoid having likely-unnecessary prefetching done for
blocks which recently had an FPI would further reduce the risk of this
change slowing down replay for common deployments, though I'm not sure
how much of an impact that likely has or what the cost would be to avoid
the prefetching (and it's complicated by hot standby, I imagine...).

Thanks,

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-02-10 21:56:17 Re: [HACKERS] Custom compression methods
Previous Message Jan Wieck 2021-02-10 21:32:21 Re: Extensibility of the PostgreSQL wire protocol