Re: WIP: WAL prefetch (another approach)

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: WAL prefetch (another approach)
Date: 2022-03-21 08:29:16
Message-ID: 20220321082916.ysrlw2xst2tpyc5a@jrouhaud
Lists: pgsql-hackers


On Sun, Mar 20, 2022 at 05:36:38PM +1300, Thomas Munro wrote:
> On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > I'll push 0001 today to let the build farm chew on it for a few days
> > before moving to 0002.
> Clearly is flapping and causing recoveryCheck to
> fail occasionally, but that predates the above commit. I didn't
> follow the existing discussion on that, so I'll try to look into that
> tomorrow.
> Here's a rebase of the 0002 patch, now called 0001

So I finally finished looking at this patch. Here again, AFAICS the feature is
working as expected and I didn't find any problem. I just have some minor
comments, like for the previous patch.

For the docs:

+ Whether to try to prefetch blocks that are referenced in the WAL that
+ are not yet in the buffer pool, during recovery. Valid values are
+ <literal>off</literal> (the default), <literal>on</literal> and
+ <literal>try</literal>. The setting <literal>try</literal> enables
+ prefetching only if the operating system provides the
+ <function>posix_fadvise</function> function, which is currently used
+ to implement prefetching. Note that some operating systems provide the
+ function, but don't actually perform any prefetching.

Is there any reason not to change the default to try? I'm wondering if some
systems say that the function exists but simply raise an error if you actually
try to use it. I think that at least WSL does that for some functions.

+ <para>
+ The <xref linkend="guc-recovery-prefetch"/> parameter can
+ be used to improve I/O performance during recovery by instructing
+ <productname>PostgreSQL</productname> to initiate reads
+ of disk blocks that will soon be needed but are not currently in
+ <productname>PostgreSQL</productname>'s buffer pool.
+ The <xref linkend="guc-maintenance-io-concurrency"/> and
+ <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching
+ concurrency and distance, respectively.
+ By default, prefetching in recovery is disabled.
+ </para>

I think that "improve I/O performance" is a bit misleading; maybe "reduce I/O
wait time" or something like that? Also, I don't know if we need to be that
precise, but maybe we should say that it's the underlying kernel that will
(asynchronously) initiate the reads, and postgres simply notifies it.

+ <para>
+ The <structname>pg_stat_prefetch_recovery</structname> view will contain only
+ one row. It is filled with nulls if recovery is not running or WAL
+ prefetching is not enabled. See <xref linkend="guc-recovery-prefetch"/>
+ for more information.
+ </para>

That's not the implemented behavior as far as I can see: the view just prints
whatever is in SharedStats regardless of the recovery state or the
recovery_prefetch setting (assuming that there's no pending reset request).
Similarly, there's a mention that pg_stat_reset_shared('wal') will reset the
stats, but I don't see anything calling XLogPrefetchRequestResetStats().

Finally, I think we should document which counters in that view are cumulative
(and should get reset) and which are dynamic (and shouldn't get reset).
For the code:

XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum)
+ return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL);
+XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id,
+ RelFileNode *rnode, ForkNumber *forknum,
+ BlockNumber *blknum,
+ Buffer *prefetch_buffer)

That new function is missing comments. XLogRecGetBlockTag's comments should
probably be reworded at the same time.

+ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
bool fetching_ckpt, TimeLineID replayTLI)
XLogRecord *record;
+ XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher);

nit: maybe name it XLogPrefetcherGetReader()?

* containing it (if not open already), and returns true. When end of standby
* mode is triggered by the user, and there is no more WAL available, returns
* false.
+ *
+ * If nonblocking is true, then give up immediately if we can't satisfy the
+ * request, returning XLREAD_WOULDBLOCK instead of waiting.
-static bool
+static XLogPageReadResult
WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,

The comment still mentions returning true/false a couple of times rather than
XLREAD_*; same for at least XLogPageRead().

@@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (lastSourceFailed)
+ /*
+ * Don't allow any retry loops to occur during nonblocking
+ * readahead. Let the caller process everything that has been
+ * decoded already first.
+ */
+ if (nonblocking)

Is that really enough? I'm wondering if the code path in ReadRecord() that
forces lastSourceFailed to false while it actually failed when switching into
archive recovery (xlogrecovery.c around line 3044) can be problematic here.

{"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY,
gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."),
gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."),
512 * 1024, 64 * 1024, INT_MAX,

Should the max be MaxAllocSize?

+ /* Do we have a clue where the buffer might be already? */
+ if (BufferIsValid(recent_buffer) &&
+ mode == RBM_NORMAL &&
+ ReadRecentBuffer(rnode, forknum, blkno, recent_buffer))
+ {
+ buffer = recent_buffer;
+ goto recent_buffer_fast_path;
+ }

Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't?

Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function,
so some comments would be helpful.


+ * data. XLogRecBufferForRedo() cooperates uses information stored in the
+ * decoded record to find buffers efficiently.

I'm not sure what you wanted to say here. Also, I don't see any
XLogRecBufferForRedo() anywhere; I'm assuming it's XLogReadBufferForRedo()?

+ * A callback that reads ahead in the WAL and tries to initiate one IO.
+ */
+static LsnReadQueueNextStatus
+XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)

Should there be a bit more comments about what this function is supposed to
do?
I'm wondering if it's a bit overkill to implement this as a callback. Do you
have near future use cases in mind? For now no other code could use the
infrastructure at all as the lrq is private, so some changes will be needed to
make it truly configurable anyway.

If we keep it as a callback, I think it would make sense to extract some part,
like the main prefetch filters / global-limit logic, so other possible
implementations can use it if needed. It would also help to reduce this
function a bit, as it's somewhat long.
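For illustration, if the lrq were ever made public, the contract a second
implementation would have to satisfy boils down to a single callback. The
types below are simplified stand-ins (XLogRecPtr reduced to a uint64, status
names mirroring the patch, the demo callback entirely invented):

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Status values mirroring the patch's LsnReadQueueNextStatus. */
typedef enum
{
	LRQ_NEXT_NO_IO,				/* advanced without starting an I/O */
	LRQ_NEXT_IO,				/* advanced and initiated one I/O */
	LRQ_NEXT_AGAIN				/* cannot advance yet; try again later */
} LsnReadQueueNextStatus;

/* The one entry point an alternative prefetcher would need to provide;
 * the shared filter/global-limit logic could live in a helper that every
 * implementation calls from its own callback. */
typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
													   XLogRecPtr *lsn);

/* Trivial example callback: reports one record per call, starts no I/O. */
LsnReadQueueNextStatus
demo_next_block(uintptr_t lrq_private, XLogRecPtr *lsn)
{
	XLogRecPtr *cursor = (XLogRecPtr *) lrq_private;

	*lsn = (*cursor)++;
	return LRQ_NEXT_NO_IO;
}
```

Extracting the filter logic into such a helper would also shrink
XLogPrefetcherNextBlock() regardless of whether the callback stays private.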

Also, about those filters:

+ if (rmid == RM_XLOG_ID)
+ {
+ if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
+ record_type == XLOG_END_OF_RECOVERY)
+ {
+ /*
+ * These records might change the TLI. Avoid potential
+ * bugs if we were to allow "read TLI" and "replay TLI" to
+ * differ without more analysis.
+ */
+ prefetcher->no_readahead_until = record->lsn;
+ }
+ }

Should there be a note that it's still ok to process this record in the loop
just after, as it won't contain any prefetchable data, or simply jump to the
end of that loop?

+ * Increment a counter in shared memory. This is equivalent to *counter++ on a
+ * plain uint64 without any memory barrier or locking, except on platforms
+ * where readers can't read uint64 without possibly observing a torn value.
+ */
+static inline void
+XLogPrefetchIncrement(pg_atomic_uint64 *counter)
+ Assert(AmStartupProcess() || !IsUnderPostmaster);
+ pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);

I'm curious about this one. Is it to avoid expensive locking on platforms that
don't have a lockless pg_atomic_fetch_add_u64?

Also, it's only correct because there can only be a single prefetcher, so you
can't have concurrent increments of the same counter, right?
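For reference, here's a standalone C11 analogue of that single-writer
increment, with the same safety argument; this is an illustration, not the
patch's code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Analogue of XLogPrefetchIncrement(): a relaxed load plus store rather
 * than an atomic fetch-add.  This is only correct because exactly one
 * process (the startup process) ever writes the counter; concurrent
 * readers merely need an untorn 64-bit value, which the atomic type
 * guarantees even on platforms without native 64-bit loads. */
static _Atomic uint64_t demo_counter;

void
single_writer_increment(void)
{
	uint64_t	v = atomic_load_explicit(&demo_counter, memory_order_relaxed);

	atomic_store_explicit(&demo_counter, v + 1, memory_order_relaxed);
}

uint64_t
read_counter(void)
{
	return atomic_load_explicit(&demo_counter, memory_order_relaxed);
}
```

With two writers this would lose updates (both could read the same old value),
which is exactly why the Assert on AmStartupProcess() matters.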


This function could use the new SetSingleFuncCall() function introduced in

And finally:

diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..0a6c7bd83e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -241,6 +241,11 @@
#max_wal_size = 1GB
#min_wal_size = 80MB

+# - Prefetching during recovery -
+#wal_decode_buffer_size = 512kB # lookahead window used for prefetching

This one should be documented as "(change requires restart)"
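Assuming the formatting conventions of the neighbouring sample-file entries,
that would look something like:

```
# - Prefetching during recovery -

#wal_decode_buffer_size = 512kB		# lookahead window used for prefetching
					# (change requires restart)
```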
