Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Andres Freund <andres(at)anarazel(dot)de>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2023-11-04 00:43:48
Message-ID: 20231104004348.zhxzk2p266i3r2z2@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:
> From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
> Date: Thu, 2 Nov 2023 15:10:51 +0000
> Subject: [PATCH v14] Track oldest initialized WAL buffer page
>
> ---
> src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
> src/include/access/xlog.h | 1 +
> 2 files changed, 171 insertions(+)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index b541be8eec..fdf2ef310b 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -504,6 +504,45 @@ typedef struct XLogCtlData
> XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
> int XLogCacheBlck; /* highest allocated xlog buffer index */
>
> + /*
> + * Start address of oldest initialized page in XLog buffers.
> + *
> + * We mainly track oldest initialized page explicitly to quickly tell if a
> + * given WAL record is available in XLog buffers. It also can be used for
> + * other purposes, see notes below.
> + *
> + * OldestInitializedPage gives XLog buffers following properties:
> + *
> + * 1) At any given point of time, pages in XLog buffers array are sorted
> + * in an ascending order from OldestInitializedPage till InitializedUpTo.
> + * Note that we verify this property for assert-only builds, see
> + * IsXLogBuffersArraySorted() for more details.

This is true - but also not, if you look at it a bit too literally. The
buffers in xlblocks itself obviously aren't ordered when wrapping around
between XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).

> + * 2) OldestInitializedPage is monotonically increasing (by virtue of how
> + * postgres generates WAL records), that is, its value never decreases.
> + * This property lets someone read its value without a lock. There's no
> + * problem even if its value is slightly stale i.e. concurrently being
> + * updated. One can still use it for finding if a given WAL record is
> + * available in XLog buffers. At worst, one might get false positives
> + * (i.e. OldestInitializedPage may tell that the WAL record is available
> + * in XLog buffers, but when one actually looks at it, it isn't really
> + * available). This is more efficient and performant than acquiring a lock
> + * for reading. Note that we may not need a lock to read
> + * OldestInitializedPage but we need to update it holding
> + * WALBufMappingLock.

I'd
s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

To update OldestInitializedPage, WALBufMappingLock needs to be held
exclusively, for reading no lock is required.

> + *
> + * 3) One can start traversing XLog buffers from OldestInitializedPage
> + * till InitializedUpTo to list out all valid WAL records and stats, and
> + * expose them via SQL-callable functions to users.
> + *
> + * 4) XLog buffers array is inherently organized as a circular, sorted and
> + * rotated array with OldestInitializedPage as pivot with the property
> + * where LSN of previous buffer page (if valid) is greater than
> + * OldestInitializedPage and LSN of next buffer page (if valid) is greater
> + * than OldestInitializedPage.
> + */
> + XLogRecPtr OldestInitializedPage;

It seems a bit odd to name a LSN containing variable *Page...

> /*
> * InsertTimeLineID is the timeline into which new WAL is being inserted
> * and flushed. It is zero during recovery, and does not change once set.
> @@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
> #define NextBufIdx(idx) \
> (((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
>
> +/* Macro to retreat to previous buffer index. */
> +#define PreviousBufIdx(idx) \
> + (((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inlines and adding assertions that idx
is not bigger than XLogCtl->XLogCacheBlck?

> /*
> * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
> * would hold if it was in cache, the page containing 'recptr'.
> @@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
> static void WALInsertLockRelease(void);
> static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
>
> +#ifdef USE_ASSERT_CHECKING
> +static bool IsXLogBuffersArraySorted(void);
> +#endif
> +
> /*
> * Insert an XLOG record represented by an already-constructed chain of data
> * chunks. This is a low-level routine; to construct the WAL record header
> @@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
> XLogCtl->InitializedUpTo = NewPageEndPtr;
>
> npages++;
> +
> + /*
> + * Try updating oldest initialized XLog buffer page.
> + *
> + * Update it if we are initializing an XLog buffer page for the first
> + * time or if XLog buffers are full and we are wrapping around.
> + */
> + if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
> + XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
> + {
> + Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
> +
> + XLogCtl->OldestInitializedPage = NewPageBeginPtr;
> + }

Wait, isn't this too late? At this point the buffer can already be used by
GetXLogBuffers(). I think thi sneeds to happen at the latest just before
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

Why is it legal to get here with XLogCtl->OldestInitializedPage being invalid?

> +
> +/*
> + * Returns whether or not a given WAL record is available in XLog buffers.
> + *
> + * Note that we don't read OldestInitializedPage under a lock, see description
> + * near its definition in xlog.c for more details.
> + *
> + * Note that caller needs to pass in an LSN known to the server, not a future
> + * or unwritten or unflushed LSN.
> + */
> +bool
> +IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
> +{
> + if (!XLogRecPtrIsInvalid(lsn) &&
> + !XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
> + lsn >= XLogCtl->OldestInitializedPage &&
> + lsn < XLogCtl->InitializedUpTo)
> + {
> + return true;
> + }
> +
> + return false;
> +}
> diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
> index a14126d164..35235010e6 100644
> --- a/src/include/access/xlog.h
> +++ b/src/include/access/xlog.h
> @@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
> extern void SetInstallXLogFileSegmentActive(void);
> extern bool IsInstallXLogFileSegmentActive(void);
> extern void XLogShutdownWalRcv(void);
> +extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);
>
> /*
> * Routines to start, stop, and get status of a base backup.
> --
> 2.34.1
>

> From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
> Date: Thu, 2 Nov 2023 15:36:11 +0000
> Subject: [PATCH v14] Allow WAL reading from WAL buffers
>
> This commit adds WALRead() the capability to read WAL from WAL
> buffers when possible. When requested WAL isn't available in WAL
> buffers, the WAL is read from the WAL file as usual. It relies on
> WALBufMappingLock so that no one replaces the WAL buffer page that
> we're reading from. It skips reading from WAL buffers if
> WALBufMappingLock can't be acquired immediately. In other words,
> it doesn't wait for WALBufMappingLock to be available. This helps
> reduce the contention on WALBufMappingLock.
>
> This commit benefits the callers of WALRead(), that are walsenders
> and pg_walinspect. They can now avoid reading WAL from the WAL
> file (possibly avoiding disk IO). Tests show that the WAL buffers
> hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
> standby, with pgbench --scale=300 --client=32 --time=900. In other
> words, the walsenders avoided 95% of the time reading from the
> file/avoided pread system calls:
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
>
> This commit also benefits when direct IO is enabled for WAL.
> Reading WAL from WAL buffers puts back the performance close to
> that of without direct IO for WAL:
> https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com
>
> This commit also paves the way for the following features in
> future:
> - Improves synchronous replication performance by replicating
> directly from WAL buffers.
> - A opt-in way for the walreceivers to receive unflushed WAL.
> More details here:
> https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de
>
> Author: Bharath Rupireddy
> Reviewed-by: Dilip Kumar, Andres Freund
> Reviewed-by: Nathan Bossart, Kuntal Ghosh
> Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> ---
> src/backend/access/transam/xlog.c | 205 ++++++++++++++++++++++++
> src/backend/access/transam/xlogreader.c | 41 ++++-
> src/include/access/xlog.h | 6 +
> 3 files changed, 250 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index fdf2ef310b..ff5dccaaa7 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
> return cachedPos + ptr % XLOG_BLCKSZ;
> }

>

> +/*
> + * Read WAL from WAL buffers.
> + *
> + * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
> + * 'startptr', on timeline 'tli' and return total read bytes.
> + *
> + * This function returns quickly in the following cases:
> + * - When passed-in timeline is different than server's current insertion
> + * timeline as WAL is always inserted into WAL buffers on insertion timeline.
> + * - When server is in recovery as WAL buffers aren't currently used in
> + * recovery.
> + *
> + * Note that this function reads as much as it can from WAL buffers, meaning,
> + * it may not read all the requested 'count' bytes. Caller must be aware of
> + * this and deal with it.
> + *
> + * Note that this function is not available for frontend code as WAL buffers is
> + * an internal mechanism to the server.
>
> + */
> +Size
> +XLogReadFromBuffers(XLogReaderState *state,
> + XLogRecPtr startptr,
> + TimeLineID tli,
> + Size count,
> + char *buf)
> +{
> + XLogRecPtr ptr;
> + XLogRecPtr cur_lsn;
> + Size nbytes;
> + Size ntotal;
> + Size nbatch;
> + char *batchstart;
> +
> + if (RecoveryInProgress())
> + return 0;
>
> + if (tli != GetWALInsertionTimeLine())
> + return 0;
> +
> + Assert(!XLogRecPtrIsInvalid(startptr));
> +
> + cur_lsn = GetFlushRecPtr(NULL);
> + if (unlikely(startptr > cur_lsn))
> + elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
> + LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be legitimate to
read the WAL before it was fully flushed.

> + /*
> + * Holding WALBufMappingLock ensures inserters don't overwrite this value
> + * while we are reading it. We try to acquire it in shared mode so that
> + * the concurrent WAL readers are also allowed. We try to do as less work
> + * as possible while holding the lock to not impact concurrent WAL writers
> + * much. We quickly exit to not cause any contention, if the lock isn't
> + * immediately available.
> + */
> + if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> + return 0;

That seems problematic - that lock is often heavily contended. We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading the
page, then read the page contents *without* holding a lock, and then check
IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
interim we read bogus data, but that's a bit of a wasted effort.

> + ptr = startptr;
> + nbytes = count; /* Total bytes requested to be read by caller. */
> + ntotal = 0; /* Total bytes read. */
> + nbatch = 0; /* Bytes to be read in single batch. */
> + batchstart = NULL; /* Location to read from for single batch. */

What does "batch" mean?

> + while (nbytes > 0)
> + {
> + XLogRecPtr expectedEndPtr;
> + XLogRecPtr endptr;
> + int idx;
> + char *page;
> + char *data;
> + XLogPageHeader phdr;
> +
> + idx = XLogRecPtrToBufIdx(ptr);
> + expectedEndPtr = ptr;
> + expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
> + endptr = XLogCtl->xlblocks[idx];
> +
> + /* Requested WAL isn't available in WAL buffers. */
> + if (expectedEndPtr != endptr)
> + break;
> +
> + /*
> + * We found WAL buffer page containing given XLogRecPtr. Get starting
> + * address of the page and a pointer to the right location of given
> + * XLogRecPtr in that page.
> + */
> + page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
> + data = page + ptr % XLOG_BLCKSZ;
> +
> + /*
> + * The fact that we acquire WALBufMappingLock while reading the WAL
> + * buffer page itself guarantees that no one else initializes it or
> + * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
> + * need to ensure that we are not reading a page that just got
> + * initialized. For this, we look at the needed page header.
> + */
> + phdr = (XLogPageHeader) page;
> +
> + /* Return, if WAL buffer page doesn't look valid. */
> + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> + phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> + phdr->xlp_tli == tli))
> + break;

I don't think this code should ever encounter a page where this is not the
case? We particularly shouldn't do so silently, seems that could hide all
kinds of problems.

> + /*
> + * Note that we don't perform all page header checks here to avoid
> + * extra work in production builds; callers will anyway do those
> + * checks extensively. However, in an assert-enabled build, we perform
> + * all the checks here and raise an error if failed.
> + */

Why?

> + /* Count what is wanted, not the whole page. */
> + if ((data + nbytes) <= (page + XLOG_BLCKSZ))
> + {
> + /* All the bytes are in one page. */
> + nbatch += nbytes;
> + ntotal += nbytes;
> + nbytes = 0;
> + }
> + else
> + {
> + Size navailable;
> +
> + /*
> + * All the bytes are not in one page. Deduce available bytes on
> + * the current page, count them and continue to look for remaining
> + * bytes.
> + */
s/deducate/deduct/? Perhaps better subtract?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-11-04 00:44:44 Re: Inconsistent use of "volatile" when accessing shared memory?
Previous Message Andres Freund 2023-11-03 23:58:18 Re: Improve WALRead() to suck data directly from WAL buffers when possible