Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2023-11-02 17:08:38
Message-ID: CALj2ACV123Ke6wn05QCnCgJhFpcarGCv7sKM1p39t+qpUNHUgQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 28, 2023 at 2:22 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> I think I see what you are saying: WALRead() is at a lower level than
> the XLogReaderRoutine callbacks, because it's used by the .page_read
> callbacks.
>
> That makes sense, but my first interpretation was that WALRead() is
> above the XLogReaderRoutine callbacks because it calls .segment_open
> and .segment_close. To me that sounds like a layering violation, but it
> exists today without your patch.

Right. WALRead() is a common function used by most, if not all,
page_read callbacks. Typically, a page_read callback has two parts:
first, determine the target/start LSN; second, read the WAL (via
WALRead(), for instance).
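
To make that two-part shape concrete, here is a rough sketch loosely
modeled on read_local_xlog_page(); the waiting-for-WAL and
timeline-switch logic is omitted, so treat it as an outline rather
than code from the tree:

/*
 * Sketch only: the typical two-part shape of an XLogReaderRoutine
 * page_read callback, loosely modeled on read_local_xlog_page().
 * Waiting for WAL and timeline-switch handling are omitted.
 */
static int
sketch_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
                      int reqLen, XLogRecPtr targetRecPtr, char *cur_page)
{
    TimeLineID   tli;
    XLogRecPtr   read_upto = GetFlushRecPtr(&tli);  /* part 1: target LSN */
    int          count;
    WALReadError errinfo;

    if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
        count = XLOG_BLCKSZ;                        /* full page available */
    else if (targetPagePtr + reqLen <= read_upto)
        count = (int) (read_upto - targetPagePtr);  /* partial page */
    else
        return -1;                                  /* WAL not there yet */

    /* part 2: read the WAL, via the common WALRead() helper */
    if (!WALRead(state, cur_page, targetPagePtr, count, tli, &errinfo))
        WALReadRaiseError(&errinfo);

    return count;
}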

> I suppose the question is: should reading from the WAL buffers an
> intentional thing that the caller does explicitly by specific callers?
> Or is it an optimization that should be hidden from the caller?
>
> I tend toward the former, at least for now.

In my view, it's an optimization that should be hidden from the caller.

> I suspect that when we do
> some more interesting things, like replicating unflushed data, we will
> want reading from buffers to be a separate step, not combined with
> WALRead(). After things in this area settle a bit then we might want to
> refactor and combine them again.

As said previously, the new XLogReadFromBuffers() function is generic
and extensible: anyone with a target/start LSN (corresponding to
flushed or written-but-not-yet-flushed WAL) and a TLI can call it to
read from WAL buffers. The patch currently uses it only where it makes
sense, i.e. in WALRead(), but it is equally usable in, say, a
page_read callback that reads unflushed WAL from WAL buffers.
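
For reference, this is the shape I mean; the prototype below is
assumed from the description above rather than copied from the v14
patch:

/*
 * Assumed prototype, for illustration only (the v14 patch may differ in
 * details): read up to 'count' bytes of WAL starting at 'startptr' on
 * timeline 'tli' out of the WAL buffers into 'buf', returning how many
 * bytes were actually found there.
 */
extern Size XLogReadFromBuffers(char *buf, XLogRecPtr startptr,
                                Size count, TimeLineID tli);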

> > If someone wants to read unflushed WAL, the typical way to implement
> > it is to write a new page_read callback
> > read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> > similar without WALRead() but just the new function
> > XLogReadFromBuffers to read from WAL buffers and return.
>
> Then why is it being called from WALRead() at all?

The patch focuses on reading flushed WAL from WAL buffers when
available, not unflushed WAL at all; that's why the call sits in
WALRead(), before the read from the WAL file via pg_pread().

My point is that XLogReadFromBuffers() also enables reading unflushed
WAL from WAL buffers, should future features want it (for example,
replicating directly from WAL buffers as a new opt-in feature to
improve replication performance).
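
To show where the call sits, here is a simplified sketch of the
intended WALRead() flow with the patch applied. Segment opening and
switching, the page-by-page loop, and full error reporting are left
out, and it assumes the right segment is already open, so the offset
math is illustrative only:

/*
 * Simplified sketch of WALRead() with the patch: try the WAL buffers
 * first, then fall back to pg_pread() on the WAL file for whatever was
 * not found there.  Segment handling and error reporting are omitted.
 */
static bool
sketch_WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
               Size count, TimeLineID tli, WALReadError *errinfo)
{
    Size    nread = XLogReadFromBuffers(buf, startptr, count, tli);

    if (nread == count)
        return true;            /* everything came from WAL buffers */

    /* Fall back to the WAL file for the remainder. */
    startptr += nread;
    buf += nread;
    count -= nread;

    if (pg_pread(state->seg.ws_file, buf, count,
                 (off_t) XLogSegmentOffset(startptr,
                                           state->segcxt.ws_segsize)) !=
        (ssize_t) count)
    {
        errinfo->wre_errno = errno;
        return false;
    }
    return true;
}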

> > I prefer to keep the partial hit handling as-is just
> > in case:
>
> So a "partial hit" is essentially a narrow race condition where one
> page is read from buffers, and it's valid; and by the time it gets to
> the next page, it has already been evicted (along with the previously
> read page)?
>
> In other words, I think you are describing a case where
> eviction is happening faster than the memcpy()s in a loop, which is
> certainly possible due to scheduling or whatever, but doesn't seem like
> the common case.
>
> The case I'd expect for a partial read is when the startptr points to
> an evicted page, but some later pages in the requested range are still
> present in the buffers.
>
> I'm not really sure whether either of these cases matters, but if we
> implement one and not the other, there should be some explanation.

At any given point in time, the WAL buffer pages form a circular array
sorted in ascending order from OldestInitializedPage to InitializedUpTo
(new pages are inserted at the InitializedUpTo end). In addition, the
current patch holds WALBufMappingLock while reading the buffer pages,
so nobody can replace them until the read finishes. Therefore, neither
of the two partial-hit cases described above can happen: when reading
multiple pages, if the first page is found in the buffers, the
subsequent pages must be there too, because of the circular, sorted
layout of the WAL buffer page array.
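
A rough sketch of that locking pattern for a single page follows; the
XLogCtl fields and helpers are real xlog.c internals, but the
surrounding code and variables are illustrative only:

/*
 * Illustrative only: read one WAL buffer page while holding
 * WALBufMappingLock in shared mode, so the page cannot be replaced
 * mid-copy.  'startptr', 'buf' and 'nbytes' are assumed to be set up
 * by the caller.
 */
LWLockAcquire(WALBufMappingLock, LW_SHARED);

idx = XLogRecPtrToBufIdx(startptr);
expectedEndPtr = startptr - (startptr % XLOG_BLCKSZ) + XLOG_BLCKSZ;

if (XLogCtl->xlblocks[idx] == expectedEndPtr)
{
    /* The page we want is (still) in the buffers; copy it under the lock. */
    memcpy(buf,
           XLogCtl->pages + idx * (Size) XLOG_BLCKSZ + startptr % XLOG_BLCKSZ,
           nbytes);
}

LWLockRelease(WALBufMappingLock);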

Here's an illustration with a WAL buffer circular array of 10
elements, shown as (idx, LSN) pairs: {(0, 160), (1, 170), (2, 180),
(3, 90), (4, 100), (5, 110), (6, 120), (7, 130), (8, 140), (9, 150)},
with InitializedUpTo currently pointing to the page at LSN 180 (idx 2).
- Read 6 pages starting at LSN 80: nothing is read from the WAL
buffers, because the page at LSN 80 doesn't exist there, even though
the other 5 pages starting at LSN 90 do.
- Read 6 pages starting at LSN 90: all the pages exist and are read
from the WAL buffers.
- Read 6 pages starting at LSN 150: note that WAL is currently flushed
only up to the page at LSN 180, and callers won't normally ask to read
unflushed WAL. If a caller asks for an unflushed WAL read,
intentionally or not, XLogReadFromBuffers() reads only the 4 pages
from LSN 150 up to LSN 180 and leaves the remaining 2 pages for the
caller to deal with (see the sketch just below). This is the partial
hit that can happen, which is why retaining the partial-hit code in
WALRead() as-is in the current patch is needed, in my view.
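
Here is a minimal sketch of the clamping behind that third case,
assuming access to XLogCtl inside xlog.c; names other than
InitializedUpTo are illustrative:

/*
 * Illustrative clamp for the third example above: read from the buffers
 * only up to InitializedUpTo and leave the rest to the caller.
 */
XLogRecPtr  upto = Min(startptr + count, XLogCtl->InitializedUpTo);
Size        nbytes_from_buffers = (upto > startptr) ? (Size) (upto - startptr) : 0;

/*
 * Copy 'nbytes_from_buffers' bytes from the WAL buffers; the remaining
 * 'count - nbytes_from_buffers' bytes are the caller's responsibility
 * (the partial hit handled by WALRead()).
 */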

> > Yes, I proposed that idea in another thread -
> > https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com
> > .
> > If that looks okay, I can add it to the next version of this patch
> > set.
>
> The code in the email above still shows a call to:
>
> /*
> * Requested WAL is available in WAL buffers, so recheck the
> existence
> * under the WALBufMappingLock and read if the page still exists,
> otherwise
> * return.
> */
> LWLockAcquire(WALBufMappingLock, LW_SHARED);
>
> and I don't think that's required. How about something like:
>
> endptr1 = XLogCtl->xlblocks[idx];
> /* Requested WAL isn't available in WAL buffers. */
> if (expectedEndPtr != endptr1)
> break;
>
> pg_read_barrier();
> ...
> memcpy(buf, data, bytes_read_this_loop);
> ...
> pg_read_barrier();
> endptr2 = XLogCtl->xlblocks[idx];
> if (expectedEndPtr != endptr2)
> break;
>
> ntotal += bytes_read_this_loop;
> /* success; move on to next page */
>
> I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
> I suppose because xlblocks are not guaranteed to be 64-bit aligned?
> Should we just align it to 64 bits so we can use atomics? (I don't
> think it matters in this case, but atomics would avoid the need to
> think about it.)

WALBufMappingLock protects both xlblocks and the WAL buffer pages
[1][2]. I'm not sure how memory barriers alone, without
WALBufMappingLock, prevent writers from replacing WAL buffer pages
while readers are reading them. FWIW, GetXLogBuffer() reads the
xlblocks value without the lock, but it confirms the WAL's existence
under the lock and gets the WAL buffer page under the lock [3].

To reiterate how WALBufMappingLock is used in this patch: the idea is
to determine whether the WAL at a given LSN exists in the WAL buffers
without acquiring WALBufMappingLock; if it exists, acquire the lock in
shared mode, read from the WAL buffers, and then release it. WAL
buffer pages are organized as a circular array with InitializedUpTo
pointing at the latest filled WAL buffer page. If we also track the
oldest filled WAL buffer page (OldestInitializedPage), then at any
given point in time the elements of the circular array are sorted in
ascending order from OldestInitializedPage to InitializedUpTo. With
this approach, no lock is required to know whether the WAL at a given
LSN exists in the WAL buffers; we can simply check lsn >=
XLogCtl->OldestInitializedPage && lsn < XLogCtl->InitializedUpTo. I
proposed this idea here:
https://www.postgresql.org/message-id/CALj2ACVgi6LirgLDZh%3DFdfdvGvKAD%3D%3DWTOSWcQy%3DAtNgPDVnKw%40mail.gmail.com
I've pulled that patch in here as 0001 to showcase its use for this
feature.
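
In code, that lock-free existence test amounts to the following
(a sketch assuming the 0001 patch adds XLogCtl->OldestInitializedPage;
only the subsequent copy takes the lock):

/* Sketch: lock-free check for whether WAL at 'lsn' is in the WAL buffers. */
if (lsn >= XLogCtl->OldestInitializedPage && lsn < XLogCtl->InitializedUpTo)
{
    /*
     * Likely present: acquire WALBufMappingLock in shared mode, copy the
     * pages, release the lock.
     */
}
else
{
    /* Definitely not in WAL buffers: go straight to the WAL file. */
}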

> * Why check that startptr is earlier than the flush pointer, but not
> startptr+count? Also, given that we intend to support reading unflushed
> data, it would be good to comment that the function still works past
> the flush pointer, and that it will be safe to remove later (right?).

Checking startptr + count won't work; see the comment below. The
actual flushed LSN may not always be greater than startptr + count,
because callers read whole, zero-padded pages. The GetFlushRecPtr()
check in XLogReadFromBuffers() is similar to what pg_walinspect does
in GetCurrentLSN().

/*
 * Even though we just determined how much of the page can be validly read
 * as 'count', read the whole page anyway. It's guaranteed to be
 * zero-padded up to the page boundary if it's incomplete.
 */
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
             &errinfo))
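
For clarity, the check in question looks roughly like this (names
assumed from the description above, not quoted from v14): only
startptr is compared against the flush pointer, precisely because
callers such as the one above may ask for a whole zero-padded page
whose tail lies beyond the flushed LSN.

/*
 * Sketch of the start-pointer check: refuse to read from the buffers
 * only if the very first requested byte is past the flush pointer.
 * The tail of the request may legitimately extend beyond it.
 */
if (startptr > GetFlushRecPtr(NULL))
    return 0;                   /* nothing readable from WAL buffers yet */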

> * An "Assert(!RecoveryInProgress())" would be more appropriate than
> an error. Perhaps we will remove even that check in the future to
> achieve cascaded replication of unflushed data.
>
> In this case, I think some of those errors are not really necessary
> anyway, though:
>
> * XLogReadFromBuffers shouldn't take a timeline argument just to
> demand that it's always equal to the wal insertion timeline.

I've changed XLogReadFromBuffers() to return as if nothing was read (a
cache miss) when the server is in recovery or the requested TLI is not
the server's current insertion TLI. That is better than failing with
ERRORs, and it means callers don't need any checks for recovery or the
TLI.
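
A sketch of that behaviour, with the surrounding code assumed from the
discussion rather than quoted from the patch:

/*
 * Sketch: treat recovery and a non-insertion TLI as a plain cache miss
 * instead of raising an ERROR, so callers need no special checks.
 */
if (RecoveryInProgress() || tli != GetWALInsertionTimeLine())
    return 0;                   /* pretend nothing was found in WAL buffers */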

PSA v14 patch set.

[1]
* WALBufMappingLock: must be held to replace a page in the WAL buffer cache.

[2]
* and xlblocks values certainly do. xlblocks values are protected by
* WALBufMappingLock.
*/
char *pages; /* buffers for unwritten XLOG pages */
XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */

[3]
* However, we don't hold a lock while we read the value. If someone has
* just initialized the page, it's possible that we get a "torn read" of
* the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
* that case we will see a bogus value. That's ok, we'll grab the mapping
* lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
* the page we're looking for. But it means that when we do this unlocked
* read, we might see a value that appears to be ahead of the page we're
* looking for. Don't PANIC on that, until we've verified the value while
* holding the lock.
*/
expectedEndPtr = ptr;
expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;

endptr = XLogCtl->xlblocks[idx];

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v14-0001-Track-oldest-initialized-WAL-buffer-page.patch application/octet-stream 8.5 KB
v14-0002-Allow-WAL-reading-from-WAL-buffers.patch application/octet-stream 11.1 KB
v14-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch application/octet-stream 9.6 KB
