Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2024-01-25 09:05:30
Message-ID: CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 23, 2024 at 9:37 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:
> > I still think that anything that requires such checks shouldn't be
> > merged. It's completely bogus to check page contents for validity
> > when we
> > should have metadata telling us which range of the buffers is valid
> > and which
> > not.
>
> The check seems entirely unnecessary, to me. A leftover from v18?
>
> I have attached a new patch (version "19j") to illustrate some of my
> previous suggestions. I didn't spend a lot of time on it so it's not
> ready for commit, but I believe my suggestions are easier to understand
> in code form.

> Note that, right now, it only works for XLogSendPhysical(). I believe
> it's best to just make it work for 1-3 callers that we understand well,
> and we can generalize later if it makes sense.

+1 to do it for XLogSendPhysical() first. Enabling it for others can
just be done as something like the attached v20-0003.

> I'm still not clear on why some callers are reading XLOG_BLCKSZ
> (expecting zeros at the end), and if it's OK to just change them to use
> the exact byte count.

"expecting zeros at the end" - this can't always be true as the WAL
can get flushed after determining the flush ptr before reading it from
the WAL file. FWIW, here's what I've tried previoulsy -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false); added. Which means, the zero-padding
comment around WALRead() call-sites isn't quite right.

/*
* Even though we just determined how much of the page can be validly read
* as 'count', read the whole page anyway. It's guaranteed to be
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,

I think this needs to be discussed separately. If okay, I'll start a new thread.

> Also, if we've detected that the first requested buffer has been
> evicted, is there any value in continuing the loop to see if more
> recent buffers are available? For example, if the requested LSNs range
> over buffers 4, 5, and 6, and 4 has already been evicted, should we try
> to return LSN data from 5 and 6 at the proper offset in the dest
> buffer? If so, we'd need to adjust the API so the caller knows what
> parts of the dest buffer were filled in.

I'd second this capability for now to keep the API simple and clear,
but we can consider expanding it as needed.

I reviewed the v19j and attached v20 patch set:

1.
* The caller must ensure that it's reasonable to read from the WAL buffers,
* i.e. that the requested data is from the current timeline, that we're not
* in recovery, etc.

I still think the XLogReadFromBuffers can just return in any of the
above cases instead of comments. I feel we must assume the caller is
going to ask the WAL from a different timeline and/or in recovery and
design the API to deal with it. Done that way in v20 patch.

2. Fixed some typos, reworded a few comments (i.e. used "current
insert/write position" instead of "Insert/Write pointer" like
elsewhere), ran pgindent.

3.
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.

Removed the above comment before WALRead() since we have that facility
now. Perhaps, we can say the callers can suck data directly from the
WAL buffers using XLogReadFromBuffers. But I have no strong opinion on
this.

4.
+ * Most callers will have already updated LogwrtResult when determining
+ * how far to read, but it's OK if it's out of date. (XXX: is it worth
+ * taking a spinlock to update LogwrtResult and check again before calling
+ * WaitXLogInsertionsToFinish()?)

If the callers use GetFlushRecPtr() to determine how far to read,
LogwrtResult will be *reasonably* latest, otherwise not. If
LogwrtResult is a bit old, XLogReadFromBuffers will call
WaitXLogInsertionsToFinish which will just loop over all insertion
locks and return.

As far as the current WAL readers are concerned, we don't need an
explicit spinlock to determine LogwrtResult because all of them use
GetFlushRecPtr() to determine how far to read. If there's any caller
that's not updating LogwrtResult at all, we can consider reading
LogwrtResult it ourselves in future.

5. I think the two requirements specified at
https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de
still hold with the v19j.

5.1 Never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos) as it does not
exist.
5.2 If the to-be-read LSN is between XLogCtl->LogwrtResult->Write and
XLogBytePosToRecPtr(Insert->CurrBytePos) we need to call
WaitXLogInsertionsToFinish() before copying the data.

+ if (upto > LogwrtResult.Write)
+ {
+ XLogRecPtr writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+ upto = Min(upto, writtenUpto);
+ nbytes = upto - startptr;
+ }

XLogReadFromBuffers ensures the above two with adjusting upto based on
Min(upto, writtenUpto) as WaitXLogInsertionsToFinish returns the
oldest insertion that is still in-progress.

For instance, the current write LSN is 100, current insert LSN is 150
and upto is 200 - we only read upto 150 if startptr is < 150; we don't
read anything if startptr is > 150.

6. I've modified the test module in v20-0002 patch as follows:
6.1 Renamed the module to read_wal_from_buffers stripping "test_"
which otherwise is making the name longer. Longer names can cause
failures on some Windows BF members if the PATH/FILE name is too long.
6.2 Tweaked tests to hit WaitXLogInsertionsToFinish() and upto =
Min(upto, writtenUpto); in XLogReadFromBuffers.

PSA v20 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v20-0001-Add-XLogReadFromBuffers.patch application/octet-stream 9.6 KB
v20-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch application/octet-stream 9.2 KB
v20-0003-Use-XLogReadFromBuffers-in-more-places.patch application/octet-stream 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2024-01-25 09:09:42 Re: remaining sql/json patches
Previous Message Hayato Kuroda (Fujitsu) 2024-01-25 09:05:14 RE: speed up a logical replica setup