Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2023-11-03 07:05:06
Message-ID: 2ef04861c0f77e7ae78b703770cc2bbbac3d85e6.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > I suppose the question is: should reading from the WAL buffers an
> > intentional thing that the caller does explicitly by specific
> > callers?
> > Or is it an optimization that should be hidden from the caller?
> >
> > I tend toward the former, at least for now.
>
> Yes, it's an optimization that must be hidden from the caller.

As I said, I tend toward the opposite: that specific callers should
read from the buffers explicitly in the cases where it makes sense.

I don't think this is the most important point right now though, let's
sort out the other details.

> >
> At any given point of time, WAL buffer pages are maintained as a
> circularly sorted array in an ascending order from
> OldestInitializedPage to InitializedUpTo (new pages are inserted at
> this end).

I don't see any reference to OldestInitializedPage or anything like it,
with or without your patch. Am I missing something?

> - Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
> as the page at LSN 80 doesn't exist despite other 5 pages starting
> from LSN 90 exist.

This is what I imagined a "partial hit" was: read the 5 pages starting
at 90. The caller would then need to figure out how to read the page at
LSN 80 from the segment files.

I am not saying we should support this case; perhaps it doesn't matter.
I'm just describing why that term was confusing for me.

> If a caller asks for an unflushed WAL read
> intentionally or unintentionally, XLogReadFromBuffers() reads only 4
> pages starting from LSN 150 to LSN 180 and will leave the remaining 2
> pages for the caller to deal with. This is the partial hit that can
> happen.

To me that's more like an EOF case. "Partial hit" sounds to me like the
data exists but is not available in the cache (i.e. go to the segment
files); whereas if it encountered the end, the data is not available at
all.

> >
> WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
> I'm not sure how using the memory barrier, not WALBufMappingLock,
> prevents writers from replacing WAL buffer pages while readers
> reading
> the pages.

It doesn't *prevent* that case, but it does *detect* that case. We
don't want to prevent writers from replacing WAL buffers, because that
would mean we are slowing down the critical WAL writing path.

Let me explain the potential problem cases, and how the barriers
prevent them:

Potential problem 1: the page is not yet resident in the cache at the
time the memcpy begins. In this case, the first read barrier would
ensure that the page is also not yet resident at the time xlblocks[idx]
is read into endptr1, and we'd break out of the loop.

Potential problem 2: the page is evicted before the memcpy finishes. In
this case, the second read barrier would ensure that the page was also
evicted before xlblocks[idx] is read into endptr2, and again we'd
detect the problem and break out of the loop.

I assume here that, if xlblocks[idx] holds the endPtr of the desired
page, all of the bytes for that page are resident at that moment. I
don't think that's true right now: AdvanceXLInsertBuffers() zeroes the
old page before updating xlblocks[nextidx]. I think it needs something
like:

pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
pg_write_barrier();

before the MemSet.

I didn't review your latest v14 patch yet.

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Quan Zongliang 2023-11-03 07:27:16 Improvement discussion of custom and generic plans
Previous Message Richard Guo 2023-11-03 06:46:56 Re: Fix bogus Asserts in calc_non_nestloop_required_outer