Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2024-01-22 14:03:28
Message-ID: CAGPVpCQrRD5OXjxb=qAP6jfM_S3b=d9iT59HR7wzUxnJbBYE1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Bharath,

Thanks for working on this. It seems like a nice improvement to have.

Here are some comments on 0001 patch.

1- xlog.c
+ /*
+ * Fast paths for the following reasons: 1) WAL buffers aren't in use when
+ * server is in recovery. 2) WAL is inserted into WAL buffers on current
+ * server's insertion TLI. 3) Invalid starting WAL location.
+ */

Shouldn't the comment be something like "2) WAL is *not* inserted into WAL
buffers on current server's insertion TLI" since the condition to break is tli
!= GetWALInsertionTimeLine()

2-
+ /*
+ * WAL being read doesn't yet exist i.e. past the current insert position.
+ */
+ if ((startptr + count) > reservedUpto)
+ return ntotal;

This question may not even make sense but I wonder whether we can read from
startptr only to reservedUpto in case of startptr+count exceeds
reservedUpto?

3-
+ /* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+ if ((startptr + count) > LogwrtResult.Write &&
+ (startptr + count) <= reservedUpto)
+ WaitXLogInsertionsToFinish(startptr + count);

Do we need to check if (startptr + count) <= reservedUpto as we already
verified this condition a few lines above?

4-
+ Assert(nread > 0);
+ memcpy(dst, data, nread);
+
+ /*
+ * Make sure we don't read xlblocks down below before the page
+ * contents up above.
+ */
+ pg_read_barrier();
+
+ /* Recheck if the read page still exists in WAL buffers. */
+ endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+ /* Return if the page got initalized while we were reading it. */
+ if (expectedEndPtr != endptr)
+ break;
+
+ /*
+ * Typically, we must not read a WAL buffer page that just got
+ * initialized. Because we waited enough for the in-progress WAL
+ * insertions to finish above. However, there can exist a slight
+ * window after the above wait finishes in which the read buffer page
+ * can get replaced especially under high WAL generation rates. After
+ * all, we are reading from WAL buffers without any locks here. So,
+ * let's not count such a page in.
+ */
+ phdr = (XLogPageHeader) page;
+ if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+ phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+ phdr->xlp_tli == tli))
+ break;

I see that you recheck if the page still exists and so at the end. What
would you think about memcpy'ing only after being sure that we will need
and use the recently read data? If we break the loop during the recheck, we
simply discard the data read in the latest attempt. I guess that this may
not be a big deal but the data would be unnecessarily copied into the
destination in such a case.

5- xlogreader.c
+ nread = XLogReadFromBuffers(startptr, tli, count, buf);
+
+ if (nread > 0)
+ {
+ /*
+ * Check if its a full read, short read or no read from WAL buffers.
+ * For short read or no read, continue to read the remaining bytes
+ * from WAL file.
+ *
+ * XXX: It might be worth to expose WAL buffer read stats.
+ */
+ if (nread == count) /* full read */
+ return true;
+ else if (nread < count) /* short read */
+ {
+ buf += nread;
+ startptr += nread;
+ count -= nread;
+ }

Typo in the comment. Should be like "Check if *it's* a full read, short
read or no read from WAL buffers."

Also I don't think XLogReadFromBuffers() returns anything less than 0 and
more than count. Is verifying nread > 0 necessary? I think if nread does
not equal to count, we can simply assume that it's a short read. (or no
read at all in case nread is 0 which we don't need to handle specifically)

6-
+ /*
+ * We determined how much of the page can be validly read as 'count', read
+ * that much only, not the entire page. Since WALRead() can read the page
+ * from WAL buffers, in which case, the page is not guaranteed to be
+ * zero-padded up to the page boundary because of the concurrent
+ * insertions.
+ */

I'm not sure about pasting this into the most places we call WalRead().
Wouldn't it be better if we mention this somewhere around WALRead() only
once?

Best,
--
Melih Mutlu
Microsoft

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2024-01-22 14:18:50 Re: Remove unused fields in ReorderBufferTupleBuf
Previous Message Aleksander Alekseev 2024-01-22 14:02:03 Re: Remove unused fields in ReorderBufferTupleBuf