Re: Improve WALRead() to suck data directly from WAL buffers when possible

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date: 2022-12-26 08:50:07
Message-ID: CALj2ACU9cfAcfVsGwUqXMace_7rfSBJ7+hXVJfVV1jnspTDGHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> > > This adds copying of the whole page (at least) at every WAL *record*
> > > read,
> >
> > In the worst case yes, but that may not always be true. On a typical
> > production server with decent write traffic, it happens that the
> > callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
> > MAX_SEND_SIZE bytes.
>
> I agree with this.
>
> > > This patch copies the bleeding edge WAL page without recording the
> > > (next) insertion point nor checking whether all in-progress insertion
> > > behind the target LSN have finished. Thus the copied page may have
> > > holes. That being said, the sequential-reading nature and the fact
> > > that WAL buffers are zero-initialized may make it work for recovery,
> > > but I don't think this also works for replication.
> >
> > WALRead() callers are smart enough to take the flushed bytes only.
> > Although they read the whole WAL page, they calculate the valid bytes.
>
> Right
>
> On first read the patch looks good, although it needs some more
> thoughts on 'XXX' comments in the patch.

Thanks a lot for reviewing.

Here are some open points that I mentioned in v1 patch:

1.
+ * XXX: Perhaps, measuring the immediate lock availability and its impact
+ * on concurrent WAL writers is a good idea here.

It was shown in my testng upthread [1] that the patch does no harm in
this regard. It will be great if other members try testing in their
respective environments and use cases.

2.
+ * XXX: Perhaps, returning if lock is not immediately available a good idea
+ * here. The caller can then go ahead with reading WAL from WAL file.

After thinking a bit more on this, ISTM that doing the above is right
to not cause any contention when the lock is busy. I've done so in the
v2 patch.

3.
+ * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+ * a good idea here. This avoids unnecessary lock acquire-release cycles.
+ * One way to do that is by maintaining oldest WAL record that's currently
+ * present in WAL buffers.

I think by doing the above we might end up creating a new point of
contention. Because shared variables to track min and max available
LSNs in the WAL buffers will need to be protected against all the
concurrent writers. Also, with the change that's done in (2) above,
that is, quickly exiting if the lock was busy, this comment seems
unnecessary to worry about. Hence, I decided to leave it there.

4.
+ * XXX: Perhaps, we can further go and validate the found page header,
+ * record header and record at least in assert builds, something like
+ * the xlogreader.c does and return if any of those validity checks
+ * fail. Having said that, we stick to the minimal checks for now.

I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for next use
in AdvanceXLInsertBuffer(). The checks that we have for page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough for us
to ensure that we're not reading a page that got just initialized. The
callers will anyway perform extensive checks on page and record in
XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from WAL
buffers, then that must be treated as a bug IMO. Hence, I don't think
we need to do the above.

> And also I do not like that XLogReadFromBuffers() is using 3 bools
> hit/partial hit/miss, instead of this we can use an enum or some
> tristate variable, I think that will be cleaner.

Yeah, that seems more verbose, all that information can be deduced
from requested bytes and read bytes, I've done so in the v2 patch.

Please review the attached v2 patch further.

I'm also attaching two helper patches (as .txt files) herewith for
testing that basically adds WAL read stats -
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt - apply on HEAD and
monitor pg_stat_replication for per-walsender WAL read from WAL file
stats. USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt -
apply on v2 patch and monitor pg_stat_replication for per-walsender
WAL read from WAL buffers and WAL file stats.

[1] https://www.postgresql.org/message-id/CALj2ACXUbvON86vgwTkum8ab3bf1%3DHkMxQ5hZJZS3ZcJn8NEXQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt text/plain 15.9 KB
v2-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch application/octet-stream 8.1 KB
USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt text/plain 19.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-12-26 09:14:32 Re: Time delayed LR (WAS Re: logical replication restrictions)
Previous Message Dilip Kumar 2022-12-26 08:42:41 Re: Time delayed LR (WAS Re: logical replication restrictions)