Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

From: ZizhuanLiu X-MAN <44973863(at)qq(dot)com>
To: 반지현 <rring0727(at)gmail(dot)com>, cca5507 <cca5507(at)qq(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record
Date: 2026-06-13 13:33:46
Message-ID: tencent_3701D4DD8D5CF0176D9CB4E3132ADF939608@qq.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>To: ZizhuanLiu X-MAN <44973863(at)qq(dot)com>, cca5507 <cca5507(at)qq(dot)com>
>Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
>Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record
>Date: 2026-06-11 06:53:19
>Message-ID: 3CAC439F-E1F0-4F37-BE58-CC9D7CDE889C(at)gmail(dot)com
>Views: Whole Thread | Raw Message | Download mbox | Resend email
>Lists: pgsql-hackers
>Hi ZizhuanLiu,
>
>
>
>Thanks for the caller survey — classifying the call sites by how they
>consume the return value is a useful way to frame the risk, and I agree
>pg_switch_wal() is the one path where the LSN escapes to external
>consumers.
>
>
>
>To help bound that risk, I re-ran my April comparison on a second
>platform (Windows 11, MSYS2/gcc 16.1.0, commit 9d141466ff, 19beta1),
>building unpatched master and v2 side by side and running the same
>sequence on identically-initialized clusters:
>
>
>
> before_switch: 0/017CF958 (both builds)
> switch_1: 0/017CF970 (both builds)
> after_1: 0/02000000 (both builds)
> switch_2/3: 0/02000000, no-op (both builds)
>
>
>
>The two builds returned byte-for-byte identical LSNs at every step,
>and "make check" passes on the patched build (All 245 tests passed).
>Incidentally this also confirms numerically that the MAXALIGN() change
>is a no-op: 0/017CF958 + 24 = 0/017CF970, since SizeOfXLogRecord is
>already 8-byte aligned.
>
>
>
>As far as I can tell, the only input where v2 can return a different
>value than current master is when the XLOG_SWITCH record ends exactly
>on a page boundary (StartPos at page offset XLOG_BLCKSZ - 24): old code
>takes the cross-page branch and adds a page header size even though no
>part of the record lies on the next page, while v2 returns the boundary
>itself. The v2 value is the one consistent with the end-pointer
>convention of XLogBytePosToEndRecPtr(). So any external tool that
>observed a different value in that rare alignment was depending on a
>value inconsistent with PostgreSQL's own end-pointer semantics — I'd
>read v2 as correcting that, rather than breaking compatibility.
>
>
>
>So in summary: identical SQL-visible behavior on all common paths
>(verified on Linux in April and Windows now), with the single divergent
>case being a consistency fix. If it would help, I could try to craft a
>reproduction of the exact page-boundary case (padding WAL position via
>pg_logical_emit_message), or write a small TAP test pinning down the
>boundary behavior of pg_switch_wal().
>
>
>
>Regards,
>Jihyun Bahn

Hi, 반지현

Thank you very much for your professional and comprehensive testing and analysis.
I’ve learned a great deal more about WAL internals through this discussion, including
details related to pg_logical_emit_message.

After further review and testing on my side, I fully agree with your reasoning.
Patch v2 corrects the inconsistent end-pointer calculation and does not break
compatibility with external tools.

Upon closer reflection, the original condition
```c
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
```c
is intended to detect whether the XLOG_SWITCH record crosses a page boundary.
This branch then decides whether to add SizeOfXLogLongPHD (crossing WAL segment)
or SizeOfXLogShortPHD (crossing page only, same segment).

As we all know, EndPos points to the start of the next WAL record rather than the end
of the current one. The byte range occupied by any WAL record is [StartPos, EndPos).
Therefore, to correctly judge page crossing for the switch record, the condition should be adjusted to:
```c
if (StartPos / XLOG_BLCKSZ != (EndPos - 1) / XLOG_BLCKSZ)
```c
This expression accurately preserves the original design intent. I have prepared a v3 patch
based on v2, with only this single conditional expression modified.

After additional consideration: the helper functions that reserve space for WAL records
already compute the [StartPos, EndPos) range, including the *EndPos = XLogBytePosToEndRecPtr()
logic we discussed earlier. The XLOG_SWITCH record is a special case because it intentionally
consumes the remaining space in the current segment, forcing us to recalculate EndPos
a second time in this code block.

To avoid duplicate calculations, we can extend the parameters of the local function
ReserveXLogSwitch() by adding an output argument XLogRecPtr *actual_EndPos
to store the value computed via *EndPos = XLogBytePosToEndRecPtr(). With this change,
the logic here becomes cleaner: after inserted the XLOG_SWITCH WAL, we simply restore
EndPos to the precomputed actual_EndPos. All end-pointer calculations are unified in
one place — this is what v4 implements.

Both v3 and v4 compile successfully and pass tests covering 5 scenarios (including multiple corner cases).
The patch files and detailed test logs are attached for your review:
1.An XLOG_SWITCH record fits entirely within a WAL block with ample trailing free space;
2.An XLOG_SWITCH record exactly fills a WAL block with zero remaining free space;
3.Multiple consecutive calls to pg_switch_wal() do not trigger an actual WAL segment rotation;
4.The XLOG_SWITCH record crosses block boundaries within a single WAL segment;
5.The XLOG_SWITCH record crosses both block boundaries and WAL segment files.

Feel free to share your thoughts and point out any mistakes.

regards,
--
ZizhuanLiu (X-MAN) 
44973863(at)qq(dot)com

Attachment Content-Type Size
v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patch application/octet-stream 1.2 KB
v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patch application/octet-stream 3.2 KB
test-result-v3.txt application/octet-stream 16.1 KB
test-result-v4.txt application/octet-stream 14.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2026-06-13 13:54:01 Re: Subquery pull-up increases jointree search space
Previous Message Tomas Vondra 2026-06-13 13:04:06 Re: Subquery pull-up increases jointree search space