| From: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
|---|---|
| To: | Ayush Tiwari <ayushtiwari(dot)slg01(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
| Subject: | Re: [PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort |
| Date: | 2026-05-06 08:46:04 |
| Message-ID: | CABPTF7VRHJ=mNg7q7z4V51K+Z7X9HVr+_Gc4O7bQaBoJRWsF6w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Ayush,
On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <ayushtiwari(dot)slg01(at)gmail(dot)com> wrote:
>
> Hi,
>
> I found a backend crash in WAIT FOR LSN when it is interrupted inside a
> savepoint and the session then waits again.
>
> I tried to find if it was already reported, but could not find it, so, posting it.
>
> While navigating I noticed WAIT FOR LSN cleanup is incomplete on
> subtransaction abort. An interrupt such as statement_timeout while
> waiting inside a savepoint leaves stale per-backend wait state,
> causing a later WAIT FOR LSN in the same backend to violate
> the wait-heap invariant and crash an assertion-enabled build.
>
> A small reproducer is:
>
> BEGIN;
> SAVEPOINT s;
> SET statement_timeout = '100ms';
> WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush');
> ROLLBACK TO s;
> SET statement_timeout = 0;
> WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', NO_THROW);
> COMMIT;
>
> where <future-lsn> can be generated with:
>
> SELECT pg_current_wal_insert_lsn() + 10000000000;
>
> TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c"
>
> The attached patch mirrors the top-level abort cleanup by calling
> WaitLSNCleanup() from AbortSubTransaction(), after LWLockReleaseAll(). It
> also adds a TAP test to verify that WAIT FOR LSN can be reused in the same
> backend after a statement_timeout and ROLLBACK TO SAVEPOINT.
>
> Thoughts?
Thanks for reporting this. I agree with your analysis. We need to add
this clean-up into AbortSubTransaction. I've some comments on the
patch:
1) Update the comment of WaitLSNCleanup
Now the comment of this function says, "Clean up LSN waiters for
exiting process." After this patch, this description would be too
narrow because after a ROLLBACK TO SAVEPOINT, the same backend
continues running and may issue another WAIT FOR LSN. How about
changing it to something like:
/*
* Clean up any LSN wait state for the current process.
*/
2) How about turning the SET statement_timeout = '100ms'; into more
deterministic machinery used in other sections of the test file to
ensure that the registration actually occurs by starting the waiter in
a background psql session and waits until itself reports: wait_event =
'WaitForWalFlush' for that backend.
3) For the validation, consider replacing the fast success of '0/0'
with the timeout of the unreachable LSN used earlier. This would
reduce assumptions about the order between registration and the fast
check being fixed.
--
Best,
Xuneng
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Fix-WAIT-FOR-LSN-cleanup-on-subtransaction-abort.patch | application/octet-stream | 4.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Alex Guo | 2026-05-06 08:46:33 | Re: COPY JSON: use trailing commas in FORCE_ARRAY output |
| Previous Message | Antonin Houska | 2026-05-06 08:25:44 | Re: Adding REPACK [concurrently] |