Quick Links

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date:	2024-11-11 13:51:04
Message-ID:	CAExHW5sC6qQ=GZHij2o=Kj4gsMnJuE_PExs39Pn3-BqgAL4haA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, Nov 9, 2024 at 5:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 11/8/24 15:57, Ashutosh Bapat wrote:
> > ...
> >
> > After examining the code before reading [2], I came to the same
> > conclusion as Masahiko-san in [2]. We install candidate_restart_lsn
> > based on the running transaction record whose LSN is between
> > restart_lsn and confirmed_flush_lsn. Since candidate_restart_valid of
> > such candidates would be lesser than any confirmed_flush_lsn received
> > from downstream. I am surprised that the fix suggested by Masahiko-san
> > didn't work though. The fix also fix the asymmetry, between
> > LogicalIncreaseXminForSlot and LogicalIncreaseRestartDecodingForSlot,
> > that you have pointed out in your next email. What behaviour do you
> > see with that fix applied?
> >
> >
> > [1] https://www.postgresql.org/message-id/Yz2hivgyjS1RfMKs%40depesz.com
> > [2] https://www.postgresql.org/message-id/CAD21AoBVhYnGBuW_o%3DwEGgTp01qiHNAx1a14b1X9kFXmuBe%3Dsg%40mail.gmail.com
> >
> >
>
> I read that message (and the earlier discussion multiple times) while
> investigating the issue, and TBH it's not very clear to me what the
> conclusion is :-(
>
> There's some discussion about whether the candidate fields should be
> reset on release or not. There are even claims that it might be
> legitimate to not reset the fields and update the restart_lsn. Using
> such "stale" LSN values seems rather suspicious to me, but I don't have
> a proof that it's incorrect. FWIW this unclarity is what I mentioned the
> policy/contract for candidate fields is not explained anywhere.
>
> That being said, I gave that fix a try - see the attached 0001 patch. It
> tweaks LogicalIncreaseRestartDecodingForSlot (it needs a bit more care
> because of the spinlock), and it adds a couple asserts to make sure the
> data.restart_lsn never moves back.
>
> And indeed, with this my stress test script does not crash anymore.

I think the problem is about processing older running transactions
record and setting data.restart_lsn based on the candidates those
records produce. But what is not clear to me is how come a newer
candidate_restart_lsn is available immediately upon WAL sender
restart. I.e. in the sequence of events you mentioned in your first
email
1. 344.139 LOG: starting logical decoding for slot "s1"

2. 344.139 DETAIL: Streaming transactions committing after 1/E97EAC30,
reading WAL from 1/E96FB4D0.

3. 344.140 LOG: logical decoding found consistent point at 1/E96FB4D0

4. 344.140 DETAIL: Logical decoding will begin using saved snapshot.

5. 344.140 LOG: LogicalConfirmReceivedLocation 1/E9865398

6. 344.140 LOG: LogicalConfirmReceivedLocation updating
data.restart_lsn to 1/E979D4C8 (from 1/E96FB4D0)
candidate_restart_valid 0/0 (from 1/E9865398)
candidate_restart_lsn 0/0 (from 1/E979D4C8)

how did candidate_restart_lsn = 1/E979D4C8 and candidate_restart_valid
= 1/E9865398 were set in ReplicationSlot after WAL sender? It means it
must have read and processed running transaction record at 1/E9865398.
If that's true, how come it went back to a running transactions WAL
record at 1/E979D4C8? It should be reading WAL records sequentially,
hence read 1/E979D4C8 first then 1/E9865398.

344.145 LOG: LogicalIncreaseRestartDecodingForSlot s1
candidate_restart_valid_lsn 1/E979D4C8 (0/0)
candidate_restart_lsn 1/E96FB4D0 (0/0)

--
Best Wishes,
Ashutosh Bapat

In response to

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 at 2024-11-09 11:35:32 from Tomas Vondra

Responses

Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4 at 2024-11-11 14:17:36 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ashutosh Bapat	2024-11-11 13:58:24	Re: Add html-serve target to autotools and meson
Previous Message	Heikki Linnakangas	2024-11-11 13:25:29	Re: Support LIKE with nondeterministic collations