Re: using an end-of-recovery record in all cases

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: using an end-of-recovery record in all cases
Date: 2022-04-20 06:54:07
Message-ID: CAAJ_b95gEKrfwE+HUpxqCYpdY8cpyZ4oPeVf4c6mxvQO6LDPAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > The cfbot reports that this version of the patch doesn't apply anymore:
>
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.
>
> I mentioned two problems with $SUBJECT in the first email with this
> subject line. One was a bug, which Noah has since fixed (thanks,
> Noah). The other problem is that LogStandbySnapshot() and a bunch of
> its friends expect latestCompletedXid to always be a normal XID, which
> is a problem because (1) we currently set nextXid to 3 and (2) at
> startup, we compute latestCompletedXid = nextXid - 1. The current code
> dodges this kind of accidentally: the checkpoint that happens at
> startup is a "shutdown checkpoint" and thus skips logging a standby
> snapshot, since a shutdown checkpoint is a sure indicator that there
> are no running transactions. With the changes, the checkpoint at
> startup happens after we've started allowing write transactions, and
> thus a standby snapshot needs to be logged also. In the cases where
> the end-of-recovery record was already being used, the problem could
> have happened already, except for the fact that those cases involve a
> standby promotion, which doesn't happen during initdb. I explored a
> few possible ways of solving this problem.
>
> The first thing I considered was replacing latestCompletedXid with a
> name like incompleteXidHorizon. The idea is that, where
> latestCompletedXid is the highest XID that is known to have committed
> or aborted, incompleteXidHorizon would be the lowest XID such that all
> known commits or aborts are for lower XIDs. In effect,
> incompleteXidHorizon would be latestCompletedXid + 1. Since
> latestCompletedXid is always normal except when it's 2,
> incompleteXidHorizon would be normal in all cases. Initially this
> seemed fairly promising, but it kind of fell down when I realized that
> we copy latestCompletedXid into
> ComputeXidHorizonsResult.latest_completed. That seemed to me to make
> the consequences of the change a bit more far-reaching than I liked.
> Also, it wasn't entirely clear to me that I wouldn't be introducing
> any off-by-one errors into various wraparound calculations with this
> approach.
>
> The second thing I considered was skipping LogStandbySnapshot() during
> initdb. There are two ways of doing this and neither of them seem that
> great to me. Something that does work is to skip LogStandbySnapshot()
> when in single user mode, but there is no particular reason why WAL
> generated in single user mode couldn't be replayed on a standby, so
> this doesn't seem great. It's too big a hammer for what we really
> want, which is just to suppress this during initdb. Another way of
> approaching it is to skip it in bootstrap mode, but that actually
> doesn't work: initdb then fails during the post-bootstrapping step
> rather than during bootstrapping. I thought about patching around that
> by forcing the code that generates checkpoint records to forcibly
> advance an XID of 3 to 4, but that seemed like working around the
> problem from the wrong end.
>
> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.

IIUC, the failure was something like this on initdb:

running bootstrap script ... TRAP:
FailedAssertion("TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)",
File: "procarray.c", Line: 2892, PID: 60363)

/bin/postgres(ExceptionalCondition+0xb9)[0xb3917d]
/bin/postgres(GetRunningTransactionData+0x36c)[0x96aa26]
/bin/postgres(LogStandbySnapshot+0x64)[0x974393]
/bin/postgres(CreateCheckPoint+0x67f)[0x5928bf]
/bin/postgres(RequestCheckpoint+0x26)[0x8ca649]
/bin/postgres(StartupXLOG+0xf51)[0x591126]
/bin/postgres(InitPostgres+0x188)[0xb4f2ac]
/bin/postgres(BootstrapModeMain+0x4d3)[0x5ac6de]
/bin/postgres(main+0x275)[0x7ca72e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f71af82d445]
/bin/postgres[0x48aae9]
child process was terminated by signal 6: Aborted
initdb: removing data directory "/inst/data"

That was happening because RequestCheckpoint() was called from StartupXLOG()
unconditionally, but with the v5 patch that is not true.

If my understanding is correct then we don't need any handling
for latestCompletedXid, at least in this patch.

Regards,
Amul

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2022-04-20 07:20:48 Re: Logical replication timeout problem
Previous Message Amit Kapila 2022-04-20 06:00:02 Re: Stabilizing the test_decoding checks, take N