Re: [Patch] ALTER SYSTEM READ ONLY

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Date: 2021-11-17 05:43:32
Message-ID: CAAJ_b97XtC7o+czLv=UiyVtCR3Xd5eSi_rUhFaX5i=ao55TmoQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
> > Attached is the rebased version of refactoring as well as the
> > pg_prohibit_wal feature patches for the latest master head (commit #
> > 39a3105678a).
>
> I spent a lot of time today studying 0002, and specifically the
> question of whether EndOfLog must be the same as
> XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> XLogCtl->replayEndTLI.
>
> The answer to the former question is "no" because, if we don't enter
> redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> enter redo, then I think it has to be the same unless something very
> weird happens. EndOfLog gets set like this:
>
> XLogBeginRead(xlogreader, LastRec);
> record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> EndOfLog = EndRecPtr;
>
> In every case that exists in our regression tests, EndRecPtr is the
> same before these three lines of code as it is afterward. However, if
> you test with recovery_target=immediate, you can get it to be
> different, because in that case we drop out of the redo loop after
> calling recoveryStopsBefore() rather than after calling
> recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> recovery_target_inclusive=off you can likewise get it to be different
> (though I discovered the hard way that recovery_target_inclusive=off
> is ignored when you use recovery_target_name). It seems like a really
> bad thing that neither recovery_target=immediate nor
> recovery_target_inclusive=off have any tests, and I think we ought to
> add some.
>

recovery/t/003_recovery_targets.pl has test for
recovery_target=immediate but not for recovery_target_inclusive=off, we
can add that for recovery_target_lsn, recovery_target_time, and
recovery_target_xid case only where it affects.

> Anyway, in effect, these three lines of code have the effect of
> backing up the xlogreader by one record when we stop before rather
> than after a record that we're replaying. What that means is that
> EndOfLog is going to be the end+1 of the last record that we actually
> replayed. There might be one more record that we read but did not
> replay, and that record won't impact the value we end up with in
> EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> record that we actually replayed. To put that another way, there's no
> way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> and before we change LastRec. So in the cases where
> XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> different from EndOfLog if something different happens when we re-read
> the last-replayed WAL record than what happened when we read it the
> first time. That seems unlikely, and would be unfortunate it if it did
> happen. I am inclined to think that it might be better not to reread
> the record at all, though.

There are two reasons that the record is reread; first, one that you
have just explained where the redo loop drops out due to
recoveryStopsBefore() and another one is that InRecovery is false.

In the formal case at the end, redo while-loop does read a new record
which in effect updates EndRecPtr and when we breaks the loop, we do
reach the place where we do reread record -- where we do read the
record (i.e. LastRec) before the record that redo loop has read and
which correctly sets EndRecPtr. In the latter case, definitely, we
don't need any adjustment to EndRecPtr.

So technically one case needs reread but that is also not needed, we
have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
do not need to reread the record, but EndOfLog and EndOfLogTLI should
be set conditionally something like:

if (InRecovery)
{
EndOfLog = XLogCtl->lastReplayedEndRecPtr;
EndOfLogTLI = XLogCtl->lastReplayedTLI;
}
else
{
EndOfLog = EndRecPtr;
EndOfLogTLI = replayTLI;
}

> As far as this patch goes, I think we need
> a solution that doesn't involve fetching EndOfLog from a variable
> that's only sometimes initialized and then not doing anything with it
> except in the cases where it was initialized.
>

Another reason could be EndOfLog changes further in the following case:

/*
* Actually, if WAL ended in an incomplete record, skip the parts that
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
}

Now only solution that I can think is to copy EndOfLog (so
EndOfLogTLI) into shared memory.

> As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> think the regression tests contain any scenarios where we run recovery
> and the values end up different. However, I think that the code sets
> EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> record originated. So imagine that we are looking for WAL that ought
> to be in 000000010000000000000003 but we don't find it; instead we
> find 000000020000000000000003 because our recovery target timeline is
> 2, or something that has 2 in its history. We will read the WAL for
> timeline 1 from this file which has timeline 2 in the file name. I
> think if recovery ends in this file before the timeline switch, these
> values will be different. I did not try to construct a test case for
> this today due to not having enough time, so it's possible that I'm
> wrong about this, but that's how it looks to me from the code.
>

I am not sure, I have understood this scenario due to lack of
expertise in this area -- Why would the record we looking that ought
to be in 000000010000000000000003 we don't find it? Possibly WAL
corruption or that file is missing?

Regards,
Amul

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-11-17 05:56:03 Re: Skipping logical replication transactions on subscriber side
Previous Message Andrey Borodin 2021-11-17 05:39:58 Slow client can delay replication despite max_standby_streaming_delay set