Re: Make mesage at end-of-recovery less scary.

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pashkin(dot)elfe(at)gmail(dot)com, Michael Paquier <michael(at)paquier(dot)xyz>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, David Steele <david(at)pgmasters(dot)net>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, jtc331(at)gmail(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Make mesage at end-of-recovery less scary.
Date: 2022-02-09 12:01:02
Message-ID: CAE9k0P=9gBHdBff2Rhq9pwa9ghNft_aHHSmO51K+EdDnfnkpFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> Hi, Ashutosh.
>
> At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote in
> > Here are some of my review comments on the v11 patch:
>
> Thank you for taking a look on this.
>
> > - (errmsg_internal("reached end of WAL in
> > pg_wal, entering archive recovery")));
> > + (errmsg_internal("reached end of WAL at %X/%X
> > on timeline %u in %s during crash recovery, entering archive
> > recovery",
> > + LSN_FORMAT_ARGS(ErrRecPtr),
> > + replayTLI,
> > + xlogSourceNames[currentSource])));
> >
> > Why crash recovery? Won't this message get printed even during PITR?
>
> It is in the if-block with the following condition.
>
> > * If archive recovery was requested, but we were still doing
> > * crash recovery, switch to archive recovery and retry using the
> > * offline archive. We have now replayed all the valid WAL in
> > * pg_wal, so we are presumably now consistent.
> ...
> > if (!InArchiveRecovery && ArchiveRecoveryRequested)
>
> This means archive-recovery is requested but not started yet. That is,
> we've just finished crash recovery. The existing comment cited
> together is mentioning that.
>
> At the end of PITR (or archive recovery), the other code works.
>

This is quite understandable, the point here is that the message that
we are emitting says, we have just finished reading the wal files in
the pg_wal directory during crash recovery and are now entering
archive recovery when we are actually doing point-in-time recovery
which seems a bit misleading.

> > /*
> > * If we haven't emit an error message, we have safely reached the
> > * end-of-WAL.
> > */
> > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
> > {
> > char *fmt;
> >
> > if (StandbyMode)
> > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
> > else if (InArchiveRecovery)
> > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
> > else
> > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
>
> The last among the above messages is choosed when archive-recovery is
> not requested at all.
>
> > I just did a PITR and could see these messages in the logfile.
>
> Yeah, the log lines are describing that the server starting with crash
> recovery to run PITR.
>
> > 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time
> > recovery to WAL location (LSN) "0/5227790"
> > 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not
> > properly shut down; automatic recovery in progress
>
> Well. I guess that the "automatic recovery" is ambiguous. Does it
> make sense if the second line were like the follows instead?
>
> + 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress
>

Well, according to me the current message looks fine.

> > Lastly, are we also planning to backport this patch?
>
> This is apparent a behavioral change, not a bug fix, which I think we
> regard as not appropriate for back-patching.
>
>
> As the result, I made the following chages in the version 11.
>
> 1. Changed the condition for the "end-of-WAL" message from
> emode_for_corrupt_record to the EndOfWAL flag.
>
> 2. Corrected the wording of end-of-wal to end-of-WAL.
>
> 3. In the "reached end of WAL" message, report the LSN of the
> beginning of failed record instead of the beginning of the
> last-succeeded record.
>
> 4. In the changed message in walreceiver.c, I swapped LSN and timeline
> so that they are in the same order with other similar messages.
>

Thanks for sharing this information.

==

Here is one more comment:

One more comment:

+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+ "checkpoint file is identified");
+my $chkptfile = $1;

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

--
With Regards,
Ashutosh Sharma.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Pyhalov 2022-02-09 12:18:12 Justin Pryzby <pryzby@telsasoft.com>
Previous Message Peter Eisentraut 2022-02-09 11:48:35 Re: Database-level collation version tracking