Re: Add checkpoint and redo LSN to LogCheckpointEnd log message

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: bharath(dot)rupireddyforpostgres(at)gmail(dot)com
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, nathandbossart(at)gmail(dot)com, sfrost(at)snowman(dot)net, bossartn(at)amazon(dot)com, rjuju123(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
Date: 2022-02-09 02:52:04
Message-ID: 20220209.115204.1794224638476710282.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Bharath.

At Tue, 8 Feb 2022 14:33:01 +0530, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote in
> On Tue, Feb 8, 2022 at 2:10 PM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > > Thus.. the attached removes the ambiguity of of the proposed patch
> > > about the LSNs in the restartpoint-ending log message.
> >
> > Thoughts?
>
> Thanks for the patch. I have few comments on the
> v1-0001-Get-rid-of-unused-path-to-handle-concurrent-check.patch

Thanks for checking this!

> 1) Can we have this Assert right after "skipping restartpoint, already
> performed at %X/%X" error message block? Does it make any difference?
> My point is that if at all, we were to assert this, why can't we do it
> before CheckPointGuts?
> + /* We mustn't have a concurrent checkpoint that advances checkpoint LSN */
> + Assert(lastCheckPoint.redo > ControlFile->checkPointCopy.redo);

No. The assertion checks if something wrong has happend while
CheckPointGuts - which may take a long time - is running. If we need
to do that, it should be after CheckPointGuts. (However, I removed it
finally. See below.)

> 2) Related to the above Assert, do we really need an assertion or a FATAL error?

It's just to make sure in case where that happens by any chance in
future. But on second thought, as I mentioned, CreateRestartPoint is
called from standalone process or from checkpointer. I added that
assertion at the beginning of CreateRestartPoint. I think that
assertion is logically equal to the old assertion.

I remember that I felt uncomfortable with the lock-less behavior on
ControlFile, which makes the code a bit complex to read. So I moved
"PriorRedoPtr = " line to within the lock section just below.

Addition to that, I feel being confused by the parallel-use of
lastCheckPoint.redo and RedoRecPtr So I replaced lastCheckPoint.redo
after assiging the former to the latter with RedoRecPtr.

> 3) Let's be consistent with "crash recovery" - replace
> "archive-recovery" with "archive recovery"?
> + * We have exited from archive-recovery mode after this restartpoint
> + * started. Crash recovery ever after should always recover to the end

That's sensible. I found several existing use of archive-recovery in
xlog.c and a few other files but the fix for them is separated as
another patch (0005).

> 4) Isn't it enough to say "Crash recovery should always recover to the
> end of WAL."?
> + * started. Crash recovery ever after should always recover to the end

I think we need to explicitly say something like "we have exited
archive recovery while *this* restartpoint is running". I simplified
the sentence as the follows.

+ * Archive recovery have ended. Crash recovery ever after should
+ * always recover to the end of WAL.

> 5) Is there a reliable test case covering this code? Please point me
> if the test case is shared upthread somewhere.

Nothing. Looking from the opposite side, the existing code for the
competing restartpoint/checkpoint case should have not been exercised
at least for these several major versions. Instead, I added an
assertion at the beginning of CreateRestartPoint that asserts that
"this should be called only under standalone process or from
checkpointer.". If that assert doesn't fire while the whole test, it
should be the proof of the premise for this patch is correct.

> 6) So, with this patch, the v8 patch-set posted at [1] doesn't need
> any changes IIUC. If that's the case, please feel free to post all the
> patches together such that they get tested in cfbot.

The two are different fixes so I don't think they are ought to be
merged together.

> [1] - https://www.postgresql.org/message-id/CALj2ACUtZhTb%3D2ENkF3BQ3wi137uaGi__qzvXC-qFYC0XwjALw%40mail.gmail.com

As the old 0001, though I said it'fine :p) I added a comment that
reading ControlFile->checkPoint* is safe here.

The old 0002 (attached 0003) looks file to me.

The old 0003 (attached 0004):

+++ b/src/backend/access/rmgrdesc/xlogdesc.c
- appendStringInfo(buf, "redo %X/%X; "
+ appendStringInfo(buf, "redo lsn %X/%X; "

It is shown in the context of a checkpoint record, so I think it is
not needed or rather lengthning the dump line uselessly.

+++ b/src/backend/access/transam/xlog.c
- (errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
+ (errmsg("request to flush past end of generated WAL; request lsn %X/%X, current lsn %X/%X",

+++ b/src/backend/replication/walsender.c
- (errmsg("requested starting point %X/%X is ahead of the WAL flush position of this server %X/%X",
+ (errmsg("requested starting point %X/%X is ahead of the WAL flush LSN of this server %X/%X",

"WAL" is upper-cased. So it seems rather strange that the "lsn" is
lower-cased. In the first place the message doesn't look like a
user-facing error message and I feel we don't need position or lsn
there..

+++ b/src/bin/pg_rewind/pg_rewind.c
- pg_log_info("servers diverged at WAL location %X/%X on timeline %u",
+ pg_log_info("servers diverged at WAL LSN %X/%X on timeline %u",

I feel that we don't need "WAL" there.

+++ b/src/bin/pg_waldump/pg_waldump.c
- printf(_(" -e, --end=RECPTR stop reading at WAL location RECPTR\n"));
+ printf(_(" -e, --end=RECPTR stop reading at WAL LSN RECPTR\n"));

Mmm.. "WAL LSN RECPTR" looks strange to me. In the first place I
don't think "RECPTR" is a user-facing term. Doesn't something like the
follows work?

+ printf(_(" -e, --end=WAL-LSN stop reading at WAL-LSN\n"));

In some changes in this patch shorten the main message text of
fprintf-ish functions. That makes the succeeding parameters can be
inlined.

0001: The fix of CreateRestartPoint
0002: Add LSNs to checkpoint logs (the main patch)
0003: Change "location" to "LSN" of pg_controldata
0004: The same as 0003 for other uses of "location".
0005: Unhyphnate "-recovery" terms.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v9-0001-Get-rid-of-unused-path-to-handle-concurrent-check.patch text/x-patch 6.0 KB
v9-0002-Add-checkpoint-and-redo-LSN-to-LogCheckpointEnd-l.patch text/x-patch 3.0 KB
v9-0003-Change-location-to-lsn-in-pg_controldata.patch text/x-patch 2.6 KB
v9-0004-Change-location-to-lsn-in-user-facing-text.patch text/x-patch 15.1 KB
v9-0005-Get-rid-of-uses-of-some-hyphenated-words.patch text/x-patch 4.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-02-09 02:56:46 Re: Plug minor memleak in pg_dump
Previous Message Julien Rouhaud 2022-02-09 02:50:07 Unnecessary call to resetPQExpBuffer in getIndexes