Re: pg_veryfybackup can fail with a valid backup for TLI > 1

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pg_veryfybackup can fail with a valid backup for TLI > 1
Date: 2021-08-20 06:33:37
Message-ID: 20210820.153337.875399999696544353.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for the comment.

At Thu, 19 Aug 2021 15:51:38 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> On Wed, Aug 18, 2021 at 02:30:31PM +0900, Kyotaro Horiguchi wrote:
> > The "Start-LSN" above is the beginning of timeline 2, not the
> > backup-start LSN. The segment had been removed by the checkpoint.
>
> Good catch. That's broken, and I find cleaner the logic to compare
> the TLI numbers rather than the start or end position of the TLI to
> check if a TLI should have a parent one.
>
> > The comment for AddWALInfoToBackupManifest() says:
> > > * Add information about the WAL that will need to be replayed when restoring
> > > * this backup to the manifest.
>
> Yes, backup manifests include one array with all the timelines that
> require coverage. This case would only impact backups taken from
> standbys, which have timeline jumps while the backup is taken with
> parent nodes promoted. That's easy enough to test with a chain of
> standbys, some pgbench and pg_basebackup --max-rate to slow down the
> whole.

(I was very annoyed that pg_basbackup responds to --max-rate=1 only by
"out of range" without informing of the valid range..)

That's looks like a domino falling. I had the following result, which
looks fine.

"WAL-Ranges": [
{ "Timeline": 6, "Start-LSN": "0/630C7E8", "End-LSN": "0/630C850" },
{ "Timeline": 5, "Start-LSN": "0/630C780", "End-LSN": "0/630C7E8" },
{ "Timeline": 4, "Start-LSN": "0/630C718", "End-LSN": "0/630C780" },
{ "Timeline": 3, "Start-LSN": "0/630C6B0", "End-LSN": "0/630C718" },
{ "Timeline": 2, "Start-LSN": "0/5000028", "End-LSN": "0/630C6B0" }
],

00000006.history:
1 0/173F268 no recovery target specified
2 0/630C6B0 no recovery target specified
3 0/630C718 no recovery target specified
4 0/630C780 no recovery target specified
5 0/630C7E8 no recovery target specified

> > So I concluded that it's a thinko.
> >
> > Please see the attached. It needs to be back-patched to 13 but 13
> > doesn't accept the patch as is due to wording chages in TAP tests.
>
> + if (starttli == entry->tli)
> tl_beginptr = startptr;
> + else
> + {
> + tl_beginptr = entry->begin;
>
> I would add a comment here. This does not need to be complicated,
> say:
> "If this timeline entry matches with the timeline on which the backup
> started, WAL needs to be checked from the start LSN of the backup. If
> this entry refers to a newer timeline, WAL needs to be checked since
> the beginning of this timeline, so use the LSN where the timeline
> began."

Thanks for your help! I wanted to write something like that
there. Added it as-is.

> The TAP test is fancy, I like it. So let's keep it.
>
> +# base-backup runs a checkpoint but just to make sure.
> +$primary->safe_psql('postgres', 'SELECT pg_switch_wal();
> CHECKPOINT;');
> I don't think you need this part to get a failure as the checkpoint
> from the base backup will be sufficient.

Yes, it is just redandunt as (I tried to) said in the comment. (I
meant by the "just to make sure" "just to let readers of the test code
notice that a checkpoint is crucial there", but it doesn't seem
working:() Instead, I added a comment just before pg_basebackup.

# base-backup below runs a checkpoint, which removes the first segment
# of the current timeline

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v2-0001-Fix-basebackup-to-generate-correct-WAL-Ranges-inf_14-master.patch text/x-patch 3.4 KB
v2-0001-Fix-basebackup-to-generate-correct-WAL-Ranges-inf_13.patch text/x-patch 3.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Pyhalov 2021-08-20 07:13:30 Re: Push down time-related SQLValue functions to foreign server
Previous Message Andrey Borodin 2021-08-20 05:45:44 Re: reporting TID/table with corruption error