Re: backup manifests

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Tels <nospam-pg-abuse(at)bloodgate(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: backup manifests
Date: 2020-03-29 03:40:10
Message-ID: 20200329034010.GB2247658@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 27, 2020 at 01:53:54PM -0400, Robert Haas wrote:
> - Replace a doc paragraph about the advantages and disadvantages of
> CRC-32C with one by Stephen Frost, with a slightly change by me that I
> thought made it sound more grammatical.

Defaulting to CRC-32C seems prudent to me:

- As Andres Freund said, SHA-512 is slow relative to storage now available.
Since gzip is a needlessly-slow choice for backups (or any application that
copies the compressed data just a few times), comparison to "gzip -6" speed
is immaterial.

- While I'm sure some other fast hash would be a superior default, introducing
a new algorithm is a bikeshed, as you said. This design makes it easy,
technically, for someone to introduce a new algorithm later. CRC-32C is not
catastrophically unfit for 1GiB files.

- Defaulting to SHA-512 would, in the absence of a WAL archive that also uses
a cryptographic hash function, give a false sense of having achieved some
coherent cryptographic goal. With the CRC-32C default, WAL and the rest get
similar protection. I'm discounting the case of using BASE_BACKUP without a
WAL archive, because I expect little intersection between sites "worried
enough to hash everything" and those "not worried enough to use an archive".
(On the other hand, the program that manages the WAL archive can reasonably
own hashing base backups; putting ownership in the server isn't achieving
much extra.)

> + <refnamediv>
> + <refname>pg_validatebackup</refname>
> + <refpurpose>verify the integrity of a base backup of a
> + <productname>PostgreSQL</productname> cluster</refpurpose>
> + </refnamediv>

> + <listitem>
> + <para>
> + <literal>pg_wal</literal> is ignored because WAL files are sent
> + separately from the backup, and are therefore not described by the
> + backup manifest.
> + </para>
> + </listitem>

Stephen Frost mentioned that a backup could pass validation even if
pg_basebackup were killed after writing the base backup and before finishing
the writing of pg_wal. One might avoid that by simply writing the manifest to
a temporary name and renaming it to the final name after populating pg_wal.

What do you think of having the verification process also call pg_waldump to
validate the WAL CRCs (shown upthread)? That looked helpful and simple.

I think this functionality doesn't belong in its own program. If you suspect
pg_basebackup or pg_restore will eventually gain the ability to merge
incremental backups into a recovery-ready base backup, I would put the
functionality in that program. Otherwise, I would put it in pg_checksums.
For me, part of the friction here is that the program description indicates
general verification, but the actual functionality merely checks hashes on a
directory tree that happens to represent a PostgreSQL base backup.

> + parse->pathname = palloc(raw_length + 1);

I don't see this freed anywhere; is it? (It's useful to make peak memory
consumption not grow in proportion to the number of files backed up.)

[This message is not a full code review.]

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-29 03:50:32 snapper vs. HEAD
Previous Message Andres Freund 2020-03-29 03:35:22 Re: Improving connection scalability: GetSnapshotData()