Re: backup manifests

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Tels <nospam-pg-abuse(at)bloodgate(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: backup manifests
Date: 2020-03-25 16:50:14
Message-ID: CA+TgmoYOKC_8o-AR1jTQs0mOrFx=_Rcy5udor1m-LjyJNiSWPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 25, 2020 at 9:31 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> I get that the default for manifest is 'no', but I don't really see how
> that means that the lack of saying anything about checksums should mean
> "give me crc32c checksums". It's really rather common that if we don't
> specify something, it means don't do that thing- like an 'ORDER BY'
> clause.

That's a fair argument, but I think the other relevant principle is
that we try to give people useful defaults for things. I think that
checksums are a sufficiently useful thing that having the default be
not to do it doesn't make sense. I had the impression that you and
David were in agreement on that point, actually.

> There were also comments made up-thread about how it might not be great
> for larger (eg: 1GB files, like we tend to have quite a few of...), and
> something about it being a 40 year old algorithm..

Well, the 512MB "limit" for CRC-32C means only that for certain very
specific types of errors, detection is not guaranteed above that file
size. So if you have a single flipped bit, for example, and the file
size is greater than 512MB, then CRC-32C has only a 99.9999999767169%
chance of detecting the error, whereas if the file size is less than
512MB, it is 100% certain, because of the design of the algorithm. But
nine nines is plenty, and neither SHA nor our page-level checksums
provide guaranteed error detection properties anyway.

I'm not sure why the fact that it's a 40-year-old algorithm is
relevant. There are many 40-year-old algorithms that are very good.
Generally, if we discover that we're using bad 40-year-old algorithms,
like Knuth's tape sorting stuff, we eventually figure out how to
replace them with something else that's better. But there's no reason
to retire an algorithm simply because it's old. I have not heard
anyone say, for example, that we should stop using CRC-32C for XLOG
checksums. We continue to use it for that purpose because it (1) is
highly likely to detect any errors and (2) is very fast. Those are the
same reasons why I think it's a good fit for this case.

My guess is that if this patch is adopted as currently proposed, we
will eventually need to replace the cryptographic hash functions due
to the march of time. As I'm sure you realize, the problem with hash
functions that are designed to foil an adversary is that adversaries
keep getting smarter. So, eventually someone will probably figure out
how to do something nefarious with SHA-512. Some other technique that
nobody's cracked yet will need to be adopted, and then people will
begin trying to crack that, and the whole thing will repeat. But I
suspect that we can keep using the same non-cryptographic hash
function essentially forever. It does not matter that people know how
the algorithm works because it makes no pretensions of trying to foil
an opponent. It is just trying to mix up the bits in such a way that a
change to the file is likely to cause a change in the checksum. The
bit-mixing properties of the algorithm do not degrade with the passage
of time.

> I'm sure that sha256 takes a lot more time than crc32c, I'm certainly
> not trying to dispute that, but what's relevent here is how much it
> impacts the time required to run the overall backup (including sync'ing
> it to disk, and possibly network transmission time.. if we're just
> comparing the time to run it through memory then, sure, the sha256
> computation time might end up being quite a bit of the time, but that's
> not really that interesting of a test..).

I think that http://postgr.es/m/38e29a1c-0d20-fc73-badd-ca05f7f07ffa@pgmasters.net
is one of the more interesting emails on this topic. My conclusion
from that email, and the ones that led up to it, was that there is a
40-50% overhead from doing a SHA checksum, but in pgbackrest, users
don't see it because backups are compressed. Because the compression
uses so much CPU time, the additional overhead from the SHA checksum
is only a few percent more. But I don't think that it would be smart
to slow down uncompressed backups by 40-50%. That's going to cause a
problem for somebody, almost for sure.

> Maybe:
>
> "Using a SHA hash function provides a cryptographically secure digest
> of each file for users who wish to verify that the backup has not been
> tampered with, while the CRC32C algorithm provides a checksum which is
> much faster to calculate and good at catching errors due to accidental
> changes but is not resistent to targeted modifications. Note that, to
> be useful against an adversary who has access to the backup, the backup
> manifest would need to be stored securely elsewhere or otherwise
> verified to have not been modified since the backup was taken."
>
> This at least talks about things in a positive direction (SHA hash
> functions do this, CRC32C does that) rather than in a negative tone.

Cool. I like it.

> Anyway, my point here was really just that *pg_validatebackup* is about
> validating backups taken with pg_basebackup. While it's possible that
> it could be used for backups taken with other tools, I don't think
> that's really part of its actual mandate or that we're going to actively
> work to add such support in the future.

I think you're kind just nitpicking here, because the statement that
pg_validatebackup can validate not only a backup taken by
pg_basebackup but also a backup taken in using some compatible method
is just a tautology. But I'll remove the reference.

> In particular, the sentence right above this list is:
>
> "Certain files and directories are excluded from verification:"
>
> but we actually do verify the manifest, that's all I'm saying here.
>
> Maybe rewording that a bit is what would help, say:
>
> "Certain files and directories are not included in the manifest:"

Well, that'd be wrong, though. It's true that backup_manifest won't
have an entry in the manifest, and neither will WAL files, but
postgresql.auto.conf will. We'll just skip complaining about it if the
checksum doesn't match or whatever. The server generates manifest
entries for everything, and the client decides not to pay attention to
some of them because it knows that pg_basebackup may have made certain
changes that were not known to the server.

> Yeah, I get that it's not easy to figure out how to validate the WAL,
> but I stand by my opinion that it's simply not acceptable to exclude the
> necessary WAL from verification so and to claim that a backup is valid
> when we haven't checked the WAL.

I hear that, but I don't agree that having nothing is better than
having this much committed. I would be fine with renaming the tool
(pg_validatebackupmanifest? pg_validatemanifest?), or with updating
the documentation to be more clear about what is and is not checked,
but I'm not going to extent the tool to do totally new things for
which we don't even have an agreed design yet. I believe in trying to
create patches that do one thing and do it well, and this patch does
that. The fact that it doesn't do some other thing that is
conceptually related yet different is a good thing, not a bad one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-03-25 16:56:10 Re: potential stuck lock in SaveSlotToPath()
Previous Message Tom Lane 2020-03-25 16:45:46 Re: Some improvements to numeric sqrt() and ln()