Re: pgsql: Add TAP tests for pg_verify_checksums

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Postgres hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pgsql: Add TAP tests for pg_verify_checksums
Date: 2018-10-20 04:25:19
Message-ID: CAOuzzgpAk3C6uAdyP6MQdamVRK5u05FkqRRmZOk0xqAs9DTqKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Greetings,

On Fri, Oct 19, 2018 at 23:40 Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Fri, Oct 19, 2018 at 06:43:18PM -0400, Tom Lane wrote:
> > Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> writes:
> >> I don't think just reverting it is really acceptable.
> >
> > +several. I do not mind somebody writing and installing a better fix.
> > I do object to turning the buildfarm red again.
>
> I did not expect this thread to turn into a war zone. Anyway, there are
> a couple of things I agree with on this thread:
> - I agree with Andres point here:
> https://postgr.es/m/20181019171747.4uithw2sjkt6msne@alap3.anarazel.de
> A blacklist is fundamentally more difficult to maintain as there are
> way more things added in a data folder which do not have data checksums
> than things which have checksums. So using a blacklist approach looks
> unmaintainable in the long term. Future patches related to enabling
> online checksum verification make me worry if we keep the code like
> that. I can also easily imagine that anybody willing to use the
> pluggable storage API would like to put new files in tablespace-related
> data folders, relying on "base/" being the default system tablespace
>

If we are going to decide that we only deal with files in our directories
matching these whitelisted patterns, then shouldn’t we apply similar logic
in things like DROP DATABASE and any other cases where we perform actions
in a recursive manner across our database and table space directories?

Should we really be removing arbitrary files that we know nothing about,
after all?

What about pg_basebackup? Shall we update it to only stream through files
matching these patterns as those are the only files we consider ourselves
to be aware of?

I don’t buy off on any argument that presents pluggable storage as not
including some way for us to track and be aware of what files are
associated with that pluggable storage mechanism and which of those files
have checksums and how to verify them if they do. In other words, I sure
hope we don’t accept an approach like cstore *fdw* uses for pluggable
storage where the core system has no idea whatsoever about what these
random files dropped into our tablespace directories are.

- I agree with Stephen's point that we should decide if a file has
> checksums or not in a single place, and that we should use the same
> logic for base backups and pg_verify_checksums.

Despite it being a lot of the discussion, I don’t think there was ever
disagreement on this point.

- I agree with not doing a simple revert to not turn the buildfarm red
> again. This is annoying for animal maintainers. Andrew has done a very
> nice work in disabling manually those tests temporarily.

This is a red herring, and always was, so I’m rather unimpressed at how it
keeps coming up- no, I’m not advocating that we should just make the build
farm red and just leave it that way. Yes, we should fix this case, and fix
pg_basebackup, and maybe even try to add some regression tests which test
this exact same case in pg_basebackup, but making the build farm green is
*not* the only thing we should care about.

- The base backup logic deciding if a file has checksums looks broken to
> me: it misses files generated by EXEC_BACKEND, and any instance of
> Postgres using an extension with custom files and data checksums has its
> backups broken. cstore_fdw has been mentioned above, and I recall that
> Heroku for example enables data checksums. If you combine both, it
> basically means that such an instance cannot take base backups anymore
> while it was able to do so with pre-10 with default options. That's not
> cool.

This is incorrect, pg_basebackup will still back up and keep files which
fail checksum checks- logic which David Steele and I pushed for when the
checksumming logic was added to pg_basebackup, as I recall, but it’ll
complain and warn about these files ...

As it *should*, even if we weren’t doing checksum checks, because these are
random files that have been dropped into a PG directory that PG doesn’t
know anything about, which aren’t handled through WAL and therefore there’s
no way to know if they’ll be at all valid when they’re copied, or that
backing them up is at all useful.

Backups are absolutely important and we wouldn’t want a backup to be
aborted or failed due to checksum failures, in general, but the user should
be made aware of these failures. This approach of only taking
responsibility for the files we know of through these patterns could also
imply that we shouldn’t back up in pg_basebackup files that don’t match-
but that strikes me as another similarly risky approach as any mistake in
this whitelist of known files could then result in an inconsistent backup
that’s missing things that should have been included.

As for which approach is easier to maintain, I don’t see one as being
meaningfully much easier to maintain than the other, code wise, while
having these patterns looks to me as having a lot more risk, with the only
rather nebulous gain being that we don’t complain about extensions dropping
random files in places they shouldn’t have been- and which, if they had
actually been implemented in a way we’d be accepting of (I.e. pluggable
storage), we would have been tracking and handling appropriately.

So what I think we ought to do is the following:
> - Start a new thread, this one about TAP tests is not adapted.

I don’t have any objection to a new thread. I don’t know that it’ll really
get more people to comment, but perhaps it will.

Thanks!

Stephen

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Michael Paquier 2018-10-20 04:30:46 Re: pgsql: Add TAP tests for pg_verify_checksums
Previous Message Andres Freund 2018-10-20 03:50:04 Re: pgsql: Add TAP tests for pg_verify_checksums

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-10-20 04:30:46 Re: pgsql: Add TAP tests for pg_verify_checksums
Previous Message Laurenz Albe 2018-10-20 04:24:28 Re: Function to promote standby servers