Re: Online enabling of checksums

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Online enabling of checksums
Date: 2018-03-31 15:05:28
Message-ID: CABUevEzP+OTr+U9u5ZsP8PFyVR6CaH4A1yn9pTk9cvdfeeJHrw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

>
> On 03/31/2018 02:02 PM, Magnus Hagander wrote:
> > On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
> >
> >
> > But wait - there is more ;-) BuildRelationList is using
> heap_beginscan
> > with the regular snapshot, so it does not see uncommitted
> transactions.
> > So if you do this:
> >
> > BEGIN;
> > CREATE TABLE t AS SELECT i FROM generate_series(1,10000000) s(i);
> > -- run pg_enable_data_checksums() from another session
> > SELECT COUNT(*) FROM t;
> >
> > then the table will be invisible to the checksum worker, it won't
> have
> > checksums updated and the cluster will get checksums enabled. Which
> > means this:
> >
> >
> > Ugh. Interestingly enough I just put that on my TODO list *yesterday*
> > that I forgot to check that specific case :/
> >
>
> But I was faster in reporting it ;-)
>

Indeed you were :)

> > test=# SELECT COUNT(*) FROM t;
> > WARNING: page verification failed, calculated checksum 27170 but
> > expected 0
> > ERROR: invalid page in block 0 of relation base/16677/16683
> >
> > Not sure what's the best way to fix this - maybe we could wait for
> all
> > running transactions to end, before starting the work.
> >
> >
> > I was thinking of that as one way to deal with it, yes.
> >
> > I guess a reasonable way to do that would be to do it as part of
> > BuildRelationList() -- basically have that one wait until there is no
> > other running transaction in that specific database before we take the
> > snapshot?
> >
> > A first thought I had was to try to just take an access exclusive lock
> > on pg_class for a very short time, but a transaction that does create
> > table doesn't actually keep its lock on that table so there is no
> conflict.
> >
>
> Yeah, I don't think that's going to work, because you don't even know
> you need to lock/wait for something.

> I do think just waiting for all running transactions to complete is
> fine, and it's not the first place where we use it - CREATE SUBSCRIPTION
> does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY
> too, to some extent). So we have a precedent / working code we can copy.
>

Thinking again, I don't think it should be done as part of
BuildRelationList(). We should just do it once in the launcher before
starting, which will be both easier and cleaner. Anything started after
that will have checksums on it, so we should be fine.
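The launcher-side wait can be summed up as: snapshot the set of transactions running at start, then block until every member of that snapshot has finished, ignoring anything started later. A minimal Python sketch of that logic (purely illustrative — the actual patch would use PostgreSQL's virtual-xid machinery, much like CREATE INDEX CONCURRENTLY does; `get_active_xids` is a hypothetical callback standing in for a snapshot of running transactions):

```python
import time

def wait_for_preexisting_transactions(get_active_xids, poll_interval=0.1):
    """Block until every transaction that was running when we were called
    has finished.  Transactions started afterwards are never waited on:
    they already see checksums being enabled."""
    # Snapshot the transactions running right now.
    waiting_for = set(get_active_xids())
    while waiting_for:
        # Keep only snapshot members that are still active; transactions
        # started after the snapshot can never enter the set.
        waiting_for &= set(get_active_xids())
        if waiting_for:
            time.sleep(poll_interval)
```

The key property is that the wait set only shrinks, so a busy system that keeps opening new transactions cannot starve the launcher.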

PFA one that does this.

> And if you try this with a temporary table (not hidden in transaction,
> > so the bgworker can see it), the worker will fail with this:
> >
> > ERROR: cannot access temporary tables of other sessions
> >
> > But of course, this is just another way how to crash without updating
> > the result for the launcher, so checksums may end up being enabled
> > anyway.
> >
> >
> > Yeah, there will be plenty of side-effect issues from that
> > crash-with-wrong-status case. Fixing that will at least make things
> > safer -- in that checksums won't be enabled when not put on all pages.
> >
>
> Sure, the outcome with checksums enabled incorrectly is a consequence of
> bogus status, and fixing that will prevent that. But that wasn't my main
> point here - not articulated very clearly, though.
>
> The bigger question is how to handle temporary tables gracefully, so
> that it does not terminate the bgworker like this at all. This might be
> even bigger issue than dropped relations, considering that temporary
> tables are pretty common part of applications (and it also includes
> CREATE/DROP).
>
> For some clusters it might mean the online checksum enabling would
> crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
>
> Unfortunately, try_relation_open() won't fix this, as the error comes
> from ReadBufferExtended. And it's not a matter of simply creating a
> ReadBuffer variant without that error check, because temporary tables
> use local buffers.
>
> I wonder if we could just go and set the checksums anyway, ignoring the
> local buffers. If the other session does some changes, it'll overwrite
> our changes, this time with the correct checksums. But it seems pretty
> dangerous (I mean, what if they're writing stuff while we're updating
> the checksums? Considering the various short-cuts for temporary tables,
> I suspect that would be a boon for race conditions.)
>
> Another option would be to do something similar to running transactions,
> i.e. wait until all temporary tables (that we've seen at the beginning)
> disappear. But we're starting to wait on more and more stuff.
>
> If we do this, we should clearly log which backends we're waiting for,
> so that the admins can go and interrupt them manually.
>

Yeah, waiting for all transactions at the beginning is pretty simple.

Making the worker simply ignore temporary tables would also be easy.

One of the bigger issues here is that temporary tables are *session*
scoped, not transaction scoped, so we'd actually need the other session to
finish, not just the transaction.

I guess what we could do is something like this:

1. Don't process temporary tables in the checksumworker, period. Instead,
build a list of any temporary tables that existed when the worker started
in this particular database (basically anything that we got in our scan).
Once we have processed the complete database, keep re-scanning pg_class
until those particular tables are gone (search by oid).

That means that any temporary tables that are created *while* we are
processing a database are ignored, but they should already be receiving
checksums.

It definitely leads to a potential issue with long-running temp tables. But
as long as we look at the *actual tables* (by oid), we should be able to
handle long-running sessions once they have dropped their temp tables.
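A rough Python sketch of that worker loop (illustrative only — in the real checksumworker this would be C over pg_class scans; `scan_pg_class`, `process_relation`, and `is_temp` are hypothetical callbacks standing in for the catalog scan, the per-relation checksum pass, and the temp-table check):

```python
def enable_checksums_in_database(scan_pg_class, process_relation, is_temp,
                                 poll=lambda: None):
    """Checksum every non-temp relation, remembering the OIDs of temp
    tables seen on the initial scan, then re-scan pg_class until those
    specific tables have been dropped."""
    pending_temp_oids = set()
    for oid in scan_pg_class():
        if is_temp(oid):
            # Other sessions' temp tables can't be checksummed from here;
            # just remember them so we know when they are gone.
            pending_temp_oids.add(oid)
        else:
            process_relation(oid)
    # Temp tables created after our scan are ignored on purpose: they were
    # created while checksums were already being enabled, so they should
    # have been written with checksums from the start.
    while pending_temp_oids:
        pending_temp_oids &= set(scan_pg_class())
        if pending_temp_oids:
            poll()
```

Matching by OID is what makes long-running sessions tolerable: the worker stops waiting as soon as the specific tables it saw are dropped, even if the owning sessions stay connected.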

Does that sound workable to you?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Attachment Content-Type Size
online_checksums8.patch text/x-patch 77.4 KB
