Re: Online enabling of checksums

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Online enabling of checksums
Date: 2018-03-31 14:21:47
Message-ID: 87631c42-a5cf-304f-3ac4-b30c058f6153@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 03/31/2018 02:02 PM, Magnus Hagander wrote:
> On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
> ...
>
> (a) Should not be difficult to do, I think. We don't have relation_open
> with a missing_ok flag, but implementing something like that should not
> be difficult. Even a simple "does OID exist" should be enough.
>
>
> Not entirely sure what you mean with "even a simple does oid exist"
> means? I mean, if we check for the file, that won't help us -- there
> will still be a tiny race between the check and us opening it won't it?
>

I meant to say "even a simple check whether the OID still exists", but it
was a bit too late / not enough caffeine. You're right that there would
still be a tiny race window - but it'd be much shorter, possibly short
enough to make the error+restart approach acceptable.

> However, we have try_relation_open().  Which is documented as:
>  *Same as relation_open, except return NULL instead of failing
>  *if the relation does not exist.
>
> So I'm pretty sure it's just a matter of using try_relation_open()
> instead, and checking for NULL?
>

Oh, right. I thought we had a relation_open variant that handles this,
but I was looking for one with a missing_ok flag and so I missed it.
try_relation_open() should do the trick when it comes to dropped tables.
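To illustrate, a minimal sketch of how the worker might process one
relation while tolerating a concurrent DROP TABLE (this is not the actual
patch; the function name ProcessSingleRelationByOid is hypothetical, but
try_relation_open() is the real internal API being discussed):

```c
/*
 * Hypothetical sketch: process one relation in the checksum worker,
 * tolerating a concurrent DROP TABLE. try_relation_open() returns
 * NULL instead of erroring out when the relation no longer exists.
 */
static bool
ProcessSingleRelationByOid(Oid relationId)
{
    Relation    rel;

    rel = try_relation_open(relationId, AccessShareLock);
    if (rel == NULL)
    {
        /* relation was dropped concurrently -- just skip it */
        return true;
    }

    /* ... rewrite all blocks with checksums set ... */

    relation_close(rel, AccessShareLock);
    return true;
}
```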

>
> (b) But just handling dropped relations is not enough, because I could
> simply kill the bgworker directly, and it would have exactly the same
> consequences. What needs to happen is something like this:
>
>
> <snip>
> And now I see your code, which was below-fold when I first read. After
> having writing a very similar fix myself. I'm glad this code looks
> mostly identical to what I suggested above, so I think that's a good
> solution.
>

Good ;-)

>  
>
> BTW I don't think handling dropped relations by letting the bgworker
> crash and restart is an acceptable approach. That would pretty much mean
> any DDL changes are prohibited on the system while the checksum process
> is running, which is not quite possible (e.g. for systems doing stuff
> with temporary tables).
>
>
> No, I don't like that at all. We need to handle them gracefully, by
> skipping them - crash and restart is not acceptable for something that
> common.
>

Yeah, I was just thinking aloud.

>
> Which however reminds me I've also ran into a bug in the automated retry
> system, because you may get messages like this:
>
>     ERROR:  failed to enable checksums in "test", giving up (attempts
>             639968292).
>
> This happens because BuildDatabaseList() does just palloc() and does not
> initialize the 'attempts' field. It may get initialized to 0 by chance,
> but I'm running with -DRANDOMIZE_ALLOCATED_MEMORY, hence the insanely
> high value.
>
>
> Eh. I don't have that "(attempts" part in my code at all. Is that either
> from some earlier version of the patch, or did you add that yourself for
> testing?
>

Apologies, you're right - I tweaked the message a bit (just added the
number of attempts to it). The logic however remains the same, and the
bug is real.
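The fix is trivial - a sketch (the struct layout and field names here
are illustrative, following what BuildDatabaseList in the patch appears
to allocate):

```c
/*
 * Sketch of the fix: allocate with palloc0() instead of palloc(), so
 * the whole entry -- including 'attempts' -- starts zeroed rather
 * than containing whatever garbage the allocator returns. That
 * garbage is exactly what -DRANDOMIZE_ALLOCATED_MEMORY makes visible
 * immediately, hence the insanely high attempt count in the message.
 * (Explicitly setting db->attempts = 0 after palloc() would also do.)
 */
ChecksumDatabase *db = (ChecksumDatabase *) palloc0(sizeof(ChecksumDatabase));

db->dboid = dboid;
db->dbname = pstrdup(dbname);
```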

>
> But wait - there is more ;-) BuildRelationList is using heap_beginscan
> with the regular snapshot, so it does not see uncommitted transactions.
> So if you do this:
>
>   BEGIN;
>   CREATE TABLE t AS SELECT i FROM generate_series(1,10000000) s(i);
>   -- run pg_enable_data_checksums() from another session
>   SELECT COUNT(*) FROM t;
>
> then the table will be invisible to the checksum worker, it won't have
> checksums updated and the cluster will get checksums enabled. Which
> means this:
>
>
> Ugh. Interestingly enough I just put that on my TODO list *yesterday*
> that I forgot to check that specific case :/
>

But I was faster in reporting it ;-)

>
>   test=# SELECT COUNT(*) FROM t;
>   WARNING:  page verification failed, calculated checksum 27170 but
>             expected 0
>   ERROR:  invalid page in block 0 of relation base/16677/16683
>
> Not sure what's the best way to fix this - maybe we could wait for all
> running transactions to end, before starting the work.
>
>
> I was thinking of that as one way to deal with it, yes.
>
> I guess a reasonable way to do that would be to do it as part of
> BuildRelationList() -- basically have that one wait until there is no
> other running transaction in that specific database before we take the
> snapshot?
>
> A first thought I had was to try to just take an access exclusive lock
> on pg_class for a very short time, but a transaction that does create
> table doesn't actually keep it's lock on that table so there is no conflict.
>

Yeah, I don't think that's going to work, because you don't even know
you need to lock/wait for something.

I do think just waiting for all running transactions to complete is
fine, and it's not the first place where we use it - CREATE SUBSCRIPTION
does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY
too, to some extent). So we have a precedent / working code we can copy.
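For reference, the CREATE INDEX CONCURRENTLY precedent waits out older
transactions roughly like this (a simplified sketch of the pattern in
DefineIndex; the surrounding variable setup is elided, and limitXmin
here stands for the xmin of the snapshot we want everyone to be newer
than):

```c
/*
 * Simplified sketch of the "wait for all older transactions" pattern
 * from CREATE INDEX CONCURRENTLY: collect the virtual xids of
 * transactions that might still be running with an older snapshot,
 * then wait for each of them to finish. A checksum worker could do
 * the same before taking the snapshot for BuildRelationList().
 */
VirtualTransactionId *old_snapshots;
int         n_old_snapshots;
int         i;

old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
                                      PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
                                      &n_old_snapshots);

for (i = 0; i < n_old_snapshots; i++)
{
    if (VirtualTransactionIdIsValid(old_snapshots[i]))
        VirtualXactLock(old_snapshots[i], true);    /* true = wait */
}
```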

>
> And if you try this with a temporary table (not hidden in transaction,
> so the bgworker can see it), the worker will fail with this:
>
>   ERROR:  cannot access temporary tables of other sessions
>
> But of course, this is just another way how to crash without updating
> the result for the launcher, so checksums may end up being enabled
> anyway.
>
>
> Yeah, there will be plenty of side-effect issues from that
> crash-with-wrong-status case. Fixing that will at least make things
> safer -- in that checksums won't be enabled when not put on all pages. 
>

Sure, checksums getting enabled incorrectly is a consequence of the
bogus status, and fixing that will prevent it. But that wasn't my main
point here - though I didn't articulate it very clearly.

The bigger question is how to handle temporary tables gracefully, so
that they don't terminate the bgworker like this at all. This might be
an even bigger issue than dropped relations, considering that temporary
tables are a pretty common part of applications (and their use also
involves CREATE/DROP).

For some clusters it might mean the online checksum enabling would
crash and restart over and over (well, until reaching MAX_ATTEMPTS).

Unfortunately, try_relation_open() won't fix this, as the error comes
from ReadBufferExtended. And it's not a matter of simply creating a
ReadBuffer variant without that error check, because temporary tables
use local buffers.

I wonder if we could just go and set the checksums anyway, ignoring the
local buffers. If the other session makes some changes, it'll overwrite
our changes, this time with the correct checksums. But it seems pretty
dangerous (I mean, what if the other session is writing stuff while
we're updating the checksums? Considering the various shortcuts taken
for temporary tables, I suspect that would be a breeding ground for
race conditions.)

Another option would be to do something similar to running transactions,
i.e. wait until all temporary tables (that we've seen at the beginning)
disappear. But we're starting to wait on more and more stuff.

If we do this, we should clearly log which backends we're waiting for,
so that the admins can go and interrupt them manually.
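Whichever approach wins (skipping or waiting), the worker first has to
recognize other sessions' temporary tables. A sketch of that check,
using the real RELATION_IS_OTHER_TEMP() macro from utils/rel.h; where
exactly this would hook into the worker's scan is an assumption:

```c
/*
 * Sketch: recognizing temporary tables in the worker.
 * RELATION_IS_OTHER_TEMP() is true for temp tables belonging to
 * *other* backends -- exactly the ones whose local buffers we cannot
 * read (hence the "cannot access temporary tables of other sessions"
 * error from ReadBufferExtended).
 */
if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
{
    if (RELATION_IS_OTHER_TEMP(rel))
    {
        /*
         * Either skip the relation, or remember the owning backend
         * so we can wait for it -- and log it, so admins can go and
         * interrupt it manually.
         */
    }
}
```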

> I have attached a patch that fixes the "easy" ones per your first
> comments. No solution for the open-transaction yet, but I wanted to put
> the rest out there -- especially if you have automated tests you can
> send it through.
>

I don't have automated tests, but I'll take a look.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
