Re: Online enabling of checksums

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Online enabling of checksums
Date: 2018-03-31 12:02:54
Message-ID: CABUevEz=toBTM2LTVS_pqe69FKLq79DOi=Bf76ybVfJjK93ThA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> I've been looking at the patch a bit more, and I think there are a
> couple of fairly serious issues in the error handling.
>

Thanks!

>
> Firstly ChecksumHelperLauncherMain spends quite a bit of effort on
> skipping dropped databases, but ChecksumHelperWorkerMain does not do the
> same thing with tables. I'm not exactly sure why, but I'd say dropped
> tables are more likely than dropped databases (e.g. because of temporary
> tables) and it's strange to gracefully handle the more rare case.
>

Uh, yes. I could've sworn we had code for that, but I fully agree with your
assessment that it's not there :)

Now, when a table gets dropped after BuildRelationList() does it's work,
> we end up calling ProcessSingleRelationByOid() on that OID. Which calls
> relation_open(), which fails with elog(ERROR), terminating the whole
> bgworker with an error like this:
>
> ERROR: could not open relation with OID 16632
> LOG: background worker "checksumhelper worker" (PID 27152) exited
> with exit code 1
>

Yeah. We need to trap that error an not crash and burn.

Which however means the error handling in ChecksumHelperWorkerMain() has
> no chance to kick in, because the bgworker dies right away. The code
> looks like this:
>
> foreach(lc, RelationList)
> {
> ChecksumHelperRelation *rel
> = (ChecksumHelperRelation *) lfirst(lc);
>
> if (!ProcessSingleRelationByOid(rel->reloid, strategy))
> {
> ChecksumHelperShmem->success = ABORTED;
> break;
> }
> else
> ChecksumHelperShmem->success = SUCCESSFUL;
> }
> list_free_deep(RelationList);
>
> Now, assume the first relation in the list still exists and gets
> processed correctly, so "success" ends up being SUCCESSFUL. Then the
> second OID is the dropped relation, which kills the bgworker ...
>

Indeed, that's just a very simple bug. It shouldn't be set to SUCCESSFUL
until *all* tables have been processed. I believe the code needs to be this:

IMHO this error handling is broken by design - two things need to
> happen, I think: (a) graceful handling of dropped relations and (b)
> proper error reporting from the bgworder.
>
> (a) Should not be difficult to do, I think. We don't have relation_open
> with a missing_ok flag, but implementing something like that should not
> be difficult. Even a simple "does OID exist" should be enough.
>

Not entirely sure what you mean with "even a simple does oid exist" means?
I mean, if we check for the file, that won't help us -- there will still be
a tiny race between the check and us opening it won't it?

However, we have try_relation_open(). Which is documented as:
* Same as relation_open, except return NULL instead of failing
* if the relation does not exist.

So I'm pretty sure it's just a matter of using try_relation_open() instead,
and checking for NULL?

(b) But just handling dropped relations is not enough, because I could
> simply kill the bgworker directly, and it would have exactly the same
> consequences. What needs to happen is something like this:
>

<snip>
And now I see your code, which was below-fold when I first read. After
having writing a very similar fix myself. I'm glad this code looks mostly
identical to what I suggested above, so I think that's a good solution.

> BTW I don't think handling dropped relations by letting the bgworker
> crash and restart is an acceptable approach. That would pretty much mean
> any DDL changes are prohibited on the system while the checksum process
> is running, which is not quite possible (e.g. for systems doing stuff
> with temporary tables).
>

No, I don't like that at all. We need to handle them gracefully, by
skipping them - crash and restart is not acceptable for something that
common.

Which however reminds me I've also ran into a bug in the automated retry
> system, because you may get messages like this:
>
> ERROR: failed to enable checksums in "test", giving up (attempts
> 639968292).
>
> This happens because BuildDatabaseList() does just palloc() and does not
> initialize the 'attempts' field. It may get initialized to 0 by chance,
> but I'm running with -DRANDOMIZE_ALLOCATED_MEMORY, hence the insanely
> high value.
>

Eh. I don't have that "(attempts" part in my code at all. Is that either
from some earlier version of the patch, or did you add that yourself for
testing?

BTW both ChecksumHelperRelation and ChecksumHelperDatabase have
> 'success' field which is actually unused (and uninitialized).
>

Correct. These are old leftovers from the "partial restart" logic, and
should be removed.

But wait - there is more ;-) BuildRelationList is using heap_beginscan
> with the regular snapshot, so it does not see uncommitted transactions.
> So if you do this:
>
> BEGIN;
> CREATE TABLE t AS SELECT i FROM generate_series(1,10000000) s(i);
> -- run pg_enable_data_checksums() from another session
> SELECT COUNT(*) FROM t;
>
> then the table will be invisible to the checksum worker, it won't have
> checksums updated and the cluster will get checksums enabled. Which
> means this:
>

Ugh. Interestingly enough I just put that on my TODO list *yesterday* that
I forgot to check that specific case :/

test=# SELECT COUNT(*) FROM t;
> WARNING: page verification failed, calculated checksum 27170 but
> expected 0
> ERROR: invalid page in block 0 of relation base/16677/16683
>
> Not sure what's the best way to fix this - maybe we could wait for all
> running transactions to end, before starting the work.
>

I was thinking of that as one way to deal with it, yes.

I guess a reasonable way to do that would be to do it as part of
BuildRelationList() -- basically have that one wait until there is no other
running transaction in that specific database before we take the snapshot?

A first thought I had was to try to just take an access exclusive lock on
pg_class for a very short time, but a transaction that does create table
doesn't actually keep it's lock on that table so there is no conflict.

And if you try this with a temporary table (not hidden in transaction,
> so the bgworker can see it), the worker will fail with this:
>
> ERROR: cannot access temporary tables of other sessions
>
> But of course, this is just another way how to crash without updating
> the result for the launcher, so checksums may end up being enabled anyway.
>

Yeah, there will be plenty of side-effect issues from that
crash-with-wrong-status case. Fixing that will at least make things safer
-- in that checksums won't be enabled when not put on all pages.

I have attached a patch that fixes the "easy" ones per your first comments.
No solution for the open-transaction yet, but I wanted to put the rest out
there -- especially if you have automated tests you can send it through.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

Attachment Content-Type Size
online_checksums7.patch text/x-patch 76.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-03-31 12:13:58 Re: Problem while setting the fpw with SIGHUP
Previous Message Félix GERZAGUET 2018-03-31 11:17:04 Re: bulk typos