Re: Online enabling of checksums

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Online enabling of checksums
Date: 2018-04-05 09:07:26
Message-ID: CABUevEwFKztSq=v9W157sYfund2AkTkk1bqrY65zM_y82c5NiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Wed, Apr 4, 2018 at 12:11 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On 04/03/2018 02:05 PM, Magnus Hagander wrote:
> > On Sun, Apr 1, 2018 at 2:04 PM, Magnus Hagander <magnus(at)hagander(dot)net
> > <mailto:magnus(at)hagander(dot)net>> wrote:
> >
> > On Sat, Mar 31, 2018 at 5:38 PM, Tomas Vondra
> > <tomas(dot)vondra(at)2ndquadrant(dot)com <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>>
> > wrote:
> > > And if you try this with a temporary table (not
> > hidden in transaction,
> > > > so the bgworker can see it), the worker will fail
> > with this:
> > > >
> > > > ERROR: cannot access temporary tables of other
> > sessions
> > > >
> > > > But of course, this is just another way how to crash
> > without updating
> > > > the result for the launcher, so checksums may end up
> > being enabled
> > > > anyway.
> > > >
> > > >
> > > > Yeah, there will be plenty of side-effect issues from
> that
> > > > crash-with-wrong-status case. Fixing that will at least
> > make things
> > > > safer -- in that checksums won't be enabled when not put
> > on all pages.
> > > >
> > >
> > > Sure, the outcome with checksums enabled incorrectly is a
> > consequence of
> > > bogus status, and fixing that will prevent that. But that
> > wasn't my main
> > > point here - not articulated very clearly, though.
> > >
> > > The bigger question is how to handle temporary tables
> > gracefully, so
> > > that it does not terminate the bgworker like this at all.
> > This might be
> > > even bigger issue than dropped relations, considering that
> > temporary
> > > tables are pretty common part of applications (and it also
> > includes
> > > CREATE/DROP).
> > >
> > > For some clusters it might mean the online checksum
> > enabling would
> > > crash+restart infinitely (well, until reaching
> MAX_ATTEMPTS).
> > >
> > > Unfortunately, try_relation_open() won't fix this, as the
> > error comes
> > > from ReadBufferExtended. And it's not a matter of simply
> > creating a
> > > ReadBuffer variant without that error check, because
> > temporary tables
> > > use local buffers.
> > >
> > > I wonder if we could just go and set the checksums anyway,
> > ignoring the
> > > local buffers. If the other session does some changes,
> > it'll overwrite
> > > our changes, this time with the correct checksums. But it
> > seems pretty
> > > dangerous (I mean, what if they're writing stuff while
> > we're updating
> > > the checksums? Considering the various short-cuts for
> > temporary tables,
> > > I suspect that would be a boon for race conditions.)
> > >
> > > Another option would be to do something similar to running
> > transactions,
> > > i.e. wait until all temporary tables (that we've seen at
> > the beginning)
> > > disappear. But we're starting to wait on more and more
> stuff.
> > >
> > > If we do this, we should clearly log which backends we're
> > waiting for,
> > > so that the admins can go and interrupt them manually.
> > >
> > >
> > >
> > > Yeah, waiting for all transactions at the beginning is pretty
> > simple.
> > >
> > > Making the worker simply ignore temporary tables would also be
> > easy.
> > >
> > > One of the bigger issues here is temporary tables are
> > *session* scope
> > > and not transaction, so we'd actually need the other session
> > to finish,
> > > not just the transaction.
> > >
> > > I guess what we could do is something like this:
> > >
> > > 1. Don't process temporary tables in the checksumworker,
> period.
> > > Instead, build a list of any temporary tables that existed
> > when the
> > > worker started in this particular database (basically anything
> > that we
> > > got in our scan). Once we have processed the complete
> > database, keep
> > > re-scanning pg_class until those particular tables are gone
> > (search by oid).
> > >
> > > That means that any temporary tables that are created *while*
> > we are
> > > processing a database are ignored, but they should already be
> > receiving
> > > checksums.
> > >
> > > It definitely leads to a potential issue with long running
> > temp tables.
> > > But as long as we look at the *actual tables* (by oid), we
> > should be
> > > able to handle long-running sessions once they have dropped
> > their temp
> > > tables.
> > >
> > > Does that sound workable to you?
> > >
> >
> > Yes, that's pretty much what I meant by 'wait until all
> > temporary tables
> > disappear'. Again, we need to make it easy to determine which
> > OIDs are
> > we waiting for, which sessions may need DBA's attention.
> >
> > I don't think it makes sense to log OIDs of the temporary
> > tables. There
> > can be many of them, and in most cases the connection/session is
> > managed
> > by the application, so the only thing you can do is kill the
> > connection.
> >
> >
> > Yeah, agreed. I think it makes sense to show the *number* of temp
> > tables. That's also a predictable amount of information -- logging
> > all temp tables may as you say lead to an insane amount of data.
> >
> > PFA a patch that does this. I've also added some docs for it.
> >
> > And I also noticed pg_verify_checksums wasn't installed, so fixed
> > that too.
> >
> >
> > PFA a rebase on top of the just committed verify-checksums patch.
> >
>
> This seems OK in terms of handling errors in the worker and passing it
> to the launcher. I haven't managed to do any crash testing today, but
> code-wise it seems sane.
>
> It however still fails to initialize the attempts field after allocating
> the db entry in BuildDatabaseList, so if you try running with
> -DRANDOMIZE_ALLOCATED_MEMORY it'll get initialized to values like this:
>
> WARNING: attempts = -1684366952
> WARNING: attempts = 1010514489
> WARNING: attempts = -1145390664
> WARNING: attempts = 1162101570
>
> I guess those are not the droids we're looking for?
>

When looking at that and after a quick discussion, we just decided it's
better to completely remove the retry logic. As you mentioned in some
earlier mail, we had all this logic to retry databases (unlikely) but not
relations (likely). Attached patch simplifies it to only detect the
"database was dropped" case (which is fine), and consider every other kind
of failure a permanent one and just not turn on checksums in those cases.

Likewise, I don't see where ChecksumHelperShmemStruct->abort gets
> initialized. I think it only ever gets set in launcher_exit(), but that
> does not seem sufficient. I suspect it's the reason for this behavior:
>

It's supposed to get initialized in ChecksumHelperShmemInit() -- fixed.
(The whole memset-to-zero)

test=# select pg_enable_data_checksums(10, 10);
> ERROR: database "template0" does not allow connections
> HINT: Allow connections using ALTER DATABASE and try again.
> test=# alter database template0 allow_connections = true;
> ALTER DATABASE
> test=# select pg_enable_data_checksums(10, 10);
> ERROR: could not start checksumhelper: already running
> test=# select pg_disable_data_checksums();
> pg_disable_data_checksums
> ---------------------------
>
> (1 row)
>
> test=# select pg_enable_data_checksums(10, 10);
> ERROR: could not start checksumhelper: has been cancelled
>

Turns out that wasn't the problem. The problem was that we *set* it before
erroring out with the "does not allow connections", but never cleared it.
In that case, it would be listed as launcher-is-running even though the
launcher was never started.

Basically the check for datallowconn was put in the wrong place. That check
should go away completely once we merge (because we should also merge the
part that allows us to bypass it), but for now I have moved the check to
the correct place.

At which point the only thing you can do is restarting the cluster,
> which seems somewhat unnecessary. But perhaps it's intentional?
>

Not at all.

Attached is a diff with a couple of minor comment tweaks, and correct
> initialization of the attempts field.
>

Thanks. This is included in the attached update, along with the above fixes
and some other small touches from Daniel.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

Attachment Content-Type Size
online_checksums10.patch text/x-patch 81.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2018-04-05 09:17:30 Re: [HACKERS] GUC for cleanup indexes threshold.
Previous Message Craig Ringer 2018-04-05 08:46:08 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS