Re: Online checksums patch - once again

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums patch - once again
Date: 2020-01-03 22:07:49
Message-ID: CA+Tgmobw3iHprjbqOZ31Y7G2bfKFAEB8dPyX0153h00UrVa2yQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 16, 2019 at 10:16 AM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
> If reviewers think this version is nearing completion, then a v16 should
> address the comment below, but as this version switches its underlying
> infrastructure it seemed usefel for testing still.

I think this patch still needs a lot of work.

- doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
+ doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites ||
DataChecksumsInProgress());

This will have a small performance cost in a pretty hot code path. Not
sure that it's enough to worry about, though.

-DataChecksumsEnabled(void)
+DataChecksumsNeedWrite(void)
{
Assert(ControlFile != NULL);
return (ControlFile->data_checksum_version > 0);
}

This seems troubling, because data_checksum_version can now change,
but you're still accessing it without a lock. This complain applies
likewise to a bunch of related functions in xlog.c as well.

+ elog(ERROR, "Checksums not in inprogress mode");

Questionable capitalization and punctuation.

+void
+SetDataChecksumsOff(void)
+{
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+ ControlFile->data_checksum_version = 0;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM));
+
+ XlogChecksums(0);
+}

This looks racey. Suppose that checksums are on. Other backends will
see that checksums are disabled as soon as
ControlFile->data_checksum_version = 0 happens, and they will feel
free to write blocks without checksums. Now we crash, and those blocks
exist on disk even though the on-disk state still otherwise shows
checksums fully enabled. It's a little better if we stop reading
data_checksum_version without a lock, because now nobody else can see
the updated state until we've actually updated the control file. But
even then, isn't it strange that writes of non-checksummed stuff could
appear or be written to disk before XlogChecksums(0) happens? If
that's safe, it seems like it deserves some kind of comment.

+ /*
+ * If we reach this point with checksums in inprogress state, we notify
+ * the user that they need to manually restart the process to enable
+ * checksums. This is because we cannot launch a dynamic background worker
+ * directly from here, it has to be launched from a regular backend.
+ */
+ if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+ ereport(WARNING,
+ (errmsg("checksum state is \"inprogress\" with no worker"),
+ errhint("Either disable or enable checksums by calling the
pg_disable_data_checksums() or pg_enable_data_checksums()
functions.")));

This seems pretty half-baked.

+ (errmsg("could not start checksumhelper: has been canceled")));
+ (errmsg("could not start checksumhelper: already running")));
+ (errmsg("failed to start checksum helper launcher")));

These seem rough around the edges. Using an internal term like
'checksumhelper' in a user-facing error message doesn't seem great.
Generally primary error messages are phrased as a single utterance
where we can, rather than colon-separated fragments like this. The
third message calls it 'checksum helper launcher' whereas the other
two call it 'checksumhelper'. It also isn't very helpful; I don't
think most people like a message saying that something failed with no
explanation given.

+ elog(DEBUG1, "Checksumhelper skipping relation %d as it no longer
exists", relationId);

Here's another way to spell 'checksumhelper', and this time it refers
to the worker rather than the launcher. Also, relation IDs are OIDs,
so need to be printed with %u, and usually we try to print names if
possible. Also, this message, like a lot of messages in this patch,
begins with a capital letter and does not end with a period. That is
neither the style for primary messages nor the style for detail
messages. As these are primary messages, the primary message style
should be used. That style is no capital and no period.

+ if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+ {
+ ereport(LOG,
+ (errmsg("failed to start worker for checksumhelper in \"%s\"",
+ db->dbname)));
+ return FAILED;
+ }

I don't think having constants with names like
SUCCESSFUL/ABORTED/FAILED is a very good idea. Too much chance of name
collisions. I suggest adding a prefix.

Also, the retry logic here doesn't look particularly robust.
RegisterDynamicBackgroundWorker will fail if all slots are available;
if that happens twice for the same database, once on first attempting
it and again when retrying it, the whole process fails, all state is
lost, and all work has to be redone. That seems neither particularly
unlikely nor pleasant.

+ if (DatabaseList == NIL || list_length(DatabaseList) == 0)

I don't think that the second half of this test serves any purpose.

+ snprintf(activity, sizeof(activity), "Waiting for current
transactions to finish (waiting for %d)", waitforxid);

%u here too.

+ if (pgc->relpersistence == 't')

Use the relevant constant.

+ /*
+ * Wait for all temp tables that existed when we started to go away. This
+ * is necessary since we cannot "reach" them to enable checksums. Any temp
+ * tables created after we started will already have checksums in them
+ * (due to the inprogress state), so those are safe.
+ */

This does not seem very nice. It just leaves a worker running more or
less forever. It's essentially waiting for temp-table using sessions
to go away, but that could take a really long time.

+ WAIT_EVENT_PG_SLEEP);

You need to invent a new wait event and add docs for it.

+ if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_CHECKSUM))
+ {
+ /*
+ * By virtue of getting here (i.e. interrupts being processed), we
+ * know that this backend won't have any in-progress writes (which
+ * might have missed the checksum change).
+ */
+ }

I don't believe this. I already wrote some about this over here:

http://postgr.es/m/CA+TgmobORrsgSUydZ3MsSw9L5MBUGz7jRK+973uPZgiyCQ81ag@mail.gmail.com

As a general point, I think that the idea of the ProcSignalBarrier
mechanism is that every backend has some private state that needs to
be updated, and when it absorbs the barrier you know that it's updated
that state, and so when everybody's absorbed the barrier you know that
all the state has been updated. Here, however, there actually is no
backend-private state. The only state that everyone's consulting is
the shared state stored in ControlFile->data_checksum_version. So what
does absorbing the barrier prove? Only that we've reached a
CHECK_FOR_INTERRUPTS(). But that is a useful guarantee only if we
never check for interrupts between the time we examine
ControlFile->data_checksum_version and the time we use it, and I see
no particular reason to believe that should always be true, and I
suspect it isn't, and even if it happens to be true today I think it
could get broken in the future pretty easily. There are no particular
restrictions documented in terms of where DataChecksumsNeedWrite(),
XLogHintBitIsNeeded(), etc. can be checked or what can be done between
checking the value and using it. The issue doesn't arise for today's
DataChecksumsEnabled() because the value can't ever change, but with
this patch things can change, and to me what the patch does about that
doesn't really look adequate.

I'm sort of out of time for right now, but I think this patch needs a
lot more work on the concurrency end of things. It seems to me that it
probably mostly works in practice, but that the whole concurrency
mechanism is not very solid and probably has a lot of rare cases where
it can just misbehave if you get unlucky. I'll try to spend some more
time thinking about this next week. I also think that the fact that
the mechanism for getting from 'inprogress' to 'on' seems fragile and
under-engineered. It still bothers me that there's no mechanism for
persisting the progress that we've made in enabling checksums; but
even apart from that, I think that there just hasn't been enough
thought given here to error/problem cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-01-03 22:09:37 Re: Greatest Common Divisor
Previous Message Daniel Gustafsson 2020-01-03 21:57:54 Re: Removal of support for OpenSSL 0.9.8 and 1.0.0