Re: Online checksums patch - once again

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Magnus Hagander <magnus(at)hagander(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Online checksums patch - once again
Date: 2020-01-18 23:18:18
Message-ID: BA7F5F48-9DCF-408A-8EDC-39824535ECF1@yesql.se
Lists: pgsql-hackers

> On 3 Jan 2020, at 23:07, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Dec 16, 2019 at 10:16 AM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>> If reviewers think this version is nearing completion, then a v16 should
>> address the comment below, but as this version switches its underlying
>> infrastructure it seemed useful for testing still.
>
> I think this patch still needs a lot of work.

Thanks a lot for your thorough review, much appreciated! Also, sorry for
being slow to respond. Below are fixes and responses to most of the feedback,
but I need a bit more time to think about the concurrency aspects that you
brought up. However, in the spirit of showing work early/often, I opted to
send this partial response anyway, to hopefully close at least some of the
raised issues in the meantime.

> - doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
> + doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites ||
> DataChecksumsInProgress());
>
> This will have a small performance cost in a pretty hot code path. Not
> sure that it's enough to worry about, though.

Not sure either, nor how clever compilers are about inlining this. As a
test, I've switched this over to be a static inline function, as its only
consumer is in xlog.c. Now, as mentioned later in this review, reading the
version unlocked has issues, so do consider this a WIP test, not a final
suggestion.
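
For illustration, a minimal sketch of what I tried, following the pattern of
the existing checks (the unlocked read being the known-problematic part):

    static inline bool
    DataChecksumsInProgress(void)
    {
        Assert(ControlFile != NULL);
        return (ControlFile->data_checksum_version ==
                PG_DATA_CHECKSUM_INPROGRESS_VERSION);
    }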

> -DataChecksumsEnabled(void)
> +DataChecksumsNeedWrite(void)
> {
> Assert(ControlFile != NULL);
> return (ControlFile->data_checksum_version > 0);
> }
>
> This seems troubling, because data_checksum_version can now change,
> but you're still accessing it without a lock. This complaint applies
> likewise to a bunch of related functions in xlog.c as well.

Right, let me do some more thinking on this before addressing it in the next
version of the patch. Simply wrapping it in a shared lock still has TOCTOU
problems, so a bit more work/thinking is needed.
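
To spell out the window (a sketch of the problem, not proposed code):

    bool        in_progress;

    LWLockAcquire(ControlFileLock, LW_SHARED);
    in_progress = (ControlFile->data_checksum_version ==
                   PG_DATA_CHECKSUM_INPROGRESS_VERSION);
    LWLockRelease(ControlFileLock);

    /* data_checksum_version may change here: the TOCTOU window */
    if (in_progress)
    {
        /* acting on a possibly stale value */
    }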

> + elog(ERROR, "Checksums not in inprogress mode");
>
> Questionable capitalization and punctuation.

Fixed the capitalization, but elog messages shouldn't end with a period, so I
left that.
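
That is, the message presumably now reads:

    elog(ERROR, "checksums not in inprogress mode");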

> +void
> +SetDataChecksumsOff(void)
> +{
> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
> + ControlFile->data_checksum_version = 0;
> + UpdateControlFile();
> + LWLockRelease(ControlFileLock);
> + WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM));
> +
> + XlogChecksums(0);
> +}
>
> This looks racey. Suppose that checksums are on. Other backends will
> see that checksums are disabled as soon as
> ControlFile->data_checksum_version = 0 happens, and they will feel
> free to write blocks without checksums. Now we crash, and those blocks
> exist on disk even though the on-disk state still otherwise shows
> checksums fully enabled. It's a little better if we stop reading
> data_checksum_version without a lock, because now nobody else can see
> the updated state until we've actually updated the control file. But
> even then, isn't it strange that writes of non-checksummed stuff could
> appear or be written to disk before XlogChecksums(0) happens? If
> that's safe, it seems like it deserves some kind of comment.

As mentioned above, I would like to address this in the next version. I'm
working on it, just need a little more time and wanted to share progress on the
other bits.

> + /*
> + * If we reach this point with checksums in inprogress state, we notify
> + * the user that they need to manually restart the process to enable
> + * checksums. This is because we cannot launch a dynamic background worker
> + * directly from here, it has to be launched from a regular backend.
> + */
> + if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
> + ereport(WARNING,
> + (errmsg("checksum state is \"inprogress\" with no worker"),
> + errhint("Either disable or enable checksums by calling the
> pg_disable_data_checksums() or pg_enable_data_checksums()
> functions.")));
>
> This seems pretty half-baked.

I don't disagree with that. However, given that enabling checksums is a pretty
intensive operation, it seems somewhat unfriendly to restart it automatically.
As a DBA I wouldn't want that to kick off without manual intervention, but
there is also the risk of the warning being missed by anyone assuming it would
restart on its own. Any ideas on how to treat this?

If/when we can restart the processing where it left off, without the need to
go over all the data again, things might be different with regard to the
default action.

> + (errmsg("could not start checksumhelper: has been canceled")));
> + (errmsg("could not start checksumhelper: already running")));
> + (errmsg("failed to start checksum helper launcher")));
>
> These seem rough around the edges. Using an internal term like
> 'checksumhelper' in a user-facing error message doesn't seem great.
> Generally primary error messages are phrased as a single utterance
> where we can, rather than colon-separated fragments like this. The
> third message calls it 'checksum helper launcher' whereas the other
> two call it 'checksumhelper'. It also isn't very helpful; I don't
> think most people like a message saying that something failed with no
> explanation given.
>
> + elog(DEBUG1, "Checksumhelper skipping relation %d as it no longer
> exists", relationId);
>
> Here's another way to spell 'checksumhelper', and this time it refers
> to the worker rather than the launcher. Also, relation IDs are OIDs,
> so need to be printed with %u, and usually we try to print names if
> possible. Also, this message, like a lot of messages in this patch,
> begins with a capital letter and does not end with a period. That is
> neither the style for primary messages nor the style for detail
> messages. As these are primary messages, the primary message style
> should be used. That style is no capital and no period.

I've removed checksumhelper from all user-facing strings, and only kept it in
the DEBUG strings (which to some extent will probably be removed before a
final version of the patch, so I didn't spend too much time on those just
now). The bgworker names are still checksumhelper launcher and checksumhelper
worker, but those could perhaps do with better names too.
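
As an example of the direction (hypothetical wording, not the exact v16
strings):

    ereport(ERROR,
            (errmsg("could not start background process to enable checksums"),
             errhint("Check the server log for details.")));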

> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + ereport(LOG,
> + (errmsg("failed to start worker for checksumhelper in \"%s\"",
> + db->dbname)));
> + return FAILED;
> + }
>
> I don't think having constants with names like
> SUCCESSFUL/ABORTED/FAILED is a very good idea. Too much chance of name
> collisions. I suggest adding a prefix.

Fixed.

> Also, the retry logic here doesn't look particularly robust.
> RegisterDynamicBackgroundWorker will fail if no slots are available;
> if that happens twice for the same database, once on first attempting
> it and again when retrying it, the whole process fails, all state is
> lost, and all work has to be redone. That seems neither particularly
> unlikely nor pleasant.

Agreed, this was a brick or two shy of a load. I've rewritten this logic to
cope better with the conditions around startup/shutdown of bgworkers. I think
there should be some form of backoff mechanism as well in case there
temporarily aren't any slots, to avoid running through all the databases in
short order only to run up the retry counter. Something like: if X databases
in succession fail because no slot is available, back off a little before
trying database X+1, to allow the operations consuming the slot(s) to finish;
see the sketch below. It won't help for systems which are permanently starved
by a too-low max_worker_processes, but nothing short of raising that will.
For the latter, I've added a note to the documentation.
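
Roughly what I have in mind, inside the loop over databases (a sketch; the
threshold, sleep time and consecutive_failures counter are all made up):

    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
    {
        if (++consecutive_failures >= 3)
        {
            /* back off to let transient slot consumers finish */
            pg_usleep(5 * 1000000L);
            consecutive_failures = 0;
        }

        /* requeue this database rather than bumping its retry count */
        continue;
    }

    consecutive_failures = 0;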

> + if (DatabaseList == NIL || list_length(DatabaseList) == 0)
>
> I don't think that the second half of this test serves any purpose.

True, but I think the code is clearer if the second half is the one we keep,
so I went with that.
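
That is, keeping just (list_length() returns 0 for NIL, so both cases are
covered):

    if (list_length(DatabaseList) == 0)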

> + snprintf(activity, sizeof(activity), "Waiting for current
> transactions to finish (waiting for %d)", waitforxid);
>
> %u here too.

Fixed.
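
For reference:

    snprintf(activity, sizeof(activity),
             "Waiting for current transactions to finish (waiting for %u)",
             waitforxid);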

> + if (pgc->relpersistence == 't')
>
> Use the relevant constant.

Fixed.
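
That is:

    if (pgc->relpersistence == RELPERSISTENCE_TEMP)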

> + /*
> + * Wait for all temp tables that existed when we started to go away. This
> + * is necessary since we cannot "reach" them to enable checksums. Any temp
> + * tables created after we started will already have checksums in them
> + * (due to the inprogress state), so those are safe.
> + */
>
> This does not seem very nice. It just leaves a worker running more or
> less forever. It's essentially waiting for temp-table using sessions
> to go away, but that could take a really long time.

It can, but is there a realistic alternative? I can't think of one, but if
you have ideas I'd love for this requirement to go away, or to be made less
blocking.

At the same time, enabling checksums is hardly the kind of operation one does
casually in a busy database, but rather a planned one. This requirement is
mentioned in the documentation so that a DBA can plan for when to start the
processing.
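
For the record, the wait boils down to something like this (a sketch;
CountOldTempTables() is a hypothetical stand-in for however we track the temp
tables that existed at start):

    while (CountOldTempTables() > 0)
    {
        pgstat_report_activity(STATE_RUNNING,
                               "Waiting for pre-existing temp tables to go away");
        pg_usleep(10 * 1000000L);   /* arbitrary interval */
        CHECK_FOR_INTERRUPTS();
    }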

> + WAIT_EVENT_PG_SLEEP);
>
> You need to invent a new wait event and add docs for it.

Done. I failed to figure out a good name (IMO) though, and welcome
suggestions that are more descriptive. CHECKSUM_ENABLE_STARTCONDITION is what
I settled on, but I'm not too excited about it.
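
It's used along these lines (a sketch, assuming the name above becomes the
wait_event_info value):

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     5000L,     /* arbitrary timeout in ms */
                     WAIT_EVENT_CHECKSUM_ENABLE_STARTCONDITION);
    ResetLatch(MyLatch);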

> + if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_CHECKSUM))
> + {
> + /*
> + * By virtue of getting here (i.e. interrupts being processed), we
> + * know that this backend won't have any in-progress writes (which
> + * might have missed the checksum change).
> + */
> + }
>
> I don't believe this. I already wrote some about this over here:
>
> http://postgr.es/m/CA+TgmobORrsgSUydZ3MsSw9L5MBUGz7jRK+973uPZgiyCQ81ag@mail.gmail.com
>
> As a general point, I think that the idea of the ProcSignalBarrier
> mechanism is that every backend has some private state that needs to
> be updated, and when it absorbs the barrier you know that it's updated
> that state, and so when everybody's absorbed the barrier you know that
> all the state has been updated. Here, however, there actually is no
> backend-private state. The only state that everyone's consulting is
> the shared state stored in ControlFile->data_checksum_version. So what
> does absorbing the barrier prove? Only that we've reached a
> CHECK_FOR_INTERRUPTS(). But that is a useful guarantee only if we
> never check for interrupts between the time we examine
> ControlFile->data_checksum_version and the time we use it, and I see
> no particular reason to believe that should always be true, and I
> suspect it isn't, and even if it happens to be true today I think it
> could get broken in the future pretty easily. There are no particular
> restrictions documented in terms of where DataChecksumsNeedWrite(),
> XLogHintBitIsNeeded(), etc. can be checked or what can be done between
> checking the value and using it. The issue doesn't arise for today's
> DataChecksumsEnabled() because the value can't ever change, but with
> this patch things can change, and to me what the patch does about that
> doesn't really look adequate.

I don't disagree with this, but I need to do a bit more thinking before
presenting a suggested fix for this concurrency issue.

> I'm sort of out of time for right now, but I think this patch needs a
> lot more work on the concurrency end of things. It seems to me that it
> probably mostly works in practice, but that the whole concurrency
> mechanism is not very solid and probably has a lot of rare cases where
> it can just misbehave if you get unlucky. I'll try to spend some more
> time thinking about this next week. I also think that the fact that
> the mechanism for getting from 'inprogress' to 'on' seems fragile and
> under-engineered. It still bothers me that there's no mechanism for
> persisting the progress that we've made in enabling checksums; but
> even apart from that, I think that there just hasn't been enough
> thought given here to error/problem cases.

Thanks again for reviewing (and for working on the infrastructure required
for this patch to begin with)! Regarding persisting the progress: that would
be a really neat feature, but I don't have any suggestion on how to do that
safely for real use-cases.

Attached is a v16 rebased on top of current master which addresses the above
commented points, and which I am basing the concurrency work on.

cheers ./daniel

Attachment Content-Type Size
online_checksums16.patch application/octet-stream 72.8 KB
