From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Online checksums patch - once again
Date: 2020-10-05 11:36:39
Lists: pgsql-hackers

I looked at patch v22, and I can see two main issues:

1. The one that Robert talked about earlier: A backend checks the local
"checksums" state. If it's 'off', it writes a page without checksums.
How do you guarantee that the local state doesn't change in between? The
implicit assumption seems to be that there MUST NOT be any
CHECK_FOR_INTERRUPTS() calls between DataChecksumsNeedWrite() and the
write (or between the read and DataChecksumsNeedVerify()).

In most code, the DataChecksumsNeedWrite() call is very close to writing
out the page, often in the same critical section. But this is an
undocumented assumption.
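
To illustrate, the pattern in question has this shape (a sketch using
the existing PageSetChecksumInplace() and smgrwrite() APIs, not code
from the patch):

if (DataChecksumsNeedWrite())
    PageSetChecksumInplace(page, blkno);

/*
 * If anything here can reach CHECK_FOR_INTERRUPTS(), the local
 * "checksums" state can flip after the check, and we write the page
 * under the wrong regime.
 */
smgrwrite(reln, forknum, blkno, page, false);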

The code in sendFile() in basebackup.c seems suspicious in that regard.
It calls DataChecksumsNeedVerify() once before starting to read the
file. Isn't it possible for the checksums flag to change while it's
reading the file and sending it to the client? I hope there are
CHECK_FOR_INTERRUPTS() calls buried somewhere in the loop, because it
could take minutes to send the whole file.
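
Schematically, the loop is shaped like this (paraphrased from memory,
not the literal patch; verify_page_checksums() is a placeholder for the
actual per-page verification code):

verify_checksum = DataChecksumsNeedVerify();    /* checked once, up front */

while ((cnt = fread(buf, 1, sizeof(buf), fp)) > 0)
{
    if (verify_checksum)
        verify_page_checksums(buf, cnt);    /* placeholder */
    pq_putmessage('d', buf, cnt);
    /*
     * Any CHECK_FOR_INTERRUPTS() reached from here on can make the
     * up-front answer stale for the rest of the file.
     */
}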

I would feel better if the state transition of the "checksums" flag
could only happen in a few safe places, or if there were some other
safeguards for this. I think that's what Andres was trying to say
earlier in the thread about ProcSignalBarriers. I'm not sure what the
interface to that should be. It could be something analogous to
HOLD/RESUME_INTERRUPTS(): normally all procsignals are handled at
CHECK_FOR_INTERRUPTS(), but you could "hold off" specific ones if
needed. Or maybe we can just use HOLD/RESUME_INTERRUPTS() itself for
this. It's more coarse-grained than necessary, but that probably
doesn't matter in practice.
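
With the coarse-grained variant, the write path sketched above would
become something like:

HOLD_INTERRUPTS();
if (DataChecksumsNeedWrite())
    PageSetChecksumInplace(page, blkno);
smgrwrite(reln, forknum, blkno, page, false);
RESUME_INTERRUPTS();

i.e. the local state is pinned for the duration of the check-and-write.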

At minimum, there need to be comments in DataChecksumsNeedWrite() and
DataChecksumsNeedVerify(), instructing how to use them safely. Namely,
you must ensure that there are no interrupts between the
DataChecksumsNeedWrite() call and writing out the page, or between
reading the page and the DataChecksumsNeedVerify() call. You can
achieve that with HOLD_INTERRUPTS() or a critical section, or simply by
ensuring that there is no substantial code in between that could call
CHECK_FOR_INTERRUPTS(). And sendFile() in basebackup.c needs to be fixed.
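
As a starting point, something along these lines above both functions:

/*
 * NB: The caller must ensure that no CHECK_FOR_INTERRUPTS() can be
 * processed between this call and the page write (or between the page
 * read and the DataChecksumsNeedVerify() call), as the local checksums
 * state may otherwise change in between.  Use HOLD_INTERRUPTS() or a
 * critical section, or make sure that no code in between processes
 * interrupts.
 */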

Perhaps you could have "Assert(InterruptHoldoffCount > 0)" in
DataChecksumsNeedWrite() and DataChecksumsNeedVerify()? There could be
other ways for callers to avoid the TOCTOU issue, but it would
probably catch most of the unsafe call patterns, and you could always
wrap the DataChecksumsNeedWrite/Verify() call in a dummy
HOLD_INTERRUPTS() block to work around the assertion if you know what
you're doing.
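
That is, something like this (InterruptHoldoffCount and
CritSectionCount are the existing counters from miscadmin.h; a critical
section also suppresses interrupt processing, so it should probably
satisfy the assertion too):

bool
DataChecksumsNeedWrite(void)
{
    Assert(InterruptHoldoffCount > 0 || CritSectionCount > 0);

    /* ... existing logic ... */
}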

2. The signaling between enable_data_checksums() and the launcher
process looks funny to me. The general idea seems to be that
enable_data_checksums() just starts the launcher process, and the
launcher process figures out what it needs to do and makes all the
changes to the global state. But then there's this violation of the
idea: enable_data_checksums() checks DataChecksumsOnInProgress(), and
tells the launcher process whether it should continue a previously
crashed operation or start from scratch. I think it would be much
cleaner if the launcher process figured that out itself, and
enable_data_checksums() would just tell the launcher what the target
state is.
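
Roughly, I'd expect the SQL-callable function to reduce to something
like this (SetDataChecksumsTarget() and
StartDatachecksumsLauncherIfNotRunning() are made-up names for whatever
communicates with the launcher):

Datum
enable_data_checksums(PG_FUNCTION_ARGS)
{
    /*
     * Just record the desired end state and poke the launcher; the
     * launcher decides on its own whether to resume a previously
     * crashed operation or start from scratch.
     */
    SetDataChecksumsTarget(DATA_CHECKSUMS_ON);
    StartDatachecksumsLauncherIfNotRunning();

    PG_RETURN_BOOL(true);
}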

enable_data_checksums() and disable_data_checksums() seem prone to race
conditions. If you call enable_data_checksums() in two backends
concurrently, depending on the timing, there are two possible outcomes:

a) One call returns true, and launches the background process. The other
call returns false.

b) Both calls return true, but one of them emits a "NOTICE: data
checksums worker is already running".

In disable_data_checksums(), imagine what happens if another backend calls
enable_data_checksums() in between the
ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.
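
For example:

Backend A (disable)                        Backend B (enable)
-------------------                        ------------------
ShutdownDatachecksumsWorkerIfRunning();
                                           enable_data_checksums()
                                           starts the launcher, which
                                           begins enabling checksums
SetDataChecksumsOff();

and now the shared state says "off" while a launcher is busy turning
checksums back on.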

> /*
> * Mark the buffer as dirty and force a full page write. We have to
> * re-write the page to WAL even if the checksum hasn't changed,
> * because if there is a replica it might have a slightly different
> * version of the page with an invalid checksum, caused by unlogged
> * changes (e.g. hintbits) on the master happening while checksums
> * were off. This can happen if there was a valid checksum on the page
> * at one point in the past, so only when checksums are first on, then
> * off, and then turned on again. Iff wal_level is set to "minimal",
> * this could be avoided iff the checksum is calculated to be correct.
> */
> MarkBufferDirty(buf);
> log_newpage_buffer(buf, false);

It's really unfortunate that we have to dirty the page even if the
checksum already happens to match. Could we only do the
log_newpage_buffer() call and skip MarkBufferDirty() in that case?
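
I'm thinking of something along these lines (pg_checksum_page() is the
existing checksum routine; whether log_newpage_buffer() is actually
safe on a buffer we didn't dirty is exactly what would need to be
verified, since it sets the page LSN):

if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blkno))
    MarkBufferDirty(buf);
log_newpage_buffer(buf, false);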

Could we get away with a more lightweight WAL record that doesn't
contain the full-page image, but just the block number? On replay, the
redo routine would read the page from disk.
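
Sketching that idea (XLOG_CHECKSUM_BLOCK and the redo fragment are made
up; REGBUF_NO_IMAGE is the existing flag that suppresses the full-page
image):

/* On the primary: WAL-log only the block reference, no image. */
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKSUM_BLOCK);
PageSetLSN(BufferGetPage(buf), recptr);

/*
 * On replay: read the page from disk, bump its LSN and dirty it, so
 * that it is eventually written back out with a checksum.
 */
if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
{
    PageSetLSN(BufferGetPage(buf), lsn);
    MarkBufferDirty(buf);
}
if (BufferIsValid(buf))
    UnlockReleaseBuffer(buf);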

- Heikki
