Re: Online checksums patch - once again

From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Online checksums patch - once again
Date: 2020-11-17 08:56:27
Message-ID: 329E487A-E0E9-481C-985E-52888302487C@yesql.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On 5 Oct 2020, at 13:36, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> I looked at patch v22, and I can see two main issues:

Thanks for reviewing!

> 1. The one that Robert talked about earlier: A backend checks the local "checksums" state. If it's 'off', it writes a page without checksums. How do you guarantee that the local state doesn't change in between? The implicit assumption seems to be that there MUST NOT be any CHECK_FOR_INTERRUPTS() calls between DataChecksumsNeedWrite() and the write (or read and DataChecksumsNeedVerify()).
>
> In most code, the DataChecksumsNeedWrite() call is very close to writing out the page, often in the same critical section. But this is an undocumented assumption.

I've extended the documentation on this.

> The code in sendFile() in basebackup.c seems suspicious in that regard. It calls DataChecksumsNeedVerify() once before starting to read the file. Isn't it possible for the checksums flag to change while it's reading the file and sending it to the client? I hope there are CHECK_FOR_INTERRUPTS() calls buried somewhere in the loop, because it could take minutes to send the whole file.

Agreed, fixed.

> I would feel better if the state transition of the "checksums" flag could only happen in a few safe places, or there were some other safeguards for this. I think that's what Andres was trying to say earlier in the thread on ProcSignalBarriers. I'm not sure what the interface to that should be. It could be something like HOLD/RESUME_INTERRUPTS(), where normally all procsignals are handled on CHECK_FOR_INTERRUPTS(), but you could "hold off" some if needed. Or something else. Or maybe we can just use HOLD/RESUME_INTERRUPTS() for this. It's more coarse-grained than necessary, but probably doesn't matter in practice.
>
> At minimum, there needs to be comments in DataChecksumsNeedWrite() and DataChecksumsNeedVerify(), instructing how to use them safely. Namely, you must ensure that there are no interrupts between the DataChecksumsNeedWrite() and writing out the page, or between reading the page and the DataChecksumsNeedVerify() call. You can achieve that with HOLD_INTERRUPTS() or a critical section, or simply ensuring that there is no substantial code in between that could call CHECK_FOR_INTERRUPTS(). And sendFile() in basebackup.c needs to be fixed.
>
> Perhaps you could have "Assert(InteruptHoldOffCount > 0)" in DataChecksumsNeedWrite() and DataChecksumsNeedVerify()? There could be other ways that callers could avoid the TOCTOU issue, but it would probably catch most of the unsafe call patterns, and you could always wrap the DataChecksumsNeedWrite/verify() call in a dummy HOLD_INTERRUPTS() block to work around the assertion if you know what you're doing.

The attached holds off interrupt processing for the NeedWrite and NeedVerify
cases, and holds them for what I hope is the right duration for the respective
callsites.

One thing I realized in doing this is that the pg_stat_database checksums
statistics are set to NULL when checksums are disabled. That makes perfect
sense when checksum state is static, but not when it can be turned on/off. For
now I've made it so that stats are set to zero instead, and will continue
showing stats even if checksums gets disabled. Not sure what the best option
would be here.

> 2. The signaling between enable_data_checksums() and the launcher process looks funny to me. The general idea seems to be that enable_data_checksums() just starts the launcher process, and the launcher process figures out what it need to do and makes all the changes to the global state. But then there's this violation of the idea: enable_data_checksums() checks DataChecksumsOnInProgress(), and tells the launcher process whether it should continue a previously crashed operation or start from scratch. I think it would be much cleaner if the launcher process figured that out itself, and enable_data_checksums() would just tell the launcher what the target state is.
>
> enable_data_checksums() and disable_data_checksums() seem prone to race conditions. If you call enable_data_checksums() in two backends concurrently, depending on the timing, there are two possible outcomes:
>
> a) One call returns true, and launches the background process. The other call returns false.
>
> b) Both calls return true, but one of them emits a "NOTICE: data checksums worker is already running".
>
> In disable_data_checksum() imagine what happens if another backend calls enable_data_checksums() in between the ShutdownDatachecksumsWorkerIfRunning() and SetDataChecksumsOff() calls.

I've reworked this in the attached such that the enable_ and disable_ functions
merely call into the launcher with the desired outcome, and the launcher is
responsible for figuring out the rest. The datachecksumworker is now the sole
place which initiates a state transfer.

>> /*
>> * Mark the buffer as dirty and force a full page write. We have to
>> * re-write the page to WAL even if the checksum hasn't changed,
>> * because if there is a replica it might have a slightly different
>> * version of the page with an invalid checksum, caused by unlogged
>> * changes (e.g. hintbits) on the master happening while checksums
>> * were off. This can happen if there was a valid checksum on the page
>> * at one point in the past, so only when checksums are first on, then
>> * off, and then turned on again. Iff wal_level is set to "minimal",
>> * this could be avoided iff the checksum is calculated to be correct.
>> */
>> START_CRIT_SECTION();
>> MarkBufferDirty(buf);
>> log_newpage_buffer(buf, false);
>> END_CRIT_SECTION();
>
> It's really unfortunate that we have to dirty the page even if the checksum already happens to match. Could we only do the log_newpage_buffer() call and skip MarkBufferDirty() in that case?

I think we can, but I've intentionally stayed away from such optimizations
until the basic version of the patch was deemed safe and approaching done.
It's complicated enough as it is IMO.

> Could we get away with a more lightweight WAL record that doesn't contain the full-page image, but just the block number? On replay, the redo routine would read the page from disk.

Quite possibly, but I'm not sure how to reason about such a change to ensure
it's safety. I would love any ideas you'd have.

The attached fixes the above plus a few other small things found while hacking
on this version.

cheers ./daniel

Attachment Content-Type Size
online_checksums23.patch application/octet-stream 122.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Li Japin 2020-11-17 09:01:58 Re: Terminate the idle sessions
Previous Message Kyotaro Horiguchi 2020-11-17 08:29:16 Re: Cache relation sizes?