Re: Online enabling of checksums

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Daniel Gustafsson <daniel(at)yesql(dot)se>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Subject: Re: Online enabling of checksums
Date: 2018-03-20 14:40:53
Message-ID: CABUevEx_Zd+-h=-zKQuBLhmZd+TT1xbdwo9sVwCbrZ2bTHfbbw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 19, 2018 at 7:27 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> Hi,
>
> The patch looks good to me at a high level. Some notes below. I didn't
> read through the whole thread, so sorry if some of these have been
> discussed already.
>
> +void
>> +ShutdownChecksumHelperIfRunning(void)
>> +{
>> + if (pg_atomic_unlocked_test_flag(&ChecksumHelperShmem->launcher
>> _started))
>> + {
>> + /* Launcher not started, so nothing to shut down */
>> + return;
>> + }
>> +
>> + ereport(ERROR,
>> + (errmsg("checksum helper is currently running,
>> cannot disable checksums"),
>> + errhint("Restart the cluster or wait for the
>> worker to finish.")));
>> +}
>>
>
> Is there no way to stop the checksum helper once it's started? That seems
> rather user-unfriendly. I can imagine it being a pretty common mistake to
> call pg_enable_data_checksums() on a 10 TB cluster, only to realize that
> you forgot to set the cost limit, and that it's hurting queries too much.
> At that point, you want to abort.
>

Agreed. You could do it with pg_terminate_backend() but you had to do it in
the right order, etc.

Attached patch fixes this by making it possible to abort the process by
executing pg_disable_data_checksums() during the process. In this case the
live workers will abort, and the checksums will be switched off again.

> +extern bool DataChecksumsEnabledOrInProgress(void);
>> +extern bool DataChecksumsInProgress(void);
>> +extern bool DataChecksumsDisabled(void);
>>
>
> I find the name of the DataChecksumsEnabledOrInProgress() function a bit
> long. And doing this in PageIsVerified looks a bit weird:
>
> > if (DataChecksumsEnabledOrInProgress() && !DataChecksumsInProgress())
>
> I think I'd prefer functions like:
>
> /* checksums should be computed on write? */
> bool DataChecksumNeedWrite()
> /* checksum should be verified on read? */
> bool DataChecksumNeedVerify()
>

Agreed. We also need DataChecksumsInProgress() to make it work properly,
but that makes the names a lot more clear. Adjusted as such.

> + <literal>template0</literal> is by default not accepting
>> connections, to
>> + enable checksums you'll need to temporarily make it accept
>> connections.
>>
>
> This was already discussed, and I agree with the other comments that it's
> bad.
>

Do you have any opinion on the thread about how to handle that one? As in
which way to go about solving it? (The second thread linked from the CF
entry - it wasn't linked before as intended, but it is now)

+CREATE OR REPLACE FUNCTION pg_enable_data_checksums (
>> + cost_delay int DEFAULT 0, cost_limit int DEFAULT 100)
>> + RETURNS boolean STRICT VOLATILE LANGUAGE internal AS
>> 'enable_data_checksums'
>> + PARALLEL RESTRICTED;
>>
>
> pg_[enable|disable]_checksums() functions return a bool. It's not clear
> to me when they would return what. I'd suggest marking them as 'void'
> instead.

Agreed, changed.

--- a/src/include/storage/bufpage.h
>> +++ b/src/include/storage/bufpage.h
>> @@ -194,6 +194,7 @@ typedef PageHeaderData *PageHeader;
>> */
>> #define PG_PAGE_LAYOUT_VERSION 4
>> #define PG_DATA_CHECKSUM_VERSION 1
>> +#define PG_DATA_CHECKSUM_INPROGRESS_VERSION 2
>>
>>
>
> This seems like a weird place for these PG_DATA_CHECKSUM_* constants.
> They're not actually stored in the page header, as you might think. I think
> the idea was to keep PG_DATA_CHECKSUM_VERSION close to
> PG_PAGE_LAYOUT_VERSION, because the checksums affected the on-disk format.
> I think it was a bit weird even before this patch, but now it's worse. At
> least a better comment would be in order, or maybe move these elsewhere.
>

True, but they apply to the page level? I'm not sure about the original
reasoning to put them there,figured it wasn't the responsibility of this
patch to move them. But we can certainly do that.

But what would be an appropriate place elsewhere? First thought would be
pg_control.h, but that would then not be consistent with e.g. wal_level
(which is declared in xlog.h and not pg_control.h..

Feature request: it'd be nice to update the 'ps status' with
> set_ps_display, to show a rudimentary progress indicator. Like the name and
> block number of the relation being processed. It won't tell you how much is
> left to go, but at least it will give a warm fuzzy feeling to the DBA that
> something is happening.
>

In the attached patch it sets this information in pg_stat_activity. I think
that makes more sense than the ps display, and I think is more consistent
with other ways we use them (e.g. autovacuum doesn't update it's ps title
for every table, but it does update pg_stat_activity).

I didn't review the offline tool, pg_verify_checksums.

PFA a patch that takes into account these comments and we believe all other
pending ones as well.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

Attachment Content-Type Size
online_checksums6.patch text/x-patch 74.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2018-03-20 14:41:57 Re: XID-assigned idle transactions affect vacuum's job.
Previous Message Tom Lane 2018-03-20 14:40:26 Re: configure's checks for --enable-tap-tests are insufficient