Re: Offline enabling/disabling of data checksums

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Michael Banck <michael(dot)banck(at)credativ(dot)de>, Postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Offline enabling/disabling of data checksums
Date: 2018-12-28 00:14:05
Message-ID: bd6b833b-8250-0810-dfb6-4d8b3e6581b0@2ndquadrant.com
Lists: pgsql-hackers

On 12/28/18 12:25 AM, Michael Paquier wrote:
> On Thu, Dec 27, 2018 at 03:46:48PM +0100, Tomas Vondra wrote:
>> On 12/27/18 11:43 AM, Magnus Hagander wrote:
>>> Should we double-check with packagers that this won't cause a problem?
>>> Though the fact that it's done in a major release should make it
>>> perfectly fine I think -- and it's a smaller change than when we did all
>>> those xlog->wal changes...
>>>
>>
>> I think it makes little sense to not rename the tool now. I'm pretty
>> sure we'd end up doing that sooner or later anyway, and we'll just live
>> with a misnamed tool until then.
>
> Do you think that a thread on -packagers would be more appropriate, then?
>

I'm sorry, but I'm not sure I understand the question. Of course, asking
over at -packagers won't hurt, but my guess is the response will be it's
not a big deal from the packaging perspective.

>> I don't know, TBH. I agree making the on/off change cheaper moves us
>> closer to 'on' by default, because users can then disable it if needed.
>> But it's not the whole story.
>>
>> If we enable checksums by default, 99% users will have them enabled.
>> That means more people will actually observe data corruption cases that
>> went unnoticed so far. What shall we do with that? We don't have very
>> good answers to that (tooling, docs) and I'd say "disable checksums" is
>> not a particularly amazing response in this case :-(
>
> Enabling data checksums by default is still a couple of steps away,
> without a way to control them better.
>

What do you mean by "control" here? Dealing with checksum failures, or
some additional capabilities?

>> FWIW I don't know what to do about that. We certainly can't prevent the
>> data corruption, but maybe we could help with fixing it (although that's
>> bound to be low-level work).
>
> Yes, data checksums are extremely useful for telling people when the
> problem is *not* in Postgres, which can be really hard in a large
> organization. Knowing about the corrupted page is also useful because
> you can look at its raw bytes before it gets zeroed, to spot patterns
> that can help the teams in charge of the lower levels of the stack.

I'm not sure data checksums are particularly great evidence. For example,
with the recent fsync issues we might have ended up with partial writes
(and thus invalid checksums). The OS might even have told us about the
failure, but we've gracefully ignored it. So I'm afraid data checksums
are not particularly strong proof that it's not our fault.
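That said, taking a look at the raw bytes of a suspect page is cheap
enough to do. Just as an illustration (not part of any patch here), a
minimal sketch assuming the default BLCKSZ of 8192 and the standard page
header layout (pd_checksum is the 16-bit field at byte offset 8) might
look like this:

    /*
     * Hex-dump a single 8 kB block from a relation file so its bytes can
     * be inspected before any recovery action zeroes the page.  Purely
     * illustrative; assumes BLCKSZ = 8192 and the standard page header.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLCKSZ 8192

    int
    main(int argc, char **argv)
    {
        if (argc != 3)
        {
            fprintf(stderr, "usage: %s <relation-file> <block-number>\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (f == NULL)
        {
            perror("fopen");
            return 1;
        }

        long blkno = strtol(argv[2], NULL, 10);
        unsigned char page[BLCKSZ];

        if (fseek(f, blkno * (long) BLCKSZ, SEEK_SET) != 0 ||
            fread(page, 1, BLCKSZ, f) != BLCKSZ)
        {
            fprintf(stderr, "could not read block %ld\n", blkno);
            fclose(f);
            return 1;
        }
        fclose(f);

        /* pd_checksum is stored at offset 8 in the page header */
        uint16_t stored;
        memcpy(&stored, page + 8, sizeof(stored));
        printf("block %ld: stored checksum 0x%04X\n", blkno, stored);

        /* hex dump, 16 bytes per line, to make corruption patterns visible */
        for (int off = 0; off < BLCKSZ; off += 16)
        {
            printf("%04x  ", off);
            for (int i = 0; i < 16; i++)
                printf("%02x ", page[off + i]);
            printf("\n");
        }
        return 0;
    }

With something like that in hand, spotting an all-zeros tail or a chunk
of obviously foreign data is often enough to point the storage or OS
people in the right direction, even if it doesn't prove where the fault
lies.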

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
