Re: 16-bit page checksums for 9.2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 22:04:06
Message-ID: CA+TgmoaYZrGEcvbTSXkLQYXjFPz-nqRVZKrm3Ss-66wjUU67Gw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Doesn't this seem awfully bad for performance on Hot Standby servers?
>> I agree that it fixes the problem with un-WAL-logged pages there, but
>> I seem to recall some recent complaining about performance features
>> that work on the master but not the standby.  Durable hint bits are
>> one such feature.
>
> It's impossible for it to work, in this case, since we cannot write
> new WAL to prevent torn pages.
>
> Note that hint bit setting on a dirty block is allowed, so many hints
> will still be set in Hot Standby.

To me, it seems that you are applying a double standard. You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles. But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it. Of course,
this feature, like those, CAN be made to work. It just involves
solving difficult problems that have little to do with the primary
purpose of the patch. To be honest, I don't use Hot Standby enough to
care very much about this particular issue, and I'm not saying we
should reject it on these grounds. But I do think it's a mistake to
dismiss it entirely, since every limitation of Hot Standby narrows the
set of cases in which it can be deployed. And at any rate, I want the
same standard applied to my work as to yours.

>> I am slightly worried that this expansion in the use of this mechanism
>> (formerly called inCommit, for those following along at home) could
>> lead to checkpoint starvation.  Suppose we've got one or two large
>> table scans wandering along, setting hint bits, and now suddenly it's
>> time to checkpoint.  How long will it take the checkpoint process to
>> find a time when nobody's got delayChkpt set?
>
> We don't need to wait until nobody has it set, we just need to wait
> for the people that had it set when we first checked to be out of that
> state momentarily.

Ah... good point.
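
For the archives, here's a toy sketch of that scheme as I now
understand it (names invented for illustration; the real code would
use the procarray and sleep rather than busy-wait):

#include <stdbool.h>

#define MAX_BACKENDS 100

/* One flag per backend; true while that backend is setting hint bits
 * and must not have a checkpoint's write sneak in under it. */
static volatile bool delayChkpt[MAX_BACKENDS];

/*
 * Checkpoint side: snapshot which backends have the flag set right
 * now, then wait only for each of those to clear it once.  Backends
 * that set the flag after the snapshot don't matter, so a steady
 * stream of hint-bit setters can't starve the checkpoint.
 */
static void
WaitForDelayingBackends(void)
{
    int     snapshot[MAX_BACKENDS];
    int     nwait = 0;

    for (int i = 0; i < MAX_BACKENDS; i++)
        if (delayChkpt[i])
            snapshot[nwait++] = i;

    for (int j = 0; j < nwait; j++)
        while (delayChkpt[snapshot[j]])
            ;                   /* real code would sleep, not spin */
}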

>> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
>> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
>> do?
>
> As explained in detailed comments, the purpose of this is to implement
> Heikki's suggestion that we have a bit set to zero so we can detect
> failures that cause a run of 1s.

I think it's nonsensical to pretend that there's anything special
about that particular bit. If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit. There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future). If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.
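
To be concrete, I'm imagining something along these lines, against a
simplified header struct (the exact set of tests here is a sketch,
not a proposal for the final list):

#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ              8192
#define PD_VALID_FLAG_BITS  0x0007      /* the three defined pd_flags bits */

typedef struct PageHeader
{
    uint64_t    pd_lsn;         /* simplified; really two 32-bit halves */
    uint16_t    pd_tli;
    uint16_t    pd_flags;
    uint16_t    pd_lower;
    uint16_t    pd_upper;
    uint16_t    pd_special;
    uint16_t    pd_pagesize_version;
} PageHeader;

/*
 * Cross-check several fields with known-valid ranges before trusting
 * the header's claim that no checksum is present.  Any one test is
 * weak; together they catch far more than one hand-picked zero bit.
 */
static bool
header_looks_sane(const PageHeader *p, uint64_t current_insert_lsn)
{
    if (p->pd_flags & ~PD_VALID_FLAG_BITS)
        return false;           /* undefined flag bits must all be zero */
    if (p->pd_lsn > current_insert_lsn)
        return false;           /* LSN must not be in the future */
    if (p->pd_lower > p->pd_upper || p->pd_upper > p->pd_special)
        return false;           /* item/free-space pointers must be ordered */
    if (p->pd_special > BLCKSZ)
        return false;
    return true;
}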

That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.

Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.

I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're all
checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a bit further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.
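
Roughly the shape I have in mind, with invented names and a stand-in
checksum function (the hard part, as I said, is plumbing the
relation-level flag down to this point):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Hypothetical: the has-checksums decision comes from pg_class via
 * the relcache, not from anything stored on the page itself. */
typedef struct RelationData
{
    bool    has_checksums;
} RelationData;

/* Stand-in for the real 16-bit checksum; assumes, purely for this
 * sketch, that the stored checksum occupies the first two bytes. */
static uint16_t
page_checksum(const char *page, uint32_t blkno)
{
    uint16_t    sum = (uint16_t) blkno;

    for (int i = 2; i < BLCKSZ; i++)
        sum = (uint16_t) (sum * 31 + (unsigned char) page[i]);
    return sum;
}

/*
 * On read, the relation -- not a page-header bit -- decides whether a
 * checksum must be present and valid, so a flipped bit on the page
 * can't silently disable verification.
 */
static bool
check_page_on_read(const RelationData *rel, const char *page, uint32_t blkno)
{
    uint16_t    stored;

    if (!rel->has_checksums)
        return true;            /* relation not checksummed */

    memcpy(&stored, page, sizeof(stored));
    return stored == page_checksum(page, blkno);
}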

I also think that the question of where exactly the checksum ought to
be put might bear some more thought. Tom expressed some concern about
stealing the page version field, and it seems to me we could work
around that by instead stealing something less valuable. The top
eight bits of the pd_pagesize_version field probably meet that
criterion, since in practice basically everybody uses 8K blocks, and
the number of errors that will be caught by checksums is probably much
larger than the number of errors that will be caught by the page-size
cross-check. But maybe the other 8 bits should come from somewhere
else, maybe pd_tli or pd_flags. For almost all practical purposes,
storing only the low-order 8 bits of the TLI ought to provide just as
much of a safety check as storing the low-order 16 bits, so I think
that might be the way to go. We could set it up so that we always
check only the low 8 bits against the TLI, regardless of whether
checksums are enabled; then we basically free up that bit space at no
cost in code complexity.
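
To make the bit accounting concrete, here's roughly what that layout
could look like (hypothetical helpers, not a worked-out patch):

#include <stdint.h>

/* Hypothetical layout: the 16-bit checksum is split across the high
 * byte of pd_pagesize_version and the high byte of pd_tli; the low
 * bytes keep the page size/version and the TLI check, respectively. */

static inline uint16_t
page_get_checksum(uint16_t pd_pagesize_version, uint16_t pd_tli)
{
    return (uint16_t) ((pd_pagesize_version & 0xFF00) |
                       ((pd_tli & 0xFF00) >> 8));
}

static inline void
page_set_checksum(uint16_t *pd_pagesize_version, uint16_t *pd_tli,
                  uint16_t checksum)
{
    *pd_pagesize_version = (uint16_t)
        ((*pd_pagesize_version & 0x00FF) | (checksum & 0xFF00));
    *pd_tli = (uint16_t)
        ((*pd_tli & 0x00FF) | ((checksum & 0x00FF) << 8));
}

static inline int
tli_matches(uint16_t pd_tli, uint32_t current_tli)
{
    /* Compare only the low byte, whether or not checksums are on. */
    return (pd_tli & 0x00FF) == (current_tli & 0x00FF);
}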

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
