Re: 16-bit page checksums for 9.2

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 15:01:37
Message-ID: CA+U5nMJST4cJnd6SWDErxP_hrE7EmWg=erc==G7d0fSwwOpu6w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 22.02.2012 14:30, Simon Riggs wrote:
>>
>> On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>>
>>> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>>>>
>>>> Another disadvantage of the current scheme is that there's no
>>>> particularly easy way to know that your whole cluster has checksums.
>>>> No matter how we implement checksums, you'll have to rewrite every
>>>> table in the cluster in order to get them fully turned on.  But with
>>>> the current design, there's no easy way to know how much of the
>>>> cluster is actually checksummed.  If you shut checksums off, they'll
>>>> linger until those pages are rewritten, and there's no easy way to
>>>> find the relations from which they need to be removed, either.
>>>
>>>
>>> I'm not seeing value in rewriting pages to remove checksums, as opposed to
>>> just ignoring those checksums going forward.  Did you have a particular
>>> scenario in mind?
>>
>>
>> Agreed. No reason to change a checksum unless we rewrite the block, no
>> matter whether page_checksums is on or off.
>
>
> This can happen:
>
> 1. checksums are initially enabled. A page is written, with a correct
> checksum.
> 2. checksums turned off.
> 3. A hint bit is set on the page.
> 4. While the page is being written out, someone pulls the power cord, and
> you get a torn write. The hint bit change made it to disk, but the clearing
> of the checksum in the page header did not.
> 5. Sometime after restart, checksums are turned back on.
>
> The page now has an incorrect checksum on it. The next time it's read, you
> get a checksum error.

Yes, you will. You'll get a checksum error because the block no longer
passes verification, so an error should be reported.
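
To make the five steps quoted above concrete, here is a minimal simulation
sketch. Everything in it is an assumption made for illustration: the page
layout (a one-byte flag plus a 16-bit checksum at the start of the page), the
byte-sum checksum, the 512-byte sector size, and all function names are
hypothetical and are not taken from the patch or from PostgreSQL's real page
header.

```python
PAGE_SIZE = 8192
SECTOR_SIZE = 512  # assumed unit of atomic disk writes

# Hypothetical layout, for illustration only:
# byte 0 = "page has a checksum" flag, bytes 1-2 = 16-bit checksum, rest = data.
FLAG_OFF, SUM_OFF, DATA_OFF = 0, 1, 3
HINT_OFF = 4096  # a hint bit stored in a different sector than the header


def checksum16(page) -> int:
    # Stand-in checksum: sum of the data bytes, truncated to 16 bits.
    return sum(page[DATA_OFF:]) & 0xFFFF


def write_page(page: bytearray, checksums_enabled: bool) -> bytes:
    # Set (or clear) the flag and checksum just before the page goes to disk.
    if checksums_enabled:
        page[FLAG_OFF] = 1
        page[SUM_OFF:SUM_OFF + 2] = checksum16(page).to_bytes(2, "little")
    else:
        page[FLAG_OFF] = 0
        page[SUM_OFF:SUM_OFF + 2] = b"\x00\x00"
    return bytes(page)


def verify_page(page: bytes, checksums_enabled: bool) -> bool:
    # Pages without the flag are accepted as "not checksummed".
    if not checksums_enabled or page[FLAG_OFF] == 0:
        return True
    stored = int.from_bytes(page[SUM_OFF:SUM_OFF + 2], "little")
    return stored == checksum16(page)


def torn_write(old: bytes, new: bytes, sectors_written: set) -> bytes:
    # Only the listed sectors of the new image reach disk before power is lost.
    out = bytearray(old)
    for s in sectors_written:
        lo, hi = s * SECTOR_SIZE, (s + 1) * SECTOR_SIZE
        out[lo:hi] = new[lo:hi]
    return bytes(out)


# 1. Checksums enabled; the page is written with a correct checksum.
buf = bytearray(b"\x10" * PAGE_SIZE)
on_disk = write_page(buf, checksums_enabled=True)
ok_before = verify_page(on_disk, checksums_enabled=True)

# 2. Checksums turned off.  3. A hint bit is set on the page.
buf = bytearray(on_disk)
buf[HINT_OFF] |= 0x01
new_image = write_page(buf, checksums_enabled=False)

# 4. Torn write: the sector holding the hint bit reaches disk,
#    the header sector (with the cleared flag and checksum) does not.
on_disk = torn_write(on_disk, new_image, {HINT_OFF // SECTOR_SIZE})

# 5. Checksums turned back on; the next read verifies the page.
ok_after = verify_page(on_disk, checksums_enabled=True)
print("valid before:", ok_before, "valid after torn write:", ok_after)
```

In this toy model the page on disk keeps the old flag and checksum but carries
the new hint bit, so verification fails even though no hardware fault has
occurred.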

We can and should document that switching page_checksums on, then off,
then on again can cause problems of this kind. Hopefully crashing isn't
that common a situation.

The production default would be "off". The default in the patch is
"on" only for testing.

> I'm pretty uncomfortable with this idea of having a flag on the page itself
> to indicate whether it has a checksum or not. No matter how many bits we use
> for that flag. You can never be quite sure that all your data is covered by
> the checksum, and there's a lot of room for subtle bugs like the above,
> where a page is reported as corrupt when it isn't, or vice versa.

That is necessary to allow upgrade. It's not there for any other reason.

> This thing needs to be reliable and robust. The purpose of a checksum is to
> have an extra sanity check, to detect faulty hardware. If it's complicated,
> whenever you get a checksum mismatch, you'll be wondering if you have broken
> hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
> in pg_control or somewhere to indicate whether checksums are currently
> enabled or disabled, and a mechanism to scan and rewrite all the pages with
> checksums, before they are verified.

That would require massive downtime, so again, it has been ruled out
for practicality.
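
For readers following the trade-off, the cluster-wide alternative quoted above
can be sketched as a small state machine. This is only a rough sketch under
stated assumptions: the three state names, the `Cluster` class, its toy
checksum, and the in-memory "pages" are all hypothetical and do not describe
pg_control or any actual patch.

```python
from enum import Enum


class ChecksumState(Enum):
    OFF = "off"                  # checksums neither written nor verified
    IN_PROGRESS = "in progress"  # written on every page write, not yet verified
    ON = "on"                    # written and verified


class Cluster:
    """Toy cluster with a single cluster-wide flag (hypothetical model)."""

    def __init__(self, n_pages: int):
        self.state = ChecksumState.OFF
        self.pages = [{"data": i, "checksum": None} for i in range(n_pages)]

    @staticmethod
    def _checksum(data: int) -> int:
        # Stand-in 16-bit checksum, for illustration only.
        return (data * 31 + 7) & 0xFFFF

    def write_page(self, i: int) -> None:
        page = self.pages[i]
        if self.state is ChecksumState.OFF:
            page["checksum"] = None
        else:
            page["checksum"] = self._checksum(page["data"])

    def read_page(self, i: int) -> int:
        page = self.pages[i]
        # Verification happens only once every page is known to be covered.
        if self.state is ChecksumState.ON:
            if page["checksum"] != self._checksum(page["data"]):
                raise ValueError(f"checksum mismatch on page {i}")
        return page["data"]

    def enable_checksums(self) -> None:
        # Scan and rewrite every page before any checksum is verified.
        self.state = ChecksumState.IN_PROGRESS
        for i in range(len(self.pages)):
            self.write_page(i)
        self.state = ChecksumState.ON


cluster = Cluster(4)
cluster.enable_checksums()
values = [cluster.read_page(i) for i in range(4)]
```

The full scan in `enable_checksums` is the step whose cost is at issue here:
in this model every page must be rewritten before verification can start.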

> I've said this before, but I still don't like the hacks with the version
> number in the page header. Even if it works, I would much prefer the
> straightforward option of extending the page header for the new field. Yes,
> it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
> jump at some point anyway.

What you suggest might happen in the next release, or maybe later.
There may be things that block it completely, so it might never
happen. My personal opinion is that it is not possible to make further
block format changes until we have a fully online upgrade process,
otherwise we block people from upgrading - not everybody can take
their site down to run pg_upgrade. I plan to work on that, but it may
not happen for 9.3; perhaps you will object to that also when it
comes.

So we simply cannot rely on this "jam tomorrow" vision.

This patch is very specifically something that makes the best of the
situation, now, for those that want and need it. If you don't want it,
you don't have to use it. But that shouldn't stop us giving it to the
people that do want it.

I'm hearing general interest and support for this feature from people
that run their business on PostgreSQL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
