Re: Checksums by default?

From: David Steele <david(at)pgmasters(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)heroku(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksums by default?
Date: 2017-01-30 17:29:04
Message-ID: b8f21c38-3b28-a50f-d997-a1c113136a6d@pgmasters.net
Lists: pgsql-hackers

On 1/25/17 10:38 PM, Stephen Frost wrote:
> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>> On Wed, Jan 25, 2017 at 7:37 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> On 2017-01-25 19:30:08 -0500, Stephen Frost wrote:
>>>> * Peter Geoghegan (pg(at)heroku(dot)com) wrote:
>>>>> On Wed, Jan 25, 2017 at 3:30 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>>>>>> As it is, there are backup solutions which *do* check the checksum when
>>>>>> backing up PG. This is no longer, thankfully, some hypothetical thing,
>>>>>> but something which really exists and will hopefully keep users from
>>>>>> losing data.
>>>>>
>>>>> Wouldn't that have issues with torn pages?
>>>>
>>>> No, why would it? The page has either been written out by PG to the OS,
>>>> in which case the backup s/w will see the new page, or it hasn't been.
>>>
>>> Uh. Writes aren't atomic on that granularity. That means you very well
>>> *can* see a torn page (in linux you can e.g. on 4KB os page boundaries
>>> of a 8KB postgres page). Just read a page while it's being written out.
>>
>> Yeah. This is also why backups force full page writes on even if
>> they're turned off in general.
>
> I've got a question into David about this, I know we chatted about the
> risk at one point, I just don't recall what we ended up doing (I can
> imagine a few different possible things- re-read the page, which isn't a
> guarantee but reduces the chances a fair bit, or check the LSN, or
> perhaps the plan was to just check if it's in the WAL, as I mentioned)
> or if we ended up concluding it wasn't a risk for some, perhaps
> incorrect, reason and need to revisit it.

The solution was to simply ignore the checksums of any pages with an LSN
greater than or equal to the LSN returned by pg_start_backup(). This
means that hot blocks may never be checked during backup, but if they
are active then any problems should be caught directly by PostgreSQL.
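A minimal sketch of the LSN filter, for readers who want to see the decision spelled out. It assumes the standard 8KB page, reads pd_lsn from the first 8 bytes of the page header (assuming a little-endian layout of two uint32 fields), and reads pd_checksum at offset 8. The checksum function here is a HYPOTHETICAL byte-sum stand-in, not PostgreSQL's real page checksum algorithm, and `make_page`/`verify_page` are illustrative names, not part of any real backup tool.

```python
import struct

PAGE_SIZE = 8192  # PostgreSQL's default block size (BLCKSZ)

def parse_lsn(text: str) -> int:
    """Convert an LSN in PostgreSQL's textual 'X/Y' hex form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def page_lsn(page: bytes) -> int:
    """Read pd_lsn from the first 8 bytes of the page header.

    Assumes a little-endian layout: two uint32 fields (xlogid, xrecoff).
    """
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff

def toy_checksum(page: bytes) -> int:
    """HYPOTHETICAL stand-in for PostgreSQL's page checksum: a 16-bit byte sum
    computed with the pd_checksum field (bytes 8-9) zeroed."""
    buf = bytearray(page)
    buf[8:10] = b"\x00\x00"
    return sum(buf) & 0xFFFF

def make_page(lsn: int, fill: int) -> bytes:
    """Build a synthetic page with the given LSN and a valid toy checksum."""
    page = bytearray([fill]) * PAGE_SIZE
    struct.pack_into("<II", page, 0, lsn >> 32, lsn & 0xFFFFFFFF)  # pd_lsn
    struct.pack_into("<H", page, 8, toy_checksum(bytes(page)))      # pd_checksum
    return bytes(page)

def verify_page(page: bytes, backup_start_lsn: int) -> str:
    if page_lsn(page) >= backup_start_lsn:
        # Modified after the backup started: the read may be torn, and WAL
        # replay (full-page writes are forced on during backup) repairs it.
        return "skipped"
    stored = struct.unpack_from("<H", page, 8)[0]
    return "ok" if stored == toy_checksum(page) else "mismatch"

start = parse_lsn("0/3000028")             # hypothetical pg_start_backup() result
cold = make_page(0x2000000, 7)             # last written before the backup began
bad = bytearray(cold)
bad[100] ^= 0xFF                           # genuine corruption on a cold page
hot = bytearray(make_page(0x3000100, 7))   # written after the backup began
hot[100] ^= 0xFF                           # simulate an inconsistent read
print(verify_page(cold, start))            # ok
print(verify_page(bytes(bad), start))      # mismatch
print(verify_page(bytes(hot), start))      # skipped
```

The trade-off is visible in the last line: a hot page is never reported, even if it is damaged, which is acceptable only because PostgreSQL itself will notice problems on pages it is actively using.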

This technique assumes that blocks can be consistently read in the order
they were written. If the second 4k (or 512 byte, etc.) block of the
fwrite is visible before the first 4k block then there would be a false
positive. I have a hard time imagining any sane buffering system
working this way, but I can't discount it.
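The ordering assumption can be illustrated with a toy simulation. pd_lsn sits at the very start of the page header, so it lives in the first 4KB chunk of an 8KB page. If that chunk becomes visible first, a torn read carries the new LSN and is skipped; if the second chunk becomes visible first, the read carries the old LSN and old checksum over a partly new body, which is exactly the false positive described above. The 4KB write granularity, the fill patterns, the little-endian header layout and the byte-sum checksum are all HYPOTHETICAL simplifications; PostgreSQL's real checksum algorithm differs.

```python
import struct

PAGE_SIZE = 8192  # PostgreSQL default BLCKSZ
OS_CHUNK = 4096   # assumed write granularity, e.g. a 4KB Linux page

def toy_checksum(page: bytes) -> int:
    # HYPOTHETICAL stand-in for PostgreSQL's checksum: 16-bit byte sum with
    # the pd_checksum field (bytes 8-9) zeroed.
    buf = bytearray(page)
    buf[8:10] = b"\x00\x00"
    return sum(buf) & 0xFFFF

def make_page(lsn: int, fill: int) -> bytes:
    page = bytearray([fill]) * PAGE_SIZE
    struct.pack_into("<II", page, 0, lsn >> 32, lsn & 0xFFFFFFFF)  # pd_lsn
    struct.pack_into("<H", page, 8, toy_checksum(bytes(page)))      # pd_checksum
    return bytes(page)

def verdict(page: bytes, start_lsn: int) -> str:
    hi, lo = struct.unpack_from("<II", page, 0)
    if ((hi << 32) | lo) >= start_lsn:
        return "skipped"  # header chunk is new: the LSN filter excludes it
    stored = struct.unpack_from("<H", page, 8)[0]
    return "ok" if stored == toy_checksum(page) else "mismatch"

def torn_read(old: bytes, new: bytes, first_chunk_new: bool) -> bytes:
    """A read racing a write: one 4KB chunk already new, the other still old."""
    if first_chunk_new:
        return new[:OS_CHUNK] + old[OS_CHUNK:]
    return old[:OS_CHUNK] + new[OS_CHUNK:]

start = 0x3000028              # hypothetical pg_start_backup() LSN
old = make_page(0x2000000, 1)  # on-disk image before the write
new = make_page(0x3000100, 2)  # image PostgreSQL is writing now

print(verdict(torn_read(old, new, True), start))   # skipped (safe)
print(verdict(torn_read(old, new, False), start))  # mismatch (false positive)
```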

It's definitely possible for pages on disk to have this characteristic
(i.e., the first block is not written first) but that should be fixed
during recovery before it is possible to take a backup.

Note that reports of page checksum errors are informational only and do
not have any effect on the backup process. Even so, we would definitely
prefer to avoid false positives. If anybody can poke a hole in this
solution then I would like to hear it.

--
-David
david(at)pgmasters(dot)net
