On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> >> > This patch uses FPIs to guard against torn hint writes, even when the
> >> > checksums are disabled. ?One could not simply condition them on the
> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
> >> > a page previously written with page_checksums=on. ?If that write tears,
> >> > leaving the checksum intact, that block will now fail verification. ?A couple
> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
> >> > Whenever the cluster starts with checksums disabled, set the field to
> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
> >> > failure occurs in a page with LSN older than the stored one, emit either a
> >> > softer warning or no message at all.
> >> We can only change page_checksums at restart (now) so the above seems moot.
> >> If we crash with FPWs enabled we repair any torn pages.
> > There's no live bug, but that comes at a high cost: the patch has us emit
> > full-page images for hint bit writes regardless of the page_checksums setting.
> Sorry, I don't understand what you mean. I don't see any failure cases
> that require that.
> page_checksums can only change at a shutdown checkpoint,
> The statement "If that write tears,
> >> > leaving the checksum intact, that block will now fail verification."
> cannot happen, ISTM.
> If we write out a block we update the checksum if page_checksums is
> set, or we clear it. If we experience a torn page at crash, the FPI
> corrects that, so the checksum never does fail verification. We only
> need to write a FPI when we write with checksums.
> If that's wrong, please explain a failure case in detail.
In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both. (This is
just the normal torn-page scenario.) Given that, consider the following
sequence of events:
1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown
5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT
9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown
12. startup with page_checksums = on
13. read P. checksum failure
Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL. That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.
> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> >> > is insufficient.
> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
> > That means also taking the buffer header spinlock in every PageGetLSN() caller
> > holding only a shared buffer content lock. ?Do you think that will pay off,
> > versus the settled pattern of trading here your shared buffer content lock for
> > an exclusive one?
> Yes, I think it will pay off. This is the only code location where we
> do that, and we are already taking the buffer header lock, so there is
> effectively no overhead.
The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header
spinlock at either of those locations. If they remain safe under the new
rules, we'll need comments explaining why. I think they may indeed be safe,
but it's far from obvious.
In response to
pgsql-hackers by date
|Next:||From: Robert Haas||Date: 2012-02-08 14:37:01|
|Subject: Re: Add protransform for numeric, varbit, and temporal types|
|Previous:||From: Peter Geoghegan||Date: 2012-02-08 13:33:30|
|Subject: Re: Progress on fast path sorting, btree index creation time|