Re: Enabling Checksums

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enabling Checksums
Date: 2013-01-27 22:28:50
Message-ID: 1359325730.7413.33.camel@jdavis-laptop
Lists: pgsql-hackers

On Sat, 2013-01-26 at 23:23 -0500, Robert Haas wrote:
> > If we were to try to defer writing the WAL until the page was being
> > written, the most it would possibly save is the small XLOG_HINT WAL
> > record; it would not save any FPIs.
>
> How is the XLOG_HINT WAL record kept small and why does it not itself
> require an FPI?

There's a maximum of one FPI per page per checkpoint cycle, and we need
the FPI for any modified page in this design regardless.

So, deferring the XLOG_HINT WAL record doesn't change the total number
of FPIs emitted. The only savings would be on the trivial XLOG_HINT WAL
record itself, because we might notice that it's not necessary in the
case where some other WAL action happened to the page.
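
To make the counting argument concrete, here is a rough sketch in Python.
It is a toy model, not PostgreSQL code: the function name, the event
format, and the rule that a page with an FPI already logged this cycle
needs no further WAL for hint bits are all assumptions made for
illustration.

```python
def wal_for_cycle(events, defer_hints):
    """Count WAL emitted during one checkpoint cycle (toy model).

    events: ordered (page, kind) pairs; kind is "hint" or "write".
    Returns (fpi_count, hint_record_count).
    """
    fpi_pages = set()   # pages that already got their one FPI this cycle
    pending = set()     # deferred mode: hinted pages not yet WAL-logged
    hint_records = 0
    for page, kind in events:
        if kind == "write":
            # A regular WAL record carries the FPI if the page has none
            # yet, and makes any deferred hint record unnecessary.
            fpi_pages.add(page)
            pending.discard(page)
        elif page not in fpi_pages:
            if defer_hints:
                pending.add(page)
            else:
                # Eager mode: log the hint record (with FPI) right away.
                fpi_pages.add(page)
                hint_records += 1
    # Deferred mode: pages still pending when they are written out need
    # the hint record, and its FPI, after all.
    hint_records += len(pending)
    fpi_pages |= pending
    return len(fpi_pages), hint_records
```

For the sequence hint(1), write(1), hint(2), write(3), hint(3), both modes
end up with 3 FPIs; the eager mode logs 2 hint records and the deferred
mode logs 1. Only the small record is saved.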

> > At first glance, it seems sound as long as the WAL FPI makes it to disk
> > before the data. But to meet that requirement, it seems like we'd need
> > to write an FPI and then immediately flush WAL before cleaning a page,
> > and that doesn't seem like a win. Do you (or Simon) see an opportunity
> > here that I'm missing?
>
> I am not sure that isn't a win. After all, we can need to flush WAL
> before flushing a buffer anyway, so this is just adding another case -

Right, but if we get the WAL record in earlier, there is a greater
chance that it goes out with some unrelated WAL flush, and we don't need
to flush the WAL to clean the buffer at all. Separating WAL insertions
from WAL flushes seems like a fairly important goal, so I'm a little
skeptical of a proposal to narrow that gap so drastically.

It's hard to analyze without a specific proposal on the table. But if
cleaning pages requires a WAL record followed immediately by a flush, it
seems like that would increase the number of actual WAL flushes we need
to do by a lot.
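
As a back-of-the-envelope illustration of that concern, here is a toy
model in Python. Everything in it is hypothetical (the Wal class, the
run function, the single "unrelated commit" between hinting and
cleaning); it only counts how many flushes happen when the record is
inserted early versus at cleaning time.

```python
class Wal:
    """Minimal WAL model: LSNs are integers, a flush covers all inserts."""

    def __init__(self):
        self.insert_lsn = 0
        self.flushed_lsn = 0
        self.flushes = 0

    def insert(self):
        self.insert_lsn += 1
        return self.insert_lsn

    def flush(self, upto):
        # Only do physical work if the requested LSN is not on disk yet.
        if upto > self.flushed_lsn:
            self.flushed_lsn = self.insert_lsn
            self.flushes += 1


def run(pages, defer):
    """Return the number of WAL flushes needed to clean `pages` buffers."""
    wal = Wal()
    page_lsn = {}
    if not defer:
        # Early insertion: log each page when its hint bits are set.
        for p in range(pages):
            page_lsn[p] = wal.insert()
    # Some unrelated transaction commits and flushes WAL.
    wal.flush(wal.insert())
    for p in range(pages):
        if defer:
            # Deferred insertion: log only when cleaning the buffer.
            page_lsn[p] = wal.insert()
        # WAL must be on disk through the page's LSN before the write.
        wal.flush(page_lsn[p])
    return wal.flushes
```

In this model, run(100, defer=False) needs 1 flush (the unrelated commit
covers everything), while run(100, defer=True) needs 101. Real workloads
sit somewhere in between, but the direction is the point.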

> and the payoff is that if the initial access to a page, setting hint
> bits, is quickly followed by a write operation, we avoid the need for
> any extra WAL to cover the hint bit change. I bet that's common,
> because if updating you'll usually need to look at the tuples on the
> page and decide whether they are visible to your scan before, say,
> updating one of them

That's a good point; I'm just not sure how to avoid that problem without
a lot of complexity or a big cost. It seems like we want to defer the
XLOG_HINT WAL record for a short time; but not wait so long that we need
to clean the buffer or miss a chance to piggyback on another WAL flush.

> > By the way, the approach I took was to add the heap buffer to the WAL
> > chain of the XLOG_HEAP2_VISIBLE WAL record when doing log_heap_visible.
> > It seemed simpler to understand than trying to add a bunch of options to
> > MarkBufferDirty.
>
> Unless I am mistaken, that's going to heavily penalize the case where
> the user vacuums an insert-only table. It will emit much more WAL
> than currently.

Yes, that's true, but I think that's pretty fundamental to this
checksums design (and of course it only applies if checksums are
enabled). We need to make sure an FPI is written and the LSN bumped
before we write a page.

That's why I was pushing a little on various proposals to either remove
or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
cut down on the less-important hint bits, etc.). Maybe those aren't
viable, but that's why I spent time on them.

There are some other options, but I cringe a little bit thinking about
them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
calculation, so that a torn page doesn't cause a problem (though
obviously that one bit would be vulnerable to corruption). Another is to
use a double-write buffer, but that didn't seem to go very far. Or, we
could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.
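
For the first of those options, a minimal sketch of what "exclude the bit
from the checksum" could look like, in Python rather than C. The
assumptions are loud ones: zlib.crc32 is only a stand-in for whatever
checksum algorithm is eventually chosen, pd_flags is taken to be a
little-endian 16-bit field at byte offset 10 of the page header, and the
function name is made up.

```python
import zlib

PAGE_SIZE = 8192
PD_FLAGS_OFFSET = 10      # assumed byte offset of pd_flags in the header
PD_ALL_VISIBLE = 0x0004   # flag bit to leave out of the checksum


def checksum_ignoring_all_visible(page: bytes) -> int:
    """Checksum a page as if PD_ALL_VISIBLE were always clear."""
    buf = bytearray(page)
    flags = int.from_bytes(buf[PD_FLAGS_OFFSET:PD_FLAGS_OFFSET + 2], "little")
    flags &= ~PD_ALL_VISIBLE  # mask the bit out on a private copy
    buf[PD_FLAGS_OFFSET:PD_FLAGS_OFFSET + 2] = flags.to_bytes(2, "little")
    # Stand-in algorithm (assumption); the real one is undecided here.
    return zlib.crc32(bytes(buf))
```

Setting or clearing PD_ALL_VISIBLE leaves the result unchanged, so a torn
write of just that bit cannot produce a checksum failure; flipping any
other bit still changes it. The cost is exactly the one noted above: that
bit is no longer protected.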

Regards,
Jeff Davis
