Re: crash-safe visibility map, take three

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: crash-safe visibility map, take three
Date: 2010-12-02 04:22:12
Message-ID: AANLkTi=DOvFWWZFNxJObeAWiEu9dyAcTxLUQCgA1fNdt@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 1, 2010 at 5:24 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2010-12-01 at 15:59 -0500, Robert Haas wrote:
>> As for CRCs, there's a pretty direct chain of inference here:
>>
>> 1. CRCs are hard (really impossible) because we have hint bits.
>
> I would disagree with "impossible". If we don't set hint bits during
> reading; and when we do set them, we log them (including full page
> writes); then we can do CRCs.
>
> Those things have costs, but we might be willing to pay them if we had a
> bulk loading strategy that avoids or mitigates the costs.
>
> The reason we can't do CRCs now is because hint bits violate the
> WAL-before-data rule; not because of hint bits themselves. We're talking
> about adding another feature that breaks the rule, in a more complex way
> than hint bits.
>
> I just wanted to step back for a second and consider the problem from a
> different angle before we committed to that.

Well, let's think about what we'd need to do to make CRCs work
reliably. There are two problems.

1. Currently, hint bits are not vulnerable to the torn-page problem,
because the hint bit change is to a single byte, and neither of the two
possible values for the affected byte invalidates the contents of the
block. Thus, they do not need to be WAL-logged - we're happy if they
all make it to disk, but if some or none of them make it to disk,
that's OK. If we CRC the entire page, torn pages are never
acceptable, so every action that modifies the page must be WAL-logged.

2. Currently, we allow hint bits on a page to be updated while holding
a shared-content lock; we also allow the page to be written while
holding only a shared-content lock. This makes it a bit
nondeterministic whether the hint bit update is included in the write,
but we don't care. If we were to compute a CRC and write that into
the page before writing it out to the OS, it would be unacceptable for
the page contents to change thereafter in any way.
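
Both problems can be seen in a toy model. The sketch below is only an
illustration under assumptions of mine: an 8192-byte page, 512-byte
sectors, a generic bitwise CRC-32, and a made-up layout that keeps the
CRC in the first four bytes. None of these names or layouts come from
the PostgreSQL source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   8192   /* assumed page size */
#define SECTOR_SIZE 512    /* assumed unit of atomic disk write */

/* Generic bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical layout: bytes 0..3 hold the CRC of bytes 4..PAGE_SIZE-1. */
static void page_set_crc(unsigned char *page)
{
    uint32_t crc = crc32_buf(page + 4, PAGE_SIZE - 4);

    memcpy(page, &crc, sizeof crc);
}

static int page_crc_ok(const unsigned char *page)
{
    uint32_t stored;

    memcpy(&stored, page, sizeof stored);
    return stored == crc32_buf(page + 4, PAGE_SIZE - 4);
}

/* Simulate a torn write: only the first nsectors sectors reach disk. */
static void torn_write(unsigned char *disk, const unsigned char *new_page,
                       int nsectors)
{
    assert(nsectors >= 0 && nsectors <= PAGE_SIZE / SECTOR_SIZE);
    memcpy(disk, new_page, (size_t) nsectors * SECTOR_SIZE);
}
```

If a hint bit is flipped near the end of the page and only the first
sector (carrying the new CRC) reaches disk, page_crc_ok() fails on the
on-disk copy even though the old and new page contents are both
perfectly usable. Likewise, flipping a byte after page_set_crc() has
run, which is what a hint bit update under a shared lock during
write-out amounts to, leaves a page whose stored CRC no longer matches.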

So, to make CRCs work, we'd need to (a) WAL-log every hint bit update
and (b) change either hint bit updates or page write-outs to require
an exclusive content lock rather than a shared one. The first would
result in an increase in I/O, while the second would result in a
reduction in concurrency. Thinking about it a bit, I wonder if we
couldn't mitigate (b) quite a bit by adding a new level for buffer
content locks, share-exclusive. This would conflict with itself and
with exclusive but not with share locks, and would be required to set
hint bits or write the buffer. When setting hint bits with only a
share lock, we'd attempt to do a non-blocking upgrade to
share-exclusive. If that failed - because someone else already held a
share-exclusive lock - we'd just skip the hint bit update. I have no
idea what to do about (a), though.
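
To make the share-exclusive idea concrete, here is a rough sketch of
the conflict rule and the non-blocking upgrade. It is a single-threaded
model, not real locking code, and every name in it (ContentLock,
try_upgrade_to_share_exclusive, and so on) is hypothetical rather than
taken from the existing buffer manager.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODE_SHARE, MODE_SHARE_EXCLUSIVE, MODE_EXCLUSIVE } LockMode;

/* Single-threaded model of one buffer's content lock (not thread-safe). */
typedef struct {
    int  share_holders;          /* plain share holders */
    bool share_exclusive_held;   /* at most one holder */
    bool exclusive_held;
} ContentLock;

/* Share-exclusive conflicts with itself and with exclusive only. */
static bool modes_conflict(LockMode a, LockMode b)
{
    if (a == MODE_EXCLUSIVE || b == MODE_EXCLUSIVE)
        return true;
    return a == MODE_SHARE_EXCLUSIVE && b == MODE_SHARE_EXCLUSIVE;
}

/* Caller already holds a share lock; never blocks. */
static bool try_upgrade_to_share_exclusive(ContentLock *lock)
{
    assert(lock->share_holders > 0 && !lock->exclusive_held);
    if (lock->share_exclusive_held)
        return false;            /* someone else is writing or hinting */
    lock->share_holders--;
    lock->share_exclusive_held = true;
    return true;
}

static void downgrade_to_share(ContentLock *lock)
{
    assert(lock->share_exclusive_held);
    lock->share_exclusive_held = false;
    lock->share_holders++;
}

/* Set a hint bit if the upgrade succeeds; otherwise skip the update. */
static bool try_set_hint_bit(ContentLock *lock, unsigned char *infomask,
                             unsigned char bit)
{
    if (!try_upgrade_to_share_exclusive(lock))
        return false;
    *infomask |= bit;
    downgrade_to_share(lock);
    return true;
}
```

A backend that loses the race simply gets false back and moves on; the
hint bit can be set by whoever looks at the tuple next.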

*thinks some more*

Or maybe I do. One other thing I've been thinking about with regard
to hint bit updates is that we might choose to mark pages that are
hint-bit-updated as "untidy" rather than "dirty". The background
writer could treat these pages as dirty, but checkpoints and backends
doing desperation-buffer-reclamation could treat them as clean. This
would allow hint bit updates to trickle out to disk in the background,
without letting them bottleneck anything on the critical path. Maybe
we could do this - if CRCs are enabled and we are the background
writer cleaning scan, write dirty buffers in the usual way and write
untidy buffers to a "double-write buffer" (to borrow a page from
InnoDB) along with the current LSN. At the conclusion of the scan,
fsync() the double-write buffer and then write the buffers a second
time in the normal fashion if their mappings haven't changed and they
are still untidy. On redo, when you reach an LSN recorded in the
double-write buffer, restore the FPI. In general, a double-write
buffer is inferior to our existing FPI system, because you end up
needing to fsync both the double-write buffer and the WAL stream. But
it might be OK in this case, if it's all happening as background work.
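
Spelled out as a decision table, the untidy idea might look something
like the sketch below. The enum and function names are invented for
illustration, and the double-write path is only named here, not
implemented.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BUF_CLEAN, BUF_UNTIDY, BUF_DIRTY } BufState;
typedef enum { BY_BGWRITER, BY_CHECKPOINT, BY_BACKEND_RECLAIM } Writer;
typedef enum { ACT_SKIP, ACT_WRITE_NORMAL, ACT_WRITE_DOUBLE } Action;

/* A hint bit update makes a clean buffer untidy; dirty stays dirty. */
static BufState after_hint_bit_update(BufState state)
{
    return state == BUF_CLEAN ? BUF_UNTIDY : state;
}

/* What each kind of writer does with a buffer in a given state. */
static Action choose_action(BufState state, Writer who, bool crc_enabled)
{
    switch (state) {
    case BUF_CLEAN:
        return ACT_SKIP;
    case BUF_DIRTY:
        return ACT_WRITE_NORMAL;
    case BUF_UNTIDY:
        /* Checkpoints and buffer reclamation treat untidy as clean. */
        if (who != BY_BGWRITER)
            return ACT_SKIP;
        /* With CRCs, go through the double-write buffer first. */
        return crc_enabled ? ACT_WRITE_DOUBLE : ACT_WRITE_NORMAL;
    }
    assert(false);               /* unknown buffer state */
    return ACT_SKIP;
}
```

Only the background writer ever pays for an untidy buffer, which is the
point: nothing on the critical path waits for hint bits to reach disk.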

--

With respect to your concerns about this method, after some thought, I
think #2 isn't an issue at all, because I don't believe we can risk
having our update to HEAP_XMIN_FROZEN stomped on by someone else
trying to set HEAP_XMIN_COMMITTED, so I think that when making a page
all-visible we'll need an exclusive (or share-exclusive) content lock
anyway. As to #1, I think we could restore the WAL-before-data rules
if we kept a bit somewhere in the buffer descriptor indicating whether
a given buffer has had an FPI since the last checkpoint. Then,
perhaps, WAL records that are torn-page-safe could bump the LSN
without emitting an FPI. The next WAL record to come along would be
able to determine that one was still needed. Of course, to make CRCs
work with this, you still need to emit FPIs or use a double-write
buffer. That sucks, and I don't know what to do about it. Since our
current hint-bit updates are not WAL-logged, a CRC implementation over
it could try to get by with chucking untidy buffers (either all the
time or just sometimes) without actually writing them. But these
updates WILL be WAL-logged, so you can't just refuse to write them
after the fact. Hmm...
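
For what it's worth, the buffer-descriptor bit for #1 could be modeled
roughly as below. This is a sketch under my own assumptions: BufDesc,
log_change and on_checkpoint are hypothetical names, LSNs are plain
integers, and nothing here touches real WAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slice of a buffer descriptor. */
typedef struct {
    uint64_t page_lsn;               /* LSN of last record for this page */
    bool     fpi_since_checkpoint;   /* has an FPI been emitted yet? */
} BufDesc;

/*
 * Record a change to the page.  Returns true if this record must carry
 * a full-page image.  A torn-page-safe record advances the page LSN but
 * leaves the flag alone, so the next ordinary record still emits one.
 */
static bool log_change(BufDesc *buf, uint64_t rec_lsn, bool torn_page_safe)
{
    bool emit_fpi;

    assert(rec_lsn > buf->page_lsn);   /* LSNs only move forward */
    emit_fpi = !torn_page_safe && !buf->fpi_since_checkpoint;
    if (emit_fpi)
        buf->fpi_since_checkpoint = true;
    buf->page_lsn = rec_lsn;
    return emit_fpi;
}

/* Each checkpoint resets the flag: the next change needs a fresh FPI. */
static void on_checkpoint(BufDesc *buf)
{
    buf->fpi_since_checkpoint = false;
}
```

The decision no longer depends on comparing the page LSN with the redo
pointer, so a torn-page-safe record can advance the LSN without hiding
the fact that no FPI has been taken yet.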

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
