Re: Protecting against unexpected zero-pages: proposal

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Protecting against unexpected zero-pages: proposal
Date: 2010-11-09 21:50:23
Lists: pgsql-hackers

On Tue, Nov 9, 2010 at 2:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
>>> So, for getting checksums, we have to offer up a few things:
>>> 1) zero-copy writes, we need to buffer the write to get a consistent
>>> checksum (or lock the buffer tight)
>>> 2) saving hint-bits on an otherwise unchanged page.  We either need to
>>> just not write that page, and lose the work the hint-bits did, or do
>>> a full-page WAL of it, so the torn-page checksum is fixed
>> Actually the consensus the last go-around on this topic was to
>> segregate the hint bits into a single area of the page and skip them
>> in the checksum. That way we don't have to do any of the above. It's
>> just that that's a lot of work.
> And it still allows silent data corruption, because bogusly clearing a
> hint bit is, at the moment, harmless, but bogusly setting one is not.
> I really have to wonder how other products handle this.  PostgreSQL
> isn't the only database product that uses MVCC - not by a long shot -
> and the problem of detecting whether an XID is visible to the current
> snapshot can't be ours alone.  So what do other people do about this?
> They either don't cache the information about whether the XID is
> committed in-page (in which case, are they just slower or do they have
> some other means of avoiding the performance hit?) or they cache it in
> the page (in which case, they either WAL log it or they don't checksum
> it).  I mean, there aren't any other options, are there?

An examination of the MySQL source code reveals their answer. In
row_vers_build_for_semi_consistent_read(), which I can't swear is the
right place but seems to be, there is this comment:

/* We assume that a rolled-back transaction stays in
TRX_ACTIVE state until all the changes have been
rolled back and the transaction is removed from
the global list of transactions. */

Which makes sense. If you never leave rows from aborted transactions
in the heap forever, then the list of aborted transactions that you
need to remember for MVCC purposes will remain relatively small and
you can just include those XIDs in your MVCC snapshot. Our problem is
that we have no particular bound on the number of aborted transactions
whose XIDs may still be floating around, so we can't do it that way.

<dons asbestos underpants>

To impose a similar bound in PostgreSQL, you'd need to maintain the
set of aborted XIDs and the relations that need to be vacuumed for
each one. As you vacuum, you prune any tuples with aborted xmins
(which is WAL-logged already anyway) and additionally WAL-log clearing
the xmax for each tuple with an aborted xmax. Thus, when you
finish vacuuming the relation, the aborted XID is no longer present
anywhere in it. When you vacuum the last relation for a particular
XID, that XID no longer exists in the relation files anywhere and you
can remove it from the list of aborted XIDs. I think that WAL logging
the list of XIDs and list of unvacuumed relations for each at each
checkpoint would be sufficient for crash safety. If you did this, you
could then assume that any XID which precedes your snapshot's xmin is committed.
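
The bookkeeping described above might look roughly like this in Python. It
is a sketch under assumptions: AbortedXidRegistry and its method names are
made up, relations are identified by plain strings, and a real
implementation would live in shared memory rather than in a dict.

```python
class AbortedXidRegistry:
    """Hypothetical map of aborted XID -> relations that still contain it."""

    def __init__(self):
        self._pending = {}  # xid -> set of relation names awaiting vacuum

    def record_abort(self, xid, touched_relations):
        """Remember an aborted XID and the relations it wrote to."""
        relations = set(touched_relations)
        if relations:  # an abort that wrote nothing needs no cleanup
            self._pending[xid] = relations

    def relation_vacuumed(self, relation):
        """Note that vacuum removed all aborted xmins/xmaxes from `relation`.

        Returns the XIDs that no longer appear in any relation.
        """
        retired = []
        for xid, relations in self._pending.items():
            relations.discard(relation)
            if not relations:
                retired.append(xid)
        for xid in retired:
            del self._pending[xid]
        return sorted(retired)

    def aborted_xids(self):
        """XIDs that every new snapshot would still have to carry."""
        return frozenset(self._pending)

    def checkpoint_record(self):
        """Contents that would be WAL-logged at each checkpoint."""
        return {xid: sorted(rels) for xid, rels in sorted(self._pending.items())}
```

One simplification to be aware of: relation_vacuumed() assumes the vacuum
pass saw every abort registered so far. An abort recorded while that pass
was already running would need another pass before its XID could be
retired.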

1. When a big abort happens, you may have to carry that XID around in
every snapshot - and avoid advancing RecentGlobalXmin - for quite a
long time.
2. You have to WAL log marking the XMAX of an aborted transaction invalid.
3. You have to WAL log the not-yet-cleaned-up XIDs and the relations
each one needs vacuumed at each checkpoint.
4. There would presumably be some finite limit on the size of the
shared memory structure for aborted transactions. I don't think
there'd be any reason to make it particularly small, but if you sat
there and aborted transactions at top speed you might eventually run
out of room, at which point any transactions you started wouldn't be
able to abort until vacuum made enough progress to free up an entry.
5. It would be pretty much impossible to run with autovacuum turned
off, and in fact you would likely need to make it a good deal more
aggressive in the specific case of aborted transactions, to mitigate
problems #1, #3, and #4.
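
Problem #4 can be pictured as a fixed-size table, sketched here in Python.
BoundedAbortTable, its capacity parameter, and its method names are
assumptions for illustration; nothing like it exists in PostgreSQL.

```python
class BoundedAbortTable:
    """Hypothetical fixed-capacity set of aborted, not-yet-vacuumed XIDs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._xids = set()

    def try_record_abort(self, xid):
        """Return True if the abort was recorded, False if the table is full."""
        if xid in self._xids:
            return True
        if len(self._xids) >= self.capacity:
            # The transaction cannot abort until vacuum frees an entry.
            return False
        self._xids.add(xid)
        return True

    def release(self, xid):
        """Called once vacuum has removed `xid` from every relation."""
        self._xids.discard(xid)

    def aborted_xids(self):
        return frozenset(self._xids)
```

With a capacity of 2, a third abort is refused until release() is called
for one of the first two, which is why autovacuum would have to keep up.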

I'm not sure how bad those things would be, or if there are more that
I'm missing (besides the obvious "it would be a lot of work").

Robert Haas
The Enterprise PostgreSQL Company
