Re: Logical to physical page mapping

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical to physical page mapping
Date: 2012-10-29 11:05:39
Message-ID: CA+TgmobE70334xF2-xdPQz-xfc_WN9FQYATsD7QJ=8SAaQNhNw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 27, 2012 at 1:01 AM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> The reason why we need full_page_writes is that we need to guard against
> torn pages or partial writes. So what if smgr would manage a mapping between
> logical page numbers and their physical location in the relation?

This sounds a lot like http://en.wikipedia.org/wiki/Shadow_paging

According to my copy of Gray and Reuter, shadow paging is in fact a
workable way of providing atomicity and durability, but as of its
writing (1992) shadow paging had been essentially abandoned because it
didn't have very good performance characteristics. One of the big
problems is that you lose locality of reference - e.g. there's nothing
at all sequential about a sequential scan if, below the mapping layer,
the blocks are scattered about the disk, which is a likely outcome, if
they are frequently updated, or in the long run even if they are only
occasionally updated.
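
To make the locality problem concrete, here is a toy sketch (not
PostgreSQL code; the class and function names are made up for
illustration). It assumes a block-level map where every update lands in
a freshly allocated physical slot, and counts how many non-adjacent
jumps a logical-order scan then has to make:

```python
class ShadowPagedRelation:
    """Toy copy-on-write page store: every update goes to a new physical slot."""

    def __init__(self, n_pages):
        # Initially logical page i lives at physical slot i.
        self.mapping = list(range(n_pages))
        self.next_free = n_pages

    def update(self, logical):
        # Never overwrite in place: allocate a fresh slot, then switch the map.
        self.mapping[logical] = self.next_free
        self.next_free += 1

    def seq_scan_layout(self):
        # Physical slots visited by a scan in logical page order.
        return list(self.mapping)


def count_seeks(layout):
    # A "seek" is any step where the next physical slot is not adjacent.
    return sum(1 for a, b in zip(layout, layout[1:]) if b != a + 1)


rel = ShadowPagedRelation(8)
seeks_before = count_seeks(rel.seq_scan_layout())
for page in (5, 2, 5, 0):
    rel.update(page)
layout = rel.seq_scan_layout()
seeks_after = count_seeks(layout)
```

Four updates to an eight-page relation are enough to turn a fully
sequential scan into one that jumps on most steps.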

It's occurred to me before that this might work if we did it not at
the block level but at some higher level, with, say, 64MB segments.
That wouldn't impinge too much on sequential access, but it
would allow vacuum to clip out an entire 64MB segment anywhere in the
relation if it happened to be empty, or perhaps to rewrite a 64MB
segment of a relation without rewriting the whole thing. But it
wouldn't do anything about torn pages.
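
A rough sketch of what a segment-level map might look like, assuming
the default 8kB block size; SegmentMap and its methods are hypothetical
names, not anything in smgr today. Blocks inside a segment stay
contiguous, so only the segment lookup is indirect:

```python
BLOCK_SIZE = 8192                                  # PostgreSQL's default block size
SEGMENT_BYTES = 64 * 1024 * 1024                   # the 64MB figure from above
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_SIZE   # 8192 blocks per segment


class SegmentMap:
    """Hypothetical logical-to-physical map kept per segment, not per block."""

    def __init__(self, n_segments):
        # Logical segment i initially lives in physical segment i.
        self.physical = list(range(n_segments))
        self.next_physical = n_segments

    def locate(self, block_no):
        # Only the segment is remapped; the offset within it is unchanged.
        seg, offset = divmod(block_no, BLOCKS_PER_SEGMENT)
        phys = self.physical[seg]
        if phys is None:
            raise LookupError(f"logical segment {seg} has been clipped out")
        return phys, offset

    def clip(self, seg):
        # Vacuum found the segment empty: drop its storage, keep the numbering.
        self.physical[seg] = None

    def rewrite(self, seg):
        # Rewrite one segment into fresh storage without touching the others.
        self.physical[seg] = self.next_physical
        self.next_physical += 1
        return self.physical[seg]
```

Note that a page write inside a segment is still an in-place write
here, which is why this does nothing for torn pages.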

Another idea for torn-page avoidance (one which is used by MySQL, and
which was previously proposed by VMware for inclusion in PostgreSQL)
is that of a double-write buffer - i.e.
instead of including full page images in WAL, write them to the
double-write buffer; if we crash, start by restoring all the pages
from the double-write buffer; then, replay WAL. This avoids passing
the full-page images through the WAL stream sent from master to slave,
because the slave can have its own double-write buffer. This would
probably also allow slaves to perform restart-points at arbitrary
locations independent of where the master performs checkpoints. In
the patch as proposed, the double-write buffer was kept very small, in
the hopes of keeping it within the presumed battery-backed write cache
(BBWC), so that very-frequent fsyncs would all reference the same pages
and therefore
all be absorbed by the cache. This delivers terrible performance
without a BBWC, though, because the fsyncs are so frequent.
Alternatively, you could imagine a large double-write buffer which
only gets flushed once per checkpoint cycle or so - i.e. basically
what we have now, but just separating the FPW traffic from the main
WAL stream.
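
The recovery order described above can be sketched as follows. This is
a toy model under stated assumptions, not the proposed patch: ToyStore
is a made-up name, pages are short byte strings, WAL records are
(page_no, offset, payload) tuples, and fsync and LSN checks are left
out entirely:

```python
class ToyStore:
    """Toy model of a data file protected by a double-write buffer."""

    def __init__(self):
        self.data = {}   # page_no -> bytes, the main data file
        self.dwb = {}    # page_no -> bytes, the double-write buffer

    def flush(self, page_no, image, crash_midway=False):
        # 1. The full image goes to the double-write buffer first.
        self.dwb[page_no] = image
        if crash_midway:
            # Simulate a torn write: only the first half of the new image
            # reaches the data file before the crash.
            old = self.data.get(page_no, b"\0" * len(image))
            half = len(image) // 2
            self.data[page_no] = image[:half] + old[half:]
            return
        # 2. Only then is the page written in place.
        self.data[page_no] = image

    def recover(self, wal):
        # Restore every page held in the double-write buffer ...
        for page_no, image in self.dwb.items():
            self.data[page_no] = image
        # ... then replay WAL on top of the now-intact pages.
        for page_no, offset, payload in wal:
            page = bytearray(self.data[page_no])
            page[offset:offset + len(payload)] = payload
            self.data[page_no] = bytes(page)


store = ToyStore()
store.flush(1, b"AAAAAAAA")
store.flush(1, b"BBBBBBBB", crash_midway=True)
torn = store.data[1]
store.recover(wal=[(1, 0, b"C")])
recovered = store.data[1]
```

Because the intact image comes from the double-write buffer, the WAL
records themselves only need to carry the small change.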

Indeed, you could extend that a bit further: why throw out the
double-write buffer just because there's been a checkpoint cycle? In
a workload like pgbench, it seems likely that the same pages will be
written over and over again. You could have a checkpoint whose
purpose is only to minimize the recovery time in cases where no pages
are torn. You could then also have a less frequent "super-checkpoint"
cycle and retain WAL back to the last "super-checkpoint". In the
hopefully-unlikely event that we detect a torn page (through a checksum
failure, presumably) then we hunt backwards through WAL (something our
current infrastructure doesn't really support) and find the last FPI
for that torn page and then begin selective replay from that point,
scanning through all of the WAL since the last super-checkpoint and
replaying all and only records pertaining to that page. But when no
pages are torn then you only need to recover from the last "normal"
checkpoint. I have heard reports (on this mailing list, I think) that
Oracle does something like this, but I haven't tried to verify for
myself whether that is in fact the case.
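
The torn-page repair path could look roughly like this. It is a toy
sketch under assumptions of my own: WalRecord and repair_torn_page are
hypothetical names, the list holds the WAL retained since the last
super-checkpoint, and a "delta" is modeled as text appended to the page
image:

```python
from dataclasses import dataclass


@dataclass
class WalRecord:
    lsn: int
    page: int
    kind: str       # "fpi" (full-page image) or "delta"
    payload: str


def repair_torn_page(wal, page):
    """Rebuild one page from WAL retained since the last super-checkpoint."""
    # Hunt backwards for the most recent full-page image of this page.
    start = None
    for i in range(len(wal) - 1, -1, -1):
        if wal[i].page == page and wal[i].kind == "fpi":
            start = i
            break
    if start is None:
        raise LookupError(f"no full-page image retained for page {page}")

    # Selective replay: apply all and only the later records for this page.
    image = wal[start].payload
    for rec in wal[start + 1:]:
        if rec.page == page and rec.kind == "delta":
            image += rec.payload
    return image


wal = [
    WalRecord(1, 7, "fpi", "base"),
    WalRecord(2, 7, "delta", "+a"),
    WalRecord(3, 9, "delta", "+z"),
    WalRecord(4, 7, "fpi", "base2"),
    WalRecord(5, 7, "delta", "+b"),
    WalRecord(6, 9, "fpi", "other"),
    WalRecord(7, 7, "delta", "+c"),
]
repaired = repair_torn_page(wal, 7)
```

Records for other pages are skipped entirely, and a page with no
retained FPI cannot be repaired this way, which is what bounds how far
back WAL has to be kept.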

Yet another idea we've tossed around is to make only vacuum records
include FPWs, and have the more common heap insert/update/delete
operations include enough information that they can still be applied
correctly even if the page has been "torn" by the previous replay of
such a record. This would involve modifying the recovery algorithm so
that, until consistency is reached, we replay all records, regardless
of LSN, which would cost some extra I/O, but maybe not too much to
live with? It would also require that, for example, a heap-insert
record mention the line pointer index used for the insertion;
currently, we count on the previous state of the page to tell us that.
For checkpoint cycles of reasonable length, the cost of storing the
line pointer in every WAL record seems like it'll be less than the
cost of needing to write an FPI for the page once per checkpoint cycle,
but this isn't certain to be the case for all workloads.
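
To illustrate why naming the line pointer index matters, here is a toy
comparison. It is not PostgreSQL's record layout: a page is just a list
of slots, and both function names are made up. A record that relies on
the page's prior state is not safe to apply twice, while one that names
its slot is:

```python
def replay_insert_implicit(page, tuple_data):
    # Relies on the previous state of the page: take the next free slot.
    page.append(tuple_data)


def replay_insert_explicit(page, slot, tuple_data):
    # The record names its slot, so replaying it again changes nothing.
    while len(page) <= slot:
        page.append(None)
    page[slot] = tuple_data


implicit_page = []
replay_insert_implicit(implicit_page, "t1")
replay_insert_implicit(implicit_page, "t1")    # replayed a second time

explicit_page = []
replay_insert_explicit(explicit_page, 0, "t1")
replay_insert_explicit(explicit_page, 0, "t1")  # replayed a second time
```

The implicit version ends up with the tuple inserted twice; the
explicit version ends up with it once, whatever state the page was in
before replay.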

OK, I'll stop babbling now...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
