Re: crash-safe visibility map, take three

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: crash-safe visibility map, take three
Date: 2010-12-01 16:25:36
Message-ID: AANLkTi=m063OeTLPWzixSDahsO9Aep9uRSs0DHYNMdgp@mail.gmail.com
Lists: pgsql-hackers
On Wed, Dec 1, 2010 at 10:36 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Oh, we don't update the LSN when we set the PD_ALL_VISIBLE flag?  OK,
> please let me think some more.  Thanks.

As far as I can tell, there are basically two viable solutions on the
table here.

1. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN; (b) set the
bit in the visibility map page, bumping the LSN as usual, and (c) emit
a WAL record indicating the relation and block number.  On redo of
this record, set both the page-level bit and the visibility map bit.
The heap page may hit the disk before the WAL record, but that's OK;
it just might result in a little extra work until some subsequent
operation gets the visibility map bit set.  The visibility map page
may hit the disk before the heap page, but that's OK too, because
the WAL record will already be on disk due to the LSN interlock.  If a
crash occurs before the heap page is flushed, redo will fix the heap
page.  (The heap page will get flushed as part of the next checkpoint,
if not sooner, so by the time the redo pointer advances past the WAL
record, there's no longer a risk.)
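
The ordering rules in approach #1 can be sketched like this (a
minimal, illustrative C model; the structures and function names are
made up for the example and are not PostgreSQL's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative, simplified state; not PostgreSQL's real structures. */
typedef struct { bool all_visible; uint64_t lsn; } HeapPage;
typedef struct { bool bit_set;     uint64_t lsn; } VMPage;
typedef struct { bool exists;      uint64_t lsn; } WalRecord;

static uint64_t next_lsn = 1;

/* Steps (a)-(c): set the page-level bit without bumping the heap
 * page's LSN, set the VM bit bumping the VM page's LSN as usual, and
 * emit a WAL record identifying the relation and block. */
static void mark_all_visible(HeapPage *heap, VMPage *vm, WalRecord *rec)
{
    rec->exists = true;         /* (c) emit WAL record */
    rec->lsn = next_lsn++;
    heap->all_visible = true;   /* (a) heap page LSN left alone */
    vm->bit_set = true;         /* (b) VM page LSN bumped */
    vm->lsn = rec->lsn;
}

/* Redo: set both the page-level bit and the visibility map bit, so a
 * heap page update lost in a crash is repaired from WAL. */
static void redo_all_visible(HeapPage *heap, VMPage *vm,
                             const WalRecord *rec)
{
    if (rec->exists)
    {
        heap->all_visible = true;
        vm->bit_set = true;
    }
}
```

The point of the sketch is the crash case: if the heap page update is
lost before a flush, replaying the record restores both bits.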

2. Every time we observe a page as all-visible, (a) set the
PD_ALL_VISIBLE bit on the page, without bumping the LSN, (b) set the
bit in the visibility map page, bumping the LSN if a WAL record is
issued (which only happens sometimes, read on), and (c) emit a WAL
record indicating the "chunk" of 128 visibility map bits which
contains the bit we just set - but only if we're now dealing with a
new group of 128 visibility map bits or if a checkpoint has intervened
since the last such record we emitted.  On redo of this record, clear
the visibility map bits in that chunk.  The heap page may hit the
disk before the WAL record, but that's OK for the same reasons as in
plan #1.  The visibility map page may hit the disk before the heap
page, but that's OK too, because the WAL record will already be on
disk due to the LSN interlock.  If a crash occurs before the heap
page makes it to disk, then redo will clear the visibility map bits,
leaving them to be reset by a subsequent VACUUM.
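
The decision of when approach #2 actually emits WAL might look
roughly like this (illustrative C only; the chunk size constant,
variable names, and checkpoint counter are assumptions for the
example, not actual PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VM_BITS_PER_CHUNK 128   /* group size from the proposal */

/* Hypothetical bookkeeping: which chunk we last logged, and under
 * which checkpoint. */
static int64_t  last_logged_chunk = -1;
static uint64_t last_logged_checkpoint = 0;

/* Return true if setting the VM bit for heap block 'blkno' requires
 * a new WAL record: only when we move into a new 128-bit chunk, or a
 * checkpoint has intervened since the last record we emitted. */
static bool need_wal_record(uint32_t blkno, uint64_t checkpoint_count)
{
    int64_t chunk = blkno / VM_BITS_PER_CHUNK;

    if (chunk != last_logged_chunk ||
        checkpoint_count != last_logged_checkpoint)
    {
        last_logged_chunk = chunk;
        last_logged_checkpoint = checkpoint_count;
        return true;    /* emit a record covering this whole chunk */
    }
    return false;       /* chunk already covered since last checkpoint */
}
```

So a VACUUM sweeping sequentially through a table pays for one record
per 128 heap pages rather than one per page, at the cost of redo
conservatively clearing the whole chunk.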

As is typical with good ideas, neither of these seems terribly
complicated in retrospect.  Kudos to Heikki for thinking them up and
explaining them.

After some thought, I think that approach #1 is probably better,
because it propagates visibility map bits to the standby.  During
index-only scans, the standby will have to ignore them during HS
operation just as it currently ignores the PD_ALL_VISIBLE page-level
bit, but if and when the standby is promoted to master, it's important
to have those bits already set, both for index-only scans and also
because, absent that, the first autovacuum on each table will end up
scanning the whole thing and dirtying tremendous gobs of data setting
all those bits, which is just the sort of ugly surprise that we don't
want to give people right after they've been forced to perform a
failover.

I think we can improve this a bit further by also introducing a
HEAP_XMIN_FROZEN bit that we set in lieu of overwriting XMIN with
FrozenXID.  This allows us to freeze tuples aggressively - if we want
- without losing any forensic information.  We can then modify the
above algorithm slightly, so that when we observe that a page is all
visible, we not only set PD_ALL_VISIBLE on the page but also
HEAP_XMIN_FROZEN on each tuple.  The WAL record marking the page as
all-visible then doubles as a WAL record marking it frozen,
eliminating the need to dirty the page yet again at anti-wraparound
vacuum time.  It'll still be a net increase in WAL volume (as Heikki
pointed out) but the added WAL volume is small compared with the I/O
involved in writing out the dirty heap pages (as Tom pointed out), so
it should hopefully be OK.
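
The freezing idea could be sketched like this (illustrative C; the
flag value and tuple layout are invented for the example and do not
match PostgreSQL's real tuple header):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit; the real infomask encoding would differ. */
#define HEAP_XMIN_FROZEN 0x0200

/* Simplified stand-in for a heap tuple header. */
typedef struct
{
    uint32_t xmin;       /* inserting transaction ID */
    uint16_t infomask;   /* hint/flag bits */
} TupleHeader;

/* Freeze by setting a flag bit, keeping the original xmin for
 * forensic purposes, instead of overwriting xmin with FrozenXID. */
static void freeze_tuple(TupleHeader *tup)
{
    tup->infomask |= HEAP_XMIN_FROZEN;
}
```

Because the original xmin survives, aggressive freezing at
all-visible time costs no forensic information, and the same WAL
record can cover both state changes.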

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
