crash-safe visibility map, take three

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: crash-safe visibility map, take three
Date: 2010-11-30 04:57:25
Message-ID: AANLkTik49aAd_AY3-HyT+3vmSUuGW7Qy5LaT+q3aLQEi@mail.gmail.com

Last week, I posted a couple of possible designs for making the
visibility map crash-safe, which did not elicit much comment. Since
this is an important prerequisite to index-only scans, I'm trying
again.

http://archives.postgresql.org/pgsql-hackers/2010-11/msg01474.php
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01529.php

Here's a third possible design.

Instead of representing each page with a single bit in the visibility
map, use two bits. One bit indicates whether all tuples on the page
are all-visible (same as the current bit) - call this VM_ALL_VISIBLE.
The other bit is only used during VACUUM and indicates whether VACUUM
is trying to set the all-visible bit - call this
VM_BECOMING_ALL_VISIBLE. We observe the rule that any operation that
clears PD_ALL_VISIBLE must clear both the VM_ALL_VISIBLE and
VM_BECOMING_ALL_VISIBLE bits for that page in the visibility map.
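
Roughly, I'm imagining a layout along these lines (a standalone
sketch; the names and exact bit positions are invented for
illustration, not actual visibilitymap.c code):

#include <stdint.h>
#include <stddef.h>

/* Two bits per heap page, so four heap pages per map byte.  In each
 * two-bit slot, bit 0 is VM_ALL_VISIBLE and bit 1 is
 * VM_BECOMING_ALL_VISIBLE. */
#define BITS_PER_HEAPPAGE       2
#define PAGES_PER_BYTE          (8 / BITS_PER_HEAPPAGE)
#define VM_ALL_VISIBLE          0x01
#define VM_BECOMING_ALL_VISIBLE 0x02

/* Any operation that clears PD_ALL_VISIBLE on a heap page must also
 * clear both map bits for that page, i.e. go through something like
 * this. */
static void
vm_clear_both(uint8_t *map, size_t heap_blkno)
{
    size_t  byteno = heap_blkno / PAGES_PER_BYTE;
    int     shift = (heap_blkno % PAGES_PER_BYTE) * BITS_PER_HEAPPAGE;

    map[byteno] &= ~((VM_ALL_VISIBLE | VM_BECOMING_ALL_VISIBLE) << shift);
}
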
VACUUM proceeds as follows:

1. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock for long enough to clear
them.
2. Scan the heap. When a page is observed to be all-visible, set
VM_BECOMING_ALL_VISIBLE and PD_ALL_VISIBLE.
3. Loop over shared buffers and write out to the OS every page that
belongs to the target relation, was marked all-visible in step 2, and
is still dirty. Note that this may require a WAL flush first, since a
dirty page can't be written out until the WAL covering its last change
has been flushed.
4. fsync() the heap.
5. Pin each visibility map page. If any VM_BECOMING_ALL_VISIBLE bits
are set, take the exclusive content lock, clear each such bit, set the
corresponding VM_ALL_VISIBLE bits and XLOG the page.
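
With the two-bits-per-page layout sketched above, step 5's
transformation can be done a whole map byte at a time: the BECOMING
bits sit in the odd bit positions and the ALL_VISIBLE bits in the
even ones, so promoting is just shifts and masks.  (Sketch only; in
the real thing each such map page would be updated under the
exclusive content lock and then XLOG'd, as described.)

/* All four VM_BECOMING_ALL_VISIBLE bits within one map byte. */
#define BECOMING_MASK 0xAA

/* Copy each BECOMING bit down onto its ALL_VISIBLE neighbor, then
 * clear the BECOMING bits. */
static uint8_t
vm_promote_byte(uint8_t b)
{
    uint8_t becoming = b & BECOMING_MASK;

    return (uint8_t) ((b | (becoming >> 1)) & ~BECOMING_MASK);
}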

One might actually want to do steps 2-5 incrementally, in 1GB chunks,
to avoid fsync()ing too much of the relation all at once.

If you tilt your head just right, the recurring problem in all of this
is that the heap page and the visibility map page can go to disk in
either order, and we have no way of knowing which will hit disk
first. A more radical
solution to this problem (and, thus, a fourth possible design) would
be to add a field to the buffer descriptor allowing one page to "wire"
another page into shared buffers. If the value is >0, it's the number
of a buffer it's currently wiring. If the value is <0, its absolute
value is the number of other buffers that have wired this buffer. A
buffer can't both wire another buffer and itself be wired at the same
time, which is what lets a single field encode both cases. If the
value is =0, everything's normal. To make this work, though, you'd
have to complicate the checkpoint logic pretty considerably - make
sure all the unwired buffers are written and fsync'd first, thus
unwiring the remaining ones to be written and fsync'd in a second
pass; and there are also possible problems with very small relations,
where the number of wired buffers might grow to an uncomfortably high
percentage of the total. Waving my hands at all this complexity, you
could then make the heap pages wire the visibility map pages.
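
To pin down the bookkeeping, here's a minimal standalone sketch of
the invariants I mean for that field (struct and function names
invented for illustration; this isn't anything in bufmgr.c):

#include <assert.h>

typedef struct BufferDescSketch
{
    int     buf_id;         /* this buffer's 1-based number */
    /* >0: buf_id of the buffer this one wires; <0: negated count of
     * buffers wiring this one; 0: neither. */
    int     wire_state;
} BufferDescSketch;

static void
wire_buffer(BufferDescSketch *heap_buf, BufferDescSketch *vm_buf)
{
    assert(heap_buf->wire_state == 0);  /* not already wiring or wired */
    assert(vm_buf->wire_state <= 0);    /* target may only be wired */

    heap_buf->wire_state = vm_buf->buf_id;
    vm_buf->wire_state--;               /* one more buffer wiring it */
}

/* Called once the wiring page has been written and fsync'd, e.g. by
 * the first checkpoint pass described above. */
static void
unwire_buffer(BufferDescSketch *heap_buf, BufferDescSketch *vm_buf)
{
    assert(heap_buf->wire_state == vm_buf->buf_id);
    assert(vm_buf->wire_state < 0);

    heap_buf->wire_state = 0;
    vm_buf->wire_state++;
}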

I can't say I'm totally in love with any of these designs. Anyone
else have any ideas, or any opinions about which one is best?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
