Re: visibility map

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: visibility map
Date: 2010-11-23 15:51:01
Message-ID: AANLkTimGPG+D=7g=MLDw+Yi7jhE6Tg3RphV+Z8PBJNNd@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 23, 2010 at 3:42 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> That's an interesting idea. You piggyback setting the vm bit on the freeze
> WAL record, on the assumption that you have to write the freeze record
> anyway. However, if that assumption doesn't hold, because the tuples are
> deleted before they reach vacuum_freeze_min_age, it's no better than the
> naive approach of WAL-logging the vm bit set separately. Whether that's
> acceptable or not, I don't know.

I don't know, either. I was trying to think of the cases where this
would generate a net increase in WAL before I sent the email, but
couldn't fully wrap my brain around it at the time. Thanks for
summarizing.

Here's another design to poke holes in:

1. Imagine that the visibility map is divided into granules. For the
sake of argument let's suppose there are 8K bits per granule; thus
each granule covers 64M of the underlying heap and 1K of space in the
visibility map itself.
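
To make the arithmetic concrete, here's a trivial standalone sketch
(assuming the usual 8K block size and one visibility map bit per heap
page; the constants are mine, purely for illustration):

#include <stdio.h>

#define BLCKSZ            8192      /* heap/VM page size in bytes */
#define BITS_PER_GRANULE  8192      /* one visibility map bit per heap page */

int
main(void)
{
    /* Each bit covers one 8K heap page, so one granule covers: */
    long    heap_bytes_per_granule = (long) BITS_PER_GRANULE * BLCKSZ;

    /* The granule itself occupies BITS_PER_GRANULE / 8 bytes of VM space. */
    long    vm_bytes_per_granule = BITS_PER_GRANULE / 8;

    printf("heap covered per granule: %ld MB\n",
           heap_bytes_per_granule / (1024L * 1024L));   /* 64 MB */
    printf("VM space per granule:     %ld KB\n",
           vm_bytes_per_granule / 1024);                /* 1 KB */
    return 0;
}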

2. In shared memory, create a new array called the visibility vacuum
array (VVA), each element of which has room for a backend ID, a
relfilenode, a granule number, and an LSN. Before setting bits in the
visibility map, a backend is required to allocate a slot in this
array, XLOG the slot allocation, and fill in its backend ID,
relfilenode number, and the granule number whose bits it will be
manipulating, plus the LSN of the slot allocation XLOG record. It
then sets as many bits within that granule as it likes. When done, it
sets the backend ID of the VVA slot to InvalidBackendId but does not
remove it from the array immediately; such a slot is said to have been
"released".

3. When visibility map bits are set, the LSN of the page is set to the
LSN of the new-VVA-slot XLOG record, so that the visibility map page
can't hit the disk before that record does. Also, the contents of
the VVA, sans backend IDs, are XLOG'd at each checkpoint. Thus, on
redo, we can compute a list of all VVA slots for which visibility-bit
changes might already be on disk; we go through and clear both the
visibility map bit and the PD_ALL_VISIBLE bits on the underlying
pages.
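
As a toy illustration of that interlock, using the stand-in types from
the sketch above (the page layout and helper are made up; the point is
just that the page LSN is advanced to the slot-allocation record's LSN,
so the ordinary WAL-before-data rule holds the page back):

typedef struct VMPage               /* toy stand-in for a VM buffer page */
{
    XLogRecPtr  lsn;                /* page LSN checked before write-out */
    uint8_t     bits[1024];         /* 8192 all-visible bits = one granule */
} VMPage;

/*
 * Set one all-visible bit and advance the page LSN to the LSN of the
 * slot-allocation record, so the buffer manager won't write this page
 * out before that record has been flushed.
 */
static void
vva_set_vm_bit(VMPage *vmpage, int bitno, XLogRecPtr slot_alloc_lsn)
{
    vmpage->bits[bitno / 8] |= (uint8_t) (1 << (bitno % 8));

    if (vmpage->lsn < slot_alloc_lsn)
        vmpage->lsn = slot_alloc_lsn;
}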

4. To free a VVA slot that has been released, we must flush XLOG as
far as the record that allocated the slot and sync the visibility map
and heap segments containing that granule. Thus, all slots released
before a checkpoint starts can be freed after it completes.
Alternatively, an individual backend can free a previously-released
slot by performing the XLOG flush and syncs itself. (This might
require a few more bookkeeping details to be stored in the VVA, but it
seems manageable.)
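
Continuing the sketch, the free path might look something like this.
The flush and sync helpers are placeholders (standing in for
XLogFlush() and segment-level fsyncs), not real functions; only the
ordering matters: flush WAL, sync the covered segments, then recycle
the slot:

static void flush_wal_to(XLogRecPtr upto) { (void) upto; }
static void sync_vm_granule(RelFileNode *rnode, uint32_t granuleNo)
    { (void) rnode; (void) granuleNo; }
static void sync_heap_segment(RelFileNode *rnode, uint32_t granuleNo)
    { (void) rnode; (void) granuleNo; }

/*
 * Free a released slot: flush WAL up through the slot-allocation
 * record, sync the VM and heap segments covered by the granule, then
 * allow the slot to be reused.
 */
static void
vva_free_slot(VisibilityVacuumSlot *slot)
{
    flush_wal_to(slot->allocLSN);
    sync_vm_granule(&slot->rnode, slot->granuleNo);
    sync_heap_segment(&slot->rnode, slot->granuleNo);

    slot->backendId = InvalidBackendId;
    slot->inUse = false;
}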

One problem with this design is that the visibility map bits never get
set on standby servers. If we don't XLOG setting the bit then I
suppose that doesn't happen now either, but it's more sucky (that's
the technical term) if you're relying on it for index-only scans
(which are also relevant on the standby, either during HS or if
promoted) versus if you're only relying on it for vacuum (which
doesn't happen on the standby anyway unless and until it's promoted).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
