Skip site navigation (1) Skip section navigation (2)

Visibility map, partial vacuums

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Visibility map, partial vacuums
Date: 2008-10-27 12:03:35
Message-ID: (view raw or flat)
Lists: pgsql-hackers
Here's finally my attempt at the visibility map, aka. the dead space 
map. It's still work-in-progress, but it's time to discuss some design 
details in detail. Patch attached, anyway, for reference.

Visibility Map is basically a bitmap, with one bit per heap page, with 
'1' for pages that are known to contain only tuples that are visible to 
everyone. Such pages don't need vacuuming, because there is no dead 
tuples, and the information can also be used to skip visibility checks. 
It should allow index-only-scans in the future, 8.5 perhaps, but that's 
not part of this patch. The visibility map is stored in a new relation 
fork, alongside the main data and the FSM.

Lazy VACUUM only needs to visit pages that are '0' in the visibility 
map. This allows partial vacuums, where we only need to scan those parts 
of the table that need vacuuming, plus all indexes.

To avoid having to update the visibility map every time a heap page is 
updated, I have added a new flag to the heap page header, 
PD_ALL_VISIBLE, which indicates the same thing as a set bit in the 
visibility map: all tuples on the page are known to be visible to 
everyone. When a page is modified, the visibility map only needs to be 
updated if PD_ALL_VISIBLE was set. That should make the impact 
unnoticeable for use cases with lots of updates, where the visibility 
map doesn't help, as only the first update on page after a vacuum needs 
to update the visibility map.

As a bonus, I'm using the PD_ALL_VISIBLE flag to skip visibility checks 
in sequential scans. That seems to give a small 5-10% speedup on my 
laptop, to a simple "SELECT COUNT(*) FROM foo" query, where foo is a 
narrow table with just a single integer column, fitting in RAM.

The critical part of this patch is to keep the PD_ALL_VISIBLE flag and 
the visibility map up-to-date, avoiding race conditions. An invariant is 
maintained: if PD_ALL_VISIBLE flag is *not* set, the corresponding bit 
in the visiblity map must also not be set. If PD_ALL_VISIBLE flag is 
set, the bit in the visibility map can be set, or not.

To modify a page:
If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared 
first. The heap page is kept pinned, but not locked, while the 
visibility map is updated. We want to avoid holding a lock across I/O, 
even though the visibility map is likely to stay in cache. After the 
visibility map has been updated, the page is exclusively locked and 
modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing 
the lock.

To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the 
page, while you observe that all tuples on the page are visible to everyone.

To set the bit in the visibility map, you need to hold a cleanup lock on 
the heap page. That keeps away other backends trying to clear the bit in 
the visibility map at the same time. Note that you need to hold a lock 
on the heap page to examine PD_ALL_VISIBLE, otherwise the cleanup lock 
doesn't protect from the race condition.

That's how the patch works right now. However, there's a small 
performance problem with the current approach: setting the 
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
1. All tuples on a page become visible to everyone. The inserting 
transaction committed, for example. A backend sees that and set 
2. Vacuum comes along, and sees that there's no work to be done on the 
page. It sets the bit in the visibility map.
3. The visibility map page is flushed to disk. The heap page is not, yet.
4. Crash

The bit in the visibility map is now set, but the corresponding 
PD_ALL_VISIBLE flag is not, because it never made it to disk.

I'm avoiding that at the moment by only setting PD_ALL_VISIBLE as part 
of a page prune operation, and forcing a WAL record to be written even 
if no other work is done on the page. The downside of that is that it 
can lead to a big increase in WAL traffic after a bulk load, for 
example. The first vacuum after the bulk load would have to write a WAL 
record for every heap page, even though there's no dead tuples.

One option would be to just ignore that problem for now, and not 
WAL-log. As long as we don't use the visibility map for anything like 
index-only-scans, it doesn't matter much if there's some bits set that 
shouldn't be. It just means that VACUUM will skip some pages that need 
vacuuming, but VACUUM FREEZE will eventually catch those. Given how 
little time we have until commitfest and feature freeze, that's probably 
the most reasonable thing to do. I'll follow up with other solutions to 
that problem, but mainly for discussion for 8.5.

Another thing that does need to be fixed, is the way that the extension 
and truncation of the visibility map is handled; that's broken in the 
current patch. I started working on the patch a long time ago, before 
the FSM rewrite was finished, and haven't gotten around fixing that part 
yet. We already solved it for the FSM, so we could just follow that 
pattern. The way we solved truncation in the FSM was to write a separate 
WAL record with the new heap size, but perhaps we want to revisit that 
decision, instead of adding again new code to write a third WAL record, 
for truncation of the visibility map. smgrtruncate() writes a WAL record 
of its own, if any full blocks are truncated away of the FSM, but we 
needed a WAL record even if no full blocks are truncated from the FSM 
file, because the "tail" of the last remaining FSM page, representing 
the truncated away heap pages, still needs to cleared. Visibility map 
has the same problem.

One proposal was to piggyback on the smgrtruncate() WAL-record, and call 
FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a 
modularity point of view; smgr.c shouldn't be calling higher-level 
functions. But maybe it wouldn't be that bad, after all. Or, we could 
remove WAL-logging from smgrtruncate() altogether, and move it to 
RelationTruncate() or another higher-level function, and handle the 
WAL-logging and replay there.

There's some side-effects of partial vacuums that also need to be fixed. 
First of all, the tuple count stored in pg_class is now wrong: it only 
includes tuples from the pages that are visited. VACUUM VERBOSE output 
needs to be changed as well to reflect that only some pages were scanned.

Other TODOs
- performance testing, to ensure that there's no significant performance 
- should add a specialized version of visibilitymap_clear() for WAL 
reaply, so that wouldn't have to rely so much on the fake relcache entries.

   Heikki Linnakangas

Attachment: visibilitymap-1.patch
Description: text/x-diff (54.5 KB)


pgsql-hackers by date

Next:From: Tom LaneDate: 2008-10-27 12:18:07
Subject: Re: new correlation metric
Previous:From: Tom LaneDate: 2008-10-27 12:02:23
Subject: Re: WIP patch: convert SQL-language functions to return tuplestores

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group