Here's finally my attempt at the visibility map, aka. the dead space
map. It's still work-in-progress, but it's time to discuss some design
details in detail. Patch attached, anyway, for reference.
Visibility Map is basically a bitmap, with one bit per heap page, with
'1' for pages that are known to contain only tuples that are visible to
everyone. Such pages don't need vacuuming, because there is no dead
tuples, and the information can also be used to skip visibility checks.
It should allow index-only-scans in the future, 8.5 perhaps, but that's
not part of this patch. The visibility map is stored in a new relation
fork, alongside the main data and the FSM.
Lazy VACUUM only needs to visit pages that are '0' in the visibility
map. This allows partial vacuums, where we only need to scan those parts
of the table that need vacuuming, plus all indexes.
To avoid having to update the visibility map every time a heap page is
updated, I have added a new flag to the heap page header,
PD_ALL_VISIBLE, which indicates the same thing as a set bit in the
visibility map: all tuples on the page are known to be visible to
everyone. When a page is modified, the visibility map only needs to be
updated if PD_ALL_VISIBLE was set. That should make the impact
unnoticeable for use cases with lots of updates, where the visibility
map doesn't help, as only the first update on page after a vacuum needs
to update the visibility map.
As a bonus, I'm using the PD_ALL_VISIBLE flag to skip visibility checks
in sequential scans. That seems to give a small 5-10% speedup on my
laptop, to a simple "SELECT COUNT(*) FROM foo" query, where foo is a
narrow table with just a single integer column, fitting in RAM.
The critical part of this patch is to keep the PD_ALL_VISIBLE flag and
the visibility map up-to-date, avoiding race conditions. An invariant is
maintained: if PD_ALL_VISIBLE flag is *not* set, the corresponding bit
in the visiblity map must also not be set. If PD_ALL_VISIBLE flag is
set, the bit in the visibility map can be set, or not.
To modify a page:
If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared
first. The heap page is kept pinned, but not locked, while the
visibility map is updated. We want to avoid holding a lock across I/O,
even though the visibility map is likely to stay in cache. After the
visibility map has been updated, the page is exclusively locked and
modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing
To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the
page, while you observe that all tuples on the page are visible to everyone.
To set the bit in the visibility map, you need to hold a cleanup lock on
the heap page. That keeps away other backends trying to clear the bit in
the visibility map at the same time. Note that you need to hold a lock
on the heap page to examine PD_ALL_VISIBLE, otherwise the cleanup lock
doesn't protect from the race condition.
That's how the patch works right now. However, there's a small
performance problem with the current approach: setting the
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
1. All tuples on a page become visible to everyone. The inserting
transaction committed, for example. A backend sees that and set
2. Vacuum comes along, and sees that there's no work to be done on the
page. It sets the bit in the visibility map.
3. The visibility map page is flushed to disk. The heap page is not, yet.
The bit in the visibility map is now set, but the corresponding
PD_ALL_VISIBLE flag is not, because it never made it to disk.
I'm avoiding that at the moment by only setting PD_ALL_VISIBLE as part
of a page prune operation, and forcing a WAL record to be written even
if no other work is done on the page. The downside of that is that it
can lead to a big increase in WAL traffic after a bulk load, for
example. The first vacuum after the bulk load would have to write a WAL
record for every heap page, even though there's no dead tuples.
One option would be to just ignore that problem for now, and not
WAL-log. As long as we don't use the visibility map for anything like
index-only-scans, it doesn't matter much if there's some bits set that
shouldn't be. It just means that VACUUM will skip some pages that need
vacuuming, but VACUUM FREEZE will eventually catch those. Given how
little time we have until commitfest and feature freeze, that's probably
the most reasonable thing to do. I'll follow up with other solutions to
that problem, but mainly for discussion for 8.5.
Another thing that does need to be fixed, is the way that the extension
and truncation of the visibility map is handled; that's broken in the
current patch. I started working on the patch a long time ago, before
the FSM rewrite was finished, and haven't gotten around fixing that part
yet. We already solved it for the FSM, so we could just follow that
pattern. The way we solved truncation in the FSM was to write a separate
WAL record with the new heap size, but perhaps we want to revisit that
decision, instead of adding again new code to write a third WAL record,
for truncation of the visibility map. smgrtruncate() writes a WAL record
of its own, if any full blocks are truncated away of the FSM, but we
needed a WAL record even if no full blocks are truncated from the FSM
file, because the "tail" of the last remaining FSM page, representing
the truncated away heap pages, still needs to cleared. Visibility map
has the same problem.
One proposal was to piggyback on the smgrtruncate() WAL-record, and call
FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a
modularity point of view; smgr.c shouldn't be calling higher-level
functions. But maybe it wouldn't be that bad, after all. Or, we could
remove WAL-logging from smgrtruncate() altogether, and move it to
RelationTruncate() or another higher-level function, and handle the
WAL-logging and replay there.
There's some side-effects of partial vacuums that also need to be fixed.
First of all, the tuple count stored in pg_class is now wrong: it only
includes tuples from the pages that are visited. VACUUM VERBOSE output
needs to be changed as well to reflect that only some pages were scanned.
- performance testing, to ensure that there's no significant performance
- should add a specialized version of visibilitymap_clear() for WAL
reaply, so that wouldn't have to rely so much on the fake relcache entries.
pgsql-hackers by date
|Next:||From: Tom Lane||Date: 2008-10-27 12:18:07|
|Subject: Re: new correlation metric |
|Previous:||From: Tom Lane||Date: 2008-10-27 12:02:23|
|Subject: Re: WIP patch: convert SQL-language functions to return tuplestores |