Per this comment in lazy_scan_heap(), almost all tuple removal these days
happens in heap_page_prune():
* Ordinarily, DEAD tuples would have been removed by
* heap_page_prune(), but it's possible that the tuple
* state changed since heap_page_prune() looked. In
* particular an INSERT_IN_PROGRESS tuple could have
* changed to DEAD if the inserter aborted. So this
* cannot be considered an error condition.
vacuumlazy.c remains responsible for noticing the LP_DEAD line pointers left
by heap_page_prune(), removing corresponding index entries, and marking those
line pointers LP_UNUSED.
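For illustration, that second pass reduces to something like the following simplified sketch. The types and flag values here are stand-ins (the real definitions are in storage/itemid.h, and the real code works on a Page with offsets, not a flat array):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's line-pointer flags; the real definitions
 * live in storage/itemid.h. */
#define LP_UNUSED 0
#define LP_NORMAL 1
#define LP_DEAD   3

typedef struct { uint8_t lp_flags; } ItemIdData;

/* Sketch of the second heap pass: once the index entries referencing a
 * page's LP_DEAD slots are gone, the slots themselves can be recycled. */
static int
clear_dead_items(ItemIdData *items, int nitems)
{
    int reclaimed = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (items[i].lp_flags == LP_DEAD)
        {
            items[i].lp_flags = LP_UNUSED;
            reclaimed++;
        }
    }
    return reclaimed;
}
```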
Nonetheless, lazy_vacuum_heap() retains the ability to remove actual
HEAPTUPLE_DEAD tuples and reclaim their LP_NORMAL line pointers. This support
gets exercised only in the scenario described in the above comment. For hot
standby, this capability requires its own WAL record, XLOG_HEAP2_CLEANUP_INFO,
to generate the necessary conflicts. There is a bug in lazy_scan_heap()'s
bookkeeping for the xid to place in that WAL record. Each call to
heap_page_prune() simply overwrites vacrelstats->latestRemovedXid, but
lazy_scan_heap() expects it to only ever increase the value. I have
attached a minimal fix to be backpatched. It has lazy_scan_heap() ignore
heap_page_prune()'s actions for the purpose of this conflict xid, because
heap_page_prune() emitted an XLOG_HEAP2_CLEAN record covering them.
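Roughly, the increase-only expectation amounts to keeping a monotonic maximum rather than overwriting. A simplified sketch (this xid_follows ignores the wraparound handling the real TransactionIdFollows in access/transam.h performs):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified comparison; the real TransactionIdFollows handles xid
 * wraparound via modular arithmetic. */
static int
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/* What lazy_scan_heap() expects: latestRemovedXid never moves backward.
 * The bug was an unconditional overwrite on every heap_page_prune() call. */
static void
update_latest_removed(TransactionId *latest_removed, TransactionId page_xid)
{
    if (xid_follows(page_xid, *latest_removed))
        *latest_removed = page_xid;
}
```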
At that point in the investigation, I realized that the cost of being able to
remove entire tuples in lazy_vacuum_heap() greatly exceeds the benefit.
Again, the benefit is being able to remove tuples whose inserting transaction
aborted between the HeapTupleSatisfiesVacuum() call in heap_page_prune() and
the one in lazy_scan_heap(). To make that possible, lazy_vacuum_heap() grabs
a cleanup lock, calls PageRepairFragmentation(), and emits a WAL record for
every page containing LP_DEAD line pointers or HEAPTUPLE_DEAD tuples. If we
take it out of the business of removing tuples, lazy_vacuum_heap() can skip
WAL and update lp_flags under a mere shared lock. The second attached patch,
for HEAD, implements that. Besides optimizing things somewhat, it simplifies
the code and removes rarely-tested branches. (This patch supersedes the
backpatch-oriented patch rather than stacking with it.)
The bookkeeping behind the "page containing dead tuples is marked as
all-visible in relation" warning is also faulty; it only fires when
lazy_scan_heap() saw the HEAPTUPLE_DEAD tuple; again, heap_page_prune() will
be the one to see it in almost every case. I changed the warning to fire
whenever the page cannot be marked all-visible for a reason other than the
presence of too-recent live tuples.
I considered renaming lazy_vacuum_heap() to lazy_heap_clear_dead_items(),
reflecting its narrower role. Ultimately, I left function names unchanged.
This patch conflicts textually with Pavan's "Setting visibility map in
VACUUM's second phase" patch, but I don't see any conceptual incompatibility.
I can't give a simple statement of the performance improvement here. The
XLOG_HEAP2_CLEAN record is fairly compact, so the primary benefit of avoiding
it is the possibility of avoiding a full-page image. For example, if a
checkpoint lands just before the VACUUM and again during the index-cleaning
phase (assume just one such phase in this example), this patch reduces
heap-related WAL volume by almost 50%. Improvements as small as 2% and as
large as 97% are possible given other checkpoint timings relative to the
VACUUM. In
general, expect this to help VACUUMs spanning several checkpoint cycles more
than it helps shorter VACUUMs. I have attached a script I used as a reference
workload for testing different checkpoint timings. There should also be some
improvement from keeping off WALInsertLock, not requiring WAL flushes to evict
from the ring buffer during the lazy_vacuum_heap() phase, and not taking a
second buffer cleanup lock. I did not attempt to quantify those.
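To make the full-page-image effect concrete, here is a back-of-envelope model; the sizes are assumptions for illustration (default BLCKSZ, plus a rough guess at a compact record), not measurements:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: a default-BLCKSZ full-page image, and a rough figure
 * for a compact XLOG_HEAP2_CLEAN record. */
#define FPI_BYTES    8192
#define RECORD_BYTES 64

/* WAL attributable to the second heap pass over npages.  If a checkpoint
 * fell between the two passes, each page's record drags in a full-page
 * image; skipping the record skips the image too. */
static size_t
second_pass_wal(int npages, int checkpoint_between_passes)
{
    size_t per_page = RECORD_BYTES;

    if (checkpoint_between_passes)
        per_page += FPI_BYTES;
    return (size_t) npages * per_page;
}
```

Under those assumed sizes, a checkpoint between the passes makes the second pass's WAL roughly match the first pass's, which is where the near-50% reduction comes from.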
Normally, heap_page_prune() removes the tuple first (leaving an LP_DEAD
line pointer), and vacuumlazy.c removes index entries afterward. When the
removal happens in this order, the XLOG_HEAP2_CLEAN record takes care of
conflicts. However, in the rarely-used code path, we remove the index entries
before removing the tuple. XLOG_HEAP2_CLEANUP_INFO conflicts with standby
snapshots that might need the vanishing index entries.
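In other words, the cleanup-info record exists so the standby can apply the usual snapshot-conflict test before the index entries vanish. Schematically, and again with a wraparound-free xid comparison standing in for the real one:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Simplified comparison; the real TransactionIdFollows (access/transam.h)
 * handles xid wraparound. */
static int
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/* A standby backend must be cancelled if its snapshot could still need
 * the index entries being removed, i.e. if its xmin has not advanced past
 * the latestRemovedXid carried by XLOG_HEAP2_CLEANUP_INFO. */
static int
snapshot_conflicts(TransactionId snapshot_xmin, TransactionId latest_removed)
{
    return !xid_follows(snapshot_xmin, latest_removed);
}
```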