Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2021-11-22 19:29:56
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote:
> I have heard many stories about anti-wraparound/aggressive VACUUMs
> where the cure (which suddenly made autovacuum workers
> non-cancellable) was worse than the disease (not actually much danger
> of wraparound failure). For example:
> Yes, this problem report is from 2015, which is before we even had the
> freeze map stuff. I still think that the point about aggressive
> VACUUMs blocking DDL (leading to chaos) remains valid.

As I noted below, I think this is a bit of a separate issue than what your
changes address in this patch.

> There is another interesting area of future optimization within
> VACUUM, that also seems relevant to this patch: the general idea of
> *avoiding* pruning during VACUUM, when it just doesn't make sense to
> do so -- better to avoid dirtying the page for now. Needlessly pruning
> inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
> with heap fill factor reduced to 95) will have autovacuums that
> *constantly* do it (granted, it may not matter so much there because
> VACUUM is unlikely to re-dirty the page anyway).

Hm. I'm a bit doubtful that there's all that many cases where it's worth not
pruning during vacuum. However, it seems much more common for opportunistic
pruning during non-write accesses.

Perhaps checking whether we'd log an FPW would be a better criteria for
deciding whether to prune or not compared to whether we're dirtying the page?
IME the WAL volume impact of FPWs is a considerably bigger deal than
unnecessarily dirtying a page that has previously been dirtied in the same
checkpoint "cycle".

> This patch seems relevant to that area because it recognizes that pruning
> during VACUUM is not necessarily special -- a new function called
> lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a
> cleanup lock cannot be acquired). These pages are nevertheless considered
> fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable
> to round up to 100% true).

IDK, the potential of not having usable space on an overfly fragmented page
doesn't seem that low. We can't just mark such pages as all-visible because
then we'll potentially never reclaim that space.

> Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
> relfrozenxid, we now make non-aggressive VACUUMs work just a little
> harder in order to make that desirable outcome more likely in practice.
> Aggressive VACUUMs have long checked contended pages with only a shared
> lock, to avoid needlessly waiting on a cleanup lock (in the common case
> where the contended page has no tuples that need to be frozen anyway).
> We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
> course -- if we did that they'd no longer be non-aggressive.

IMO the big difference between aggressive / non-aggressive isn't whether we
wait for a cleanup lock, but that we don't skip all-visible pages...

> But we now make the non-aggressive case notice that a failure to acquire a
> cleanup lock on one particular heap page does not in itself make it unsafe
> to advance relfrozenxid for the whole relation (which is what we usually see
> in the aggressive case already).
> This new relfrozenxid optimization might not be all that valuable on its
> own, but it may still facilitate future work that makes non-aggressive
> VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
> rather than later. In general it would be useful for non-aggressive
> VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
> a cleanup lock once or twice if needed).

What do you mean by "waiting once or twice"? A single wait may simply never
end on a busy page that's constantly pinned by a lot of backends...

> It would also be generally useful if aggressive VACUUMs were "less
> aggressive" opportunistically (e.g. by being responsive to query
> cancellations when the risk of wraparound failure is still very low).

Being canceleable is already a different concept than anti-wraparound
vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
anti-wrap only at autovacuum_freeze_max_age. The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.

This is one of the most embarassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.

> We now also collect LP_DEAD items in the dead_tuples array in the case
> where we cannot immediately get a cleanup lock on the buffer. We cannot
> prune without a cleanup lock, but opportunistic pruning may well have
> left some LP_DEAD items behind in the past -- no reason to miss those.

This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.

> Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
> technique is independently capable of cleaning up line pointer bloat),

One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.

I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constraints the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line item in common cases in a number of
realistic / common scenarios would be quite the win.

> Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
> since there is no barely any real practical sense in which we actually
> miss doing useful work for these pages. Besides, this information
> always seemed to have little practical value, even to Postgres hackers.

-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.

I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.

Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.

> + /*
> + * Aggressive VACUUM (which is the same thing as anti-wraparound
> + * autovacuum for most practical purposes) exists so that we'll reliably
> + * advance relfrozenxid and relminmxid sooner or later. But we can often
> + * opportunistically advance them even in a non-aggressive VACUUM.
> + * Consider if that's possible now.

I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.

> + * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
> + * the rel_pages used by lazy_scan_prune, from before a possible relation
> + * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
> + */

I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...

> + if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
> + !vacrel->freeze_cutoffs_valid)
> + {
> + /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
> + Assert(!aggressive);
> + vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> + new_rel_allvisible, vacrel->nindexes > 0,
> + InvalidTransactionId, InvalidMultiXactId, false);
> + }
> + else
> + {
> + /* Can safely advance relfrozen and relminmxid, too */
> + Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
> + orig_rel_pages);
> + vac_update_relstats(rel, new_rel_pages, new_live_tuples,
> + new_rel_allvisible, vacrel->nindexes > 0,
> + FreezeLimit, MultiXactCutoff, false);
> + }

I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.

> @@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
> bool all_visible_according_to_vm = false;
> LVPagePruneState prunestate;
> - /*
> - * Consider need to skip blocks. See note above about forcing
> - * scanning of last page.
> - */
> -#define FORCE_CHECK_PAGE() \
> - (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
> -
> pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
> update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
> blkno, InvalidOffsetNumber);
> + /*
> + * Consider need to skip blocks
> + */
> if (blkno == next_unskippable_block)
> {
> /* Time to advance next_unskippable_block */
> @@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
> else
> {
> /*
> - * The current block is potentially skippable; if we've seen a
> - * long enough run of skippable blocks to justify skipping it, and
> - * we're not forced to check it, then go ahead and skip.
> - * Otherwise, the page must be at least all-visible if not
> - * all-frozen, so we can set all_visible_according_to_vm = true.
> + * The current block can be skipped if we've seen a long enough
> + * run of skippable blocks to justify skipping it.
> + *
> + * There is an exception: we will scan the table's last page to
> + * determine whether it has tuples or not, even if it would
> + * otherwise be skipped (unless it's clearly not worth trying to
> + * truncate the table). This avoids having lazy_truncate_heap()
> + * take access-exclusive lock on the table to attempt a truncation
> + * that just fails immediately because there are tuples in the
> + * last page.
> */
> - if (skipping_blocks && !FORCE_CHECK_PAGE())
> + if (skipping_blocks &&
> + !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
> {
> /*
> * Tricky, tricky. If this is in aggressive vacuum, the page

I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.

> @@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
> buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
> RBM_NORMAL, vacrel->bstrategy);
> + page = BufferGetPage(buf);
> + vacrel->scanned_pages++;

I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.

> /*
> - * We need buffer cleanup lock so that we can prune HOT chains and
> - * defragment the page.
> + * We need a buffer cleanup lock to prune HOT chains and defragment
> + * the page in lazy_scan_prune. But when it's not possible to acquire
> + * a cleanup lock right away, we may be able to settle for reduced
> + * processing in lazy_scan_noprune.
> */

s/in lazy_scan_noprune/via lazy_scan_noprune/?

> if (!ConditionalLockBufferForCleanup(buf))
> {
> bool hastup;
> - /*
> - * If we're not performing an aggressive scan to guard against XID
> - * wraparound, and we don't want to forcibly check the page, then
> - * it's OK to skip vacuuming pages we get a lock conflict on. They
> - * will be dealt with in some future vacuum.
> - */
> - if (!aggressive && !FORCE_CHECK_PAGE())
> + LockBuffer(buf, BUFFER_LOCK_SHARE);
> +
> + /* Check for new or empty pages before lazy_scan_noprune call */
> + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
> + vmbuffer))
> {
> - ReleaseBuffer(buf);
> - vacrel->pinskipped_pages++;
> + /* Lock and pin released for us */
> + continue;
> + }

Why isn't this done in lazy_scan_noprune()?

> + if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
> + {
> + /* No need to wait for cleanup lock for this page */
> + UnlockReleaseBuffer(buf);
> + if (hastup)
> + vacrel->nonempty_pages = blkno + 1;
> continue;
> }

Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes, if so, could we add an assertion that
BufferGetBockNumber(buf) == blkno?

> + /* Check for new or empty pages before lazy_scan_prune call */
> + if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
> {

Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?

> +/*
> + * Empty pages are not really a special case -- they're just heap pages that
> + * have no allocated tuples (including even LP_UNUSED items). You might
> + * wonder why we need to handle them here all the same. It's only necessary
> + * because of a rare corner-case involving a hard crash during heap relation
> + * extension. If we ever make relation-extension crash safe, then it should
> + * no longer be necessary to deal with empty pages here (or new pages, for
> + * that matter).

I don't think it's actually that rare - the window for this is huge. You just
need to crash / immediate shutdown at any time between the relation having
been extended and the new page contents being written out (checkpoint or
buffer replacement / ring writeout). That's often many minutes.

I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.

> + * Caller can either hold a buffer cleanup lock on the buffer, or a simple
> + * shared lock.
> + */

Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?

> +static bool
> +lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
> + Page page, bool sharelock, Buffer vmbuffer)

It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.

> +/*
> + * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
> + *
> + * Caller need only hold a pin and share lock on the buffer, unlike
> + * lazy_scan_prune, which requires a full cleanup lock.

I'd add somethign like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.

> + if (ItemIdIsRedirected(itemid))
> + {
> + *hastup = true; /* page won't be truncatable */
> + continue;
> + }

It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().

> +
> + /*
> + * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
> + *
> + * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
> + * and never has to deal with DEAD tuples directly (they
> + * reliably become LP_DEAD items through pruning). Our
> + * approach to DEAD tuples is a bit arbitrary, but it seems
> + * better than totally ignoring them.
> + */
> + new_dead_tuples++;
> + break;

Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead bot not yet removable".

> + /*
> + * Now save details of the LP_DEAD items from the page in the dead_tuples
> + * array iff VACUUM uses two-pass strategy case
> + */

Do we really need to have separate code for this in lazy_scan_prune() and

> + }
> + else
> + {
> + /*
> + * We opt to skip FSM processing for the page on the grounds that it
> + * is probably being modified by concurrent DML operations. Seems
> + * best to assume that the space is best left behind for future
> + * updates of existing tuples. This matches what opportunistic
> + * pruning does.

Why can we assume that there concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)

I think this is a change mostly in the right direction. But as formulated this
commit does *WAY* too much at once.


Andres Freund

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2021-11-22 19:50:42 Re: Sequence's value can be rollback after a crashed recovery.
Previous Message Bossart, Nathan 2021-11-22 19:29:37 Re: Improving psql's \password command