Re: New strategies for freezing, advancing relfrozenxid early

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: New strategies for freezing, advancing relfrozenxid early
Date: 2022-11-16 05:20:23
Message-ID: 20221116052023.xg7pwgvx2t3hrgaj@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-15 19:02:12 -0800, Peter Geoghegan wrote:
> From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Sun, 12 Jun 2022 15:46:08 -0700
> Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.
>
> Teach VACUUM to decide on whether or not to trigger freezing at the
> level of whole heap pages, not individual tuple fields. OldestXmin is
> now treated as the cutoff for freezing eligibility in all cases, while
> FreezeLimit is used to trigger freezing at the level of each page (we
> now freeze all eligible XIDs on a page when freezing is triggered for
> the page).
>
> This approach decouples the question of _how_ VACUUM could/will freeze a
> given heap page (which of its XIDs are eligible to be frozen) from the
> question of whether it actually makes sense to do so right now.
>
> Just adding page-level freezing does not change all that much on its
> own: VACUUM will still typically freeze very lazily, since we're only
> forcing freezing of all of a page's eligible tuples when we decide to
> freeze at least one (on the basis of XID age and FreezeLimit). For now
> VACUUM still freezes everything almost as lazily as it always has.
> Later work will teach VACUUM to apply an alternative eager freezing
> strategy that triggers page-level freezing earlier, based on additional
> criteria.
> ---
> src/include/access/heapam.h | 42 +++++-
> src/backend/access/heap/heapam.c | 199 +++++++++++++++++----------
> src/backend/access/heap/vacuumlazy.c | 95 ++++++++-----
> 3 files changed, 222 insertions(+), 114 deletions(-)
>
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index ebe723abb..ea709bf1b 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
> OffsetNumber offset;
> } HeapTupleFreeze;
>
> +/*
> + * State used by VACUUM to track what the oldest extant XID/MXID will become
> + * when determing whether and how to freeze a page's heap tuples via calls to
> + * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?

> + * The relfrozenxid_out and relminmxid_out fields are the current target
> + * relfrozenxid and relminmxid for VACUUM caller's heap rel. Any and all

"VACUUM caller's heap rel." could stand to be rephrased.

> + * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
> + * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
> + * This includes XIDs that remain as MultiXact members from any tuple's xmax.
> + * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
> + * relminmxid_out as needed to avoid unsafe values in rel's authoritative
> + * pg_class tuple.
> + *
> + * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
> + * relminmxid_nofreeze_out must also be maintained for !freeze pages.
> + */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

I think it might be better to just always maintain the nofreeze state.

> +typedef struct HeapPageFreeze
> +{
> + /* Is heap_prepare_freeze_tuple caller required to freeze page? */
> + bool freeze;

s/freeze/freeze_required/?

> + /* Values used when page is to be frozen based on freeze plans */
> + TransactionId relfrozenxid_out;
> + MultiXactId relminmxid_out;
> +
> + /* Used by caller for '!freeze' pages */
> + TransactionId relfrozenxid_nofreeze_out;
> + MultiXactId relminmxid_nofreeze_out;
> +
> +} HeapPageFreeze;
> +

Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
more of them in via HeapPageFreeze?
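
E.g. something like this (just a sketch with invented field names, also
folding in the freeze_required rename and the always-maintained nofreeze
trackers suggested above):

    typedef struct HeapPageFreeze
    {
        /* Cutoffs, set up once by caller, before the page's first tuple */
        TransactionId cutoff_xid;    /* VACUUM's OldestXmin */
        MultiXactId   cutoff_multi;  /* VACUUM's OldestMxact */
        TransactionId limit_xid;     /* VACUUM's FreezeLimit */
        MultiXactId   limit_multi;   /* VACUUM's MultiXactCutoff */

        /* Output: is caller required to freeze the page? */
        bool          freeze_required;

        /* Output: trackers, both variants maintained on every call */
        TransactionId relfrozenxid_out;
        MultiXactId   relminmxid_out;
        TransactionId relfrozenxid_nofreeze_out;
        MultiXactId   relminmxid_nofreeze_out;
    } HeapPageFreeze;

    extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
                                          TransactionId relfrozenxid,
                                          TransactionId relminmxid,
                                          HeapTupleFreeze *frz,
                                          bool *totally_frozen,
                                          HeapPageFreeze *pagefrz);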

> /* ----------------
> * function prototypes for heap access method
> *
> @@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
> extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> TransactionId relfrozenxid, TransactionId relminmxid,
> TransactionId cutoff_xid, TransactionId cutoff_multi,
> + TransactionId limit_xid, MultiXactId limit_multi,
> HeapTupleFreeze *frz, bool *totally_frozen,
> - TransactionId *relfrozenxid_out,
> - MultiXactId *relminmxid_out);
> + HeapPageFreeze *xtrack);

What does 'xtrack' stand for? Xid Tracking?

> * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
> * returned true for when called. A later heap_freeze_execute_prepared call
> - * will execute freezing for caller's page as a whole.
> + * will execute freezing for caller's page as a whole. Caller should also
> + * initialize xtrack fields for page as a whole before calling here with first
> + * tuple for the page. See page_frozenxid_tracker comments.

s/should/need to/?

page_frozenxid_tracker appears to be a dangling pointer.

> + * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
> + * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)

Hm. Perhaps we should just rename them if it requires this kind of
explanation? They're really not good names.

> @@ -6524,8 +6524,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> else
> {
> /* xmin to remain unfrozen. Could push back relfrozenxid_out. */
> - if (TransactionIdPrecedes(xid, *relfrozenxid_out))
> - *relfrozenxid_out = xid;
> + if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
> + xtrack->relfrozenxid_out = xid;
> }
> }

Could use TransactionIdOlder().
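
I.e.:

    xtrack->relfrozenxid_out = TransactionIdOlder(xid, xtrack->relfrozenxid_out);

(IIRC TransactionIdOlder() currently is a static helper in procarray.c, so
it'd have to be exported somewhere first.)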

> @@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> */
> Assert(!freeze_xmax);
> Assert(TransactionIdIsValid(newxmax));
> - if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
> - *relfrozenxid_out = newxmax;
> + Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
> + &xtrack->relfrozenxid_nofreeze_out,
> + &xtrack->relminmxid_nofreeze_out));
> + if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
> + xtrack->relfrozenxid_out = newxmax;

Perhaps the Assert(heap_tuple_would_freeze()) bit could be handled once at the
end of the routine, for all paths?
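
I.e. something like this, once, at the bottom (untested):

    #ifdef USE_ASSERT_CHECKING
        /* If we concluded freezing is required, the limits must agree */
        if (xtrack->freeze)
            Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
                                           &xtrack->relfrozenxid_nofreeze_out,
                                           &xtrack->relminmxid_nofreeze_out));
    #endif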

> @@ -6731,18 +6751,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> else
> frz->frzflags |= XLH_FREEZE_XVAC;
>
> - /*
> - * Might as well fix the hint bits too; usually XMIN_COMMITTED
> - * will already be set here, but there's a small chance not.
> - */
> + /* Set XMIN_COMMITTED defensively */
> Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
> frz->t_infomask |= HEAP_XMIN_COMMITTED;
> +
> + /*
> + * Force freezing any page with an xvac to keep things simple.
> + * This allows totally_frozen tracking to ignore xvac.
> + */
> changed = true;
> + xtrack->freeze = true;
> }
> }

Oh - I totally didn't realize that ->freeze is an out parameter. Seems a bit
odd to have the other fields suffixed with _out but not this one?

> @@ -6786,13 +6824,13 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
> */
> void
> heap_freeze_execute_prepared(Relation rel, Buffer buffer,
> - TransactionId FreezeLimit,
> + TransactionId OldestXmin,
> HeapTupleFreeze *tuples, int ntuples)
> {
> Page page = BufferGetPage(buffer);
>
> Assert(ntuples > 0);
> - Assert(TransactionIdIsValid(FreezeLimit));
> + Assert(TransactionIdIsValid(OldestXmin));
>
> START_CRIT_SECTION();
>
> @@ -6822,11 +6860,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
>
> /*
> * latestRemovedXid describes the latest processed XID, whereas
> - * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
> - * Back up caller's FreezeLimit to avoid false conflicts when
> - * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
> + * OldestXmin is the first XID not frozen by VACUUM. Back up caller's
> + * OldestXmin to avoid false conflicts.
> */
> - latestRemovedXid = FreezeLimit;
> + latestRemovedXid = OldestXmin;
> TransactionIdRetreat(latestRemovedXid);
>
> xlrec.latestRemovedXid = latestRemovedXid;

Won't using OldestXmin instead of FreezeLimit potentially cause additional
conflicts? Is there any reason to not compute an accurate value?
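
E.g. heap_prepare_freeze_tuple() could track the newest XID that's actually
affected by any of the page's freeze plans, and heap_freeze_execute_prepared()
could use that, instead of deriving the horizon from a cutoff. Sketch (the
field is invented):

    /* in HeapPageFreeze, updated whenever a plan freezes an XID: */
    TransactionId latest_frozen_xid;    /* invented */

    ...
    if (TransactionIdFollows(xid, xtrack->latest_frozen_xid))
        xtrack->latest_frozen_xid = xid;
    ...

    /* heap_freeze_execute_prepared() then logs an exact horizon: */
    xlrec.latestRemovedXid = xtrack->latest_frozen_xid;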

> @@ -1634,27 +1639,23 @@ retry:
> continue;
> }
>
> - /*
> - * LP_DEAD items are processed outside of the loop.
> - *
> - * Note that we deliberately don't set hastup=true in the case of an
> - * LP_DEAD item here, which is not how count_nondeletable_pages() does
> - * it -- it only considers pages empty/truncatable when they have no
> - * items at all (except LP_UNUSED items).
> - *
> - * Our assumption is that any LP_DEAD items we encounter here will
> - * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
> - * call count_nondeletable_pages(). In any case our opinion of
> - * whether or not a page 'hastup' (which is how our caller sets its
> - * vacrel->nonempty_pages value) is inherently race-prone. It must be
> - * treated as advisory/unreliable, so we might as well be slightly
> - * optimistic.
> - */
> if (ItemIdIsDead(itemid))
> {
> + /*
> + * Delay unsetting all_visible until after we have decided on
> + * whether this page should be frozen. We need to test "is this
> + * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
> + * in final heap pass?" to reach a decision. all_visible will be
> + * unset before we return, as required by lazy_scan_heap caller.
> + *
> + * Deliberately don't set hastup for LP_DEAD items. We make the
> + * soft assumption that any LP_DEAD items encountered here will
> + * become LP_UNUSED later on, before count_nondeletable_pages is
> + * reached. Whether the page 'hastup' is inherently race-prone.
> + * It must be treated as unreliable by caller anyway, so we might
> + * as well be slightly optimistic about it.
> + */
> deadoffsets[lpdead_items++] = offnum;
> - prunestate->all_visible = false;
> - prunestate->has_lpdead_items = true;
> continue;
> }

What does this have to do with the rest of the commit? And why are we doing
this?

> @@ -1782,11 +1783,13 @@ retry:
> if (heap_prepare_freeze_tuple(tuple.t_data,
> vacrel->relfrozenxid,
> vacrel->relminmxid,
> + vacrel->OldestXmin,
> + vacrel->OldestMxact,
> vacrel->FreezeLimit,
> vacrel->MultiXactCutoff,
> &frozen[tuples_frozen],
> &tuple_totally_frozen,
> - &NewRelfrozenXid, &NewRelminMxid))
> + &xtrack))
> {
> /* Save prepared freeze plan for later */
> frozen[tuples_frozen++].offset = offnum;
> @@ -1807,9 +1810,33 @@ retry:
> * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
> * that remains and needs to be considered for freezing now (LP_UNUSED and
> * LP_REDIRECT items also remain, but are of no further interest to us).
> + *
> + * Freeze the page when heap_prepare_freeze_tuple indicates that at least
> + * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
> */
> - vacrel->NewRelfrozenXid = NewRelfrozenXid;
> - vacrel->NewRelminMxid = NewRelminMxid;
> + if (xtrack.freeze || tuples_frozen == 0)
> + {
> + /*
> + * We're freezing the page. Our final NewRelfrozenXid doesn't need to
> + * be affected by the XIDs that are just about to be frozen anyway.

Seems quite confusing to enter a block described as "We're freezing the
page." when we're not freezing anything (tuples_frozen == 0).

> + * Note: although we're freezing all eligible tuples on this page, we
> + * might not need to freeze anything (might be zero eligible tuples).
> + */
> + vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
> + vacrel->NewRelminMxid = xtrack.relminmxid_out;
> + freeze_all_eligible = true;

I don't really get what freeze_all_eligible is trying to do.

> #ifdef USE_ASSERT_CHECKING
> /* Note that all_frozen value does not matter when !all_visible */
> - if (prunestate->all_visible)
> + if (prunestate->all_visible && lpdead_items == 0)
> {
> TransactionId cutoff;
> bool all_frozen;
> @@ -1849,8 +1876,7 @@ retry:
> if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
> Assert(false);

Not related to this change, but why isn't this just
Assert(heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))?

> From 8f3b6237affda15101ffb0b88787bfd6bb92e32f Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Mon, 18 Jul 2022 14:35:44 -0700
> Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.
>
> Acquire an in-memory immutable "snapshot" of the target rel's visibility
> map at the start of each VACUUM, and use the snapshot to determine when
> and how VACUUM will skip pages.

This should include a description of the memory usage effects.

> This has significant advantages over the previous approach of using the
> authoritative VM fork to decide on which pages to skip. The number of
> heap pages processed will no longer increase when some other backend
> concurrently modifies a skippable page, since VACUUM will continue to
> see the page as skippable (which is correct because the page really is
> still skippable "relative to VACUUM's OldestXmin cutoff").

Why is it an advantage for the number of pages to not increase?

> It also
> gives VACUUM reliable information about how many pages will be scanned,
> before its physical heap scan even begins. That makes it easier to
> model the costs that VACUUM incurs using a top-down, up-front approach.
>
> Non-aggressive VACUUMs now make an up-front choice about VM skipping
> strategy: they decide whether to prioritize early advancement of
> relfrozenxid (eager behavior) over avoiding work by skipping all-visible
> pages (lazy behavior). Nothing about the details of how lazy_scan_prune
> freezes changes just yet, though a later commit will add the concept of
> freezing strategies.
>
> Non-aggressive VACUUMs now explicitly commit to (or decide against)
> early relfrozenxid advancement up-front.

Why?

> VACUUM will now either scan
> every all-visible page, or none at all. This replaces lazy_scan_skip's
> SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
> relfrozenxid advancement (see commit bf136cf6), but left many of the
> details to chance.

The main goal according to bf136cf6 was to avoid defeating OS readahead, so I
think it should be mentioned here.

To me this is something that ought to be changed separately from the rest of
this commit.

> TODO: We don't spill VM snapshots to disk just yet (resource management
> aspects of VM snapshots still need work). For now a VM snapshot is just
> a copy of the VM pages stored in local buffers allocated by palloc().

HEAPBLOCKS_PER_PAGE is 32672 with the defaults. The maximum relation size is
2**32 - 1 blocks. So the max VM size is ceil((2**32 - 1) / 32672) = 131458
pages, a bit more than 1GB. Is that correct?

For large relations that are already nearly all-frozen this does add a
noticeable amount of overhead, whether spilled to disk or not. Of course
they're also not going to be vacuumed super often, but ...

Perhaps worth turning the VM into a range-based description for the snapshot,
given it's a read-only data structure in local memory? And we don't
necessarily need both the all-frozen and all-visible bits in memory; one
should suffice? We don't even need random access, so it could easily be
allocated incrementally, rather than as one large allocation.

Hard to imagine anybody having a multi-TB table without "runs" of
all-visible/all-frozen. I don't think it'd be worth worrying about patterns
that'd be inefficient in a range representation.
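
Sketch of what I mean (names invented):

    /* one maximal run of consecutive heap blocks VACUUM may skip */
    typedef struct VMSnapRun
    {
        BlockNumber start_blkno;    /* first heap block in run */
        BlockNumber nblocks;        /* run length */
        uint8       flags;          /* all-visible and/or all-frozen */
    } VMSnapRun;

    typedef struct VMSnap
    {
        int         nruns;
        int         maxruns;        /* grown via repalloc() while appending */
        VMSnapRun  *runs;           /* in block order, consumed sequentially */
    } VMSnap;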

> + /*
> + * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
> + * headers to be able to safely advance relfrozenxid later on. There is
> + * no good reason to scan any additional pages. (Actually we might opt to
> + * skip all-visible pages. Either way we won't scan pages for no reason.)
> + *
> + * Now that OldestXmin and rel_pages are acquired, acquire an immutable
> + * snapshot of the visibility map as well. lazy_scan_skip works off of
> + * the vmsnap, not the authoritative VM, which can continue to change.
> + * Pages that lazy_scan_heap will scan are fixed and known in advance.

Hm. It's a bit sad to compute the snapshot after determining OldestXmin.

We probably should refresh OldestXmin periodically. That won't allow us to get
a more aggressive relfrozenxid, but it'd allow us to remove more gunk.
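
Sketch (interval invented; the refreshed horizon could only be used for
pruning decisions, not for the relfrozenxid/freezing cutoffs established at
the start):

    #define PRUNE_HORIZON_REFRESH_INTERVAL  4096    /* blocks, made up */

        if (blkno % PRUNE_HORIZON_REFRESH_INTERVAL == 0)
        {
            TransactionId fresh;

            fresh = GetOldestNonRemovableTransactionId(vacrel->rel);
            if (TransactionIdFollows(fresh, vacrel->PruneXmin))  /* invented field */
                vacrel->PruneXmin = fresh;
        }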

> + *
> + * The exact number of pages that lazy_scan_heap will scan also depends on
> + * our choice of skipping strategy. VACUUM can either choose to skip any
> + * all-visible pages lazily, or choose to scan those same pages instead.

What does it mean to "skip lazily"?

> + /*
> + * Visibility map page copied to local buffer for caller's snapshot.
> + * Caller requires an exact count of all-visible and all-frozen blocks
> + * in the heap relation. Handle that now.

This part of the comment seems like it actually belongs further down?

> + * Must "truncate" our local copy of the VM to avoid incorrectly
> + * counting heap pages >= rel_pages as all-visible/all-frozen. Handle
> + * this by clearing irrelevant bits on the last VM page copied.
> + */

Hm - why would those bits already be set?

> + map = PageGetContents(localvmpage);
> + if (mapBlock == mapBlockLast)
> + {
> + /* byte and bit for first heap page not to be scanned by VACUUM */
> + uint32 truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
> + uint8 truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
> +
> + if (truncByte != 0 || truncOffset != 0)
> + {
> + /* Clear any bits set for heap pages >= rel_pages */
> + MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
> + map[truncByte] &= (1 << truncOffset) - 1;
> + }
> +
> + /* Now it's safe to tally bits from this final VM page below */
> + }
> +
> + /* Tally the all-visible and all-frozen counts from this page */
> + umap = (uint64 *) map;
> + for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
> + {
> + *all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
> + *all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
> + }
> + }
> +
> + return vmsnap;
> +}

> From 4f5969932451869f0f28295933c28de49a22fdf2 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Mon, 18 Jul 2022 15:13:27 -0700
> Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.
>
> Avoid large build-ups of all-visible pages by making non-aggressive
> VACUUMs freeze pages proactively for VACUUMs/tables where eager
> vacuuming is deemed appropriate. Use of the eager strategy (an
> alternative to the classic lazy freezing strategy) is controlled by a
> new GUC, vacuum_freeze_strategy_threshold (and an associated
> autovacuum_* reloption). Tables whose rel_pages are >= the cutoff will
> have VACUUM use the eager freezing strategy.

What's the logic behind a hard threshold? Suddenly freezing everything on a
huge relation seems problematic. I realize that never getting all that far
behind is part of the theory, but I don't think that's always going to work.

Wouldn't a better strategy be to freeze a percentage of the relation on every
non-aggressive vacuum? That way the amount of work for an eventual aggressive
vacuum will shrink, without causing individual vacuums to take extremely long.
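
I.e. roughly (sketch; the GUC and the counter are invented):

    /* vacuum_freeze_fraction: invented GUC, 0.0 .. 1.0 */
    BlockNumber freeze_budget;

    freeze_budget = (BlockNumber) (vacrel->rel_pages * vacuum_freeze_fraction);

    /* in lazy_scan_prune(), when the page would become all-visible: */
    if (xtrack.freeze_required ||           /* FreezeLimit still forces it */
        vacrel->pages_frozen < freeze_budget)
    {
        /* freeze all eligible tuples on this page */
        vacrel->pages_frozen++;
    }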

> When the eager strategy is in use, lazy_scan_prune will trigger freezing
> a page's tuples at the point that it notices that it will at least
> become all-visible -- it can be made all-frozen instead. We still
> respect FreezeLimit, though: the presence of any XID < FreezeLimit also
> triggers page-level freezing (just as it would with the lazy strategy).

The other thing that I think would be good to use is a) whether the page is
already in s_b, and b) whether the page already is dirty. The cost of freezing
shrinks significantly if it doesn't cause an additional read + write. And that
additional IO is IMO one of the major concerns with freezing much more
aggressively in OLTPish workloads where a lot of the rows won't ever get old
enough to need freezing.
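
Roughly (sketch; BufferIsDirty() is invented, there's no public API for that
today, and page_would_become_all_frozen is a placeholder):

    /*
     * Opportunistically freeze when it can't cost an extra read or write:
     * the buffer is already dirty, and WAL-logging the freeze wouldn't
     * have to emit a full-page image.
     */
    if (xtrack.freeze_required ||
        (page_would_become_all_frozen &&    /* placeholder */
         BufferIsDirty(buf) &&              /* invented */
         !XLogCheckBufferNeedsBackup(buf)))
        freeze_page = true;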

> From f2066c8ca5ba1b6f31257a36bb3dd065ecb1e3d4 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Mon, 5 Sep 2022 17:46:34 -0700
> Subject: [PATCH v6 4/6] Make VACUUM's aggressive behaviors continuous.
>
> The concept of aggressive/scan_all VACUUM dates back to the introduction
> of the visibility map in Postgres 8.4. Before then, every lazy VACUUM
> was "equally aggressive": each operation froze whatever tuples before
> the age-wise cutoff needed to be frozen. And each table's relfrozenxid
> was updated at the end. In short, the previous behavior was much less
> efficient, but did at least have one thing going for it: it was much
> easier to understand at a high level.
>
> VACUUM no longer applies a separate mode of operation (aggressive mode).
> There are still antiwraparound autovacuums, but they're now little more
> than another way that autovacuum.c can launch an autovacuum worker to
> run VACUUM.

The most significant aspect of anti-wrap autovacuums right now is that they
don't auto-cancel. Is that still used? If so, what's the threshold?

IME one of the most common reasons for autovac not keeping up is that the
application occasionally acquires conflicting locks on one of the big
tables. Before reaching anti-wrap age, all autovacuums on that table get
cancelled before they get to update relfrozenxid. Once in that situation,
autovac really focuses only on that relation...

> Now every VACUUM might need to wait for a cleanup lock, though few will.
> It can only happen when required to advance relfrozenxid to no less than
> half way between the existing relfrozenxid and nextXID.

Where's that "halfway" bit coming from?

Isn't "half way between the relfrozenxid and nextXID" a problem for instances
with longrunning transactions? Wouldn't this mean that wait for every page if
relfrozenxid can't be advanced much because of a longrunning query or such?
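
FWIW I read that as (untested reconstruction, hand-waving over special XIDs
and wraparound):

    TransactionId nextXID = ReadNextTransactionId();
    uint32        span = (uint32) (nextXID - vacrel->relfrozenxid);
    TransactionId waitCutoff = vacrel->relfrozenxid + span / 2;

    /* only wait for cleanup locks when needed to keep NewRelfrozenXid >= waitCutoff */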

> From 51a863190f70c8baa6d04e3ffd06473843f3326d Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Sun, 31 Jul 2022 13:53:19 -0700
> Subject: [PATCH v6 5/6] Avoid allocating MultiXacts during VACUUM.
>
> Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
> teach it the difference between OldestXmin and FreezeLimit. Use this
> high-level context to intelligently avoid allocating new MultiXactIds
> during VACUUM operations. We should always prefer to avoid allocating
> new MultiXacts during VACUUM on general principle. VACUUM is the only
> mechanism that can claw back MultixactId space, so allowing VACUUM to
> consume MultiXactId space (for any reason) adds to the risk that the
> system will trigger the multiStopLimit wraparound protection mechanism.

Strictly speaking that's not quite true, you can also drop/truncate tables ;)

Greetings,

Andres Freund
