Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2022-02-25 22:00:12
Message-ID: CAH2-Wzmn2at6HHfo9Oc-P_HcAJSivK=xjGm_+VMHyFY6hJUboA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I am not a fan of the backstop terminology. It's still the reason we need to
> do freezing for correctness reasons.

Thanks for the review!

I'm not wedded to that particular terminology, but I think that we
need something like it. Open to suggestions.

How about limit-based? Something like that?

> It'd make more sense to me to turn it
> around and call the "non-backstop" freezing opportunistic freezing or such.

The problem with that scheme is that it leads to a world where
"standard freezing" is incredibly rare (it often literally never
happens), whereas "opportunistic freezing" is incredibly common. That
doesn't make much sense to me.

We tend to think of 50 million XIDs (the vacuum_freeze_min_age
default) as not being that many. But I think that it can be a huge
number, too. Worse, it's unpredictable -- I suspect that the time it
takes to burn through that many XIDs can change without very much
changing in the application, from the users' point of view. That's a
big part of the problem I'm trying to address -- freezing outside of
aggressive VACUUMs is way too rare (it might barely happen at all).
FreezeLimit/vacuum_freeze_min_age was designed at a time when there
was no visibility map at all, when it made somewhat more sense as the
thing that drives freezing.

Incidentally, this is part of the problem with anti-wraparound vacuums
and freezing debt -- the fact that some quite busy databases take
weeks or months to go through 50 million XIDs (or 200 million)
increases the pain of the eventual aggressive VACUUM. It's not
completely unbounded -- autovacuum_freeze_max_age is not 100% useless
here. But the extent to which that stuff bounds the debt can vary
enormously, for not-very-good reasons.

> > Whenever we opt to
> > "freeze a page", the new page-level algorithm *always* uses the most
> > recent possible XID and MXID values (OldestXmin and oldestMxact) to
> > decide what XIDs/XMIDs need to be replaced. That might sound like it'd
> > be too much, but it only applies to those pages that we actually
> > decide to freeze (since page-level characteristics drive everything
> > now). FreezeLimit is only one way of triggering that now (and one of
> > the least interesting and rarest).
>
> That largely makes sense to me and doesn't seem weird.

I'm very pleased that the main intuition behind 0002 makes sense to
you. That's a start, at least.

> I'm a tad concerned about replacing mxids that have some members that are
> older than OldestXmin but not older than FreezeLimit. It's not too hard to
> imagine that accelerating mxid consumption considerably. But we can probably,
> if not already done, special case that.

Let's assume for a moment that this is a real problem. I'm not sure
myself whether it is (it's complicated), but let's say that it is. The
problem may be more than offset by the positive impact on relminmxid
advancement. I have placed a large emphasis on enabling
relfrozenxid/relminmxid advancement in every non-aggressive VACUUM,
for a number of reasons -- this is one of them. Finding a way for
every VACUUM operation to end with "vacrel->scanned_pages +
vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
amount of relfrozenxid/relminmxid advancement possible in every
VACUUM) has a great deal of value.
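
To make that condition concrete, here is a minimal sketch of the check
I have in mind, using field names in the style of HEAD's vacuumlazy.c
(this is just an illustration, not code from the patch series):

static bool
can_advance_relfrozenxid(LVRelState *vacrel, BlockNumber orig_rel_pages)
{
    /*
     * relfrozenxid/relminmxid can only be advanced when every heap page
     * was either scanned or skipped as all-frozen -- only then do the
     * "oldest remaining XID/MXID" trackers reflect the entire table.
     * A single page skipped merely because it was all-visible (not
     * all-frozen) makes the trackers untrustworthy as final values.
     */
    return (vacrel->scanned_pages + vacrel->frozenskipped_pages ==
            orig_rel_pages);
}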

As I said recently on the "do only critical work during single-user
vacuum?" thread, why should databases that consume too many MXIDs do
so evenly, across all their tables? There are usually one or two large
tables, and many more smaller tables. I
think it's much more likely that the largest tables consume
approximately zero MultiXactIds in these databases -- actual
MultiXactId consumption is probably concentrated in just one or two
smaller tables (even when we burn through MultiXacts very quickly).
But we don't recognize these kinds of distinctions at all right now.

Under these conditions, the patch series gives us many more
opportunities to advance relminmxid for most of the tables (including
the larger tables) all the way up to current-oldestMxact -- without
needing to freeze *any* MultiXacts early (just some XIDs) to get that
benefit. The patch series is not just about spreading the burden of
freezing so that non-aggressive VACUUMs freeze more -- it also makes
relfrozenxid and relminmxid more recent, and therefore *reliable*
indicators of which tables *really* have wraparound problems.

Does that make sense to you? This kind of "virtuous cycle" seems
really important to me. It's a subtle point, so I have to ask.

> > It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
> > freezing old ones) in large part so it can NOT freeze XIDs that it
> > would have been useful (and much cheaper) to remove anyway.
>
> Well, we may have to allocate a new mxid because some members are older than
> FreezeLimit but others are still running. When do we not remove xids that
> would have been cheaper to remove once we decide to actually do work?

My point was that today, on HEAD, there is nothing fundamentally
special about FreezeLimit (aka cutoff_xid) as far as
heap_prepare_freeze_tuple is concerned -- and yet that's the only
cutoff it knows about, really. Why can't we do better, by "exploiting
the difference" between FreezeLimit and OldestXmin?

> > On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
> > OldestXmin at all (it actually just gets FreezeLimit passed as its
> > cutoff_xid argument). It cannot possibly recognize any of this for itself.
>
> It does recognize something like OldestXmin in a more precise and expensive
> way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().

It doesn't look that way to me.

While it's true that FreezeMultiXactId() will call
MultiXactIdIsRunning(), that's only a cross-check. This cross-check is
made at a point where we've already determined that the MultiXact in
question is < cutoff_multi. In other words, it catches cases where a
"MultiXactId < cutoff_multi" Multi contains an XID *that's still
running* -- a correctness issue. Nothing to do with being smart about
avoiding allocating new MultiXacts during freezing, or exploiting the
fact that "FreezeLimit < OldestXmin" (which is almost always true,
usually by a wide margin).

This correctness issue is the same issue discussed in "NB: cutoff_xid
*must* be <= the current global xmin..." comments that appear at the
top of heap_prepare_freeze_tuple. That's all.
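
Here is the relevant part of FreezeMultiXactId() on HEAD, condensed to
show what I mean (paraphrased and trimmed, so don't take it as the
exact code) -- note that the running check only happens after we've
already established that the multi is going away:

    if (MultiXactIdPrecedes(multi, cutoff_multi))
    {
        /*
         * The multi is from before cutoff_multi, so it's definitely
         * going away.  MultiXactIdIsRunning() is purely a sanity
         * cross-check here: a multi this old must not have any
         * still-running members.
         */
        if (MultiXactIdIsRunning(multi,
                                 HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("multixact %u from before cutoff %u found to be still running",
                                     multi, cutoff_multi)));
        /* ... then decide whether xmax can simply be invalidated ... */
    }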

> Hm. I guess I'll have to look at the code for it. It doesn't immediately
> "feel" quite right.

I kinda think it might be. Please let me know if you see a problem
with what I've said.

> > oldestXmin and oldestMxact map to the same wall clock time, more or less --
> > that seems like it might be an important distinction, independent of
> > everything else.
>
> Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
> removably (without creating a newer multi) until much later than its creation
> time. So I don't think that's really true.

Maybe what I said above is true, even though (at the same time) I have
*also* created new problems with "young" member xids. I really don't
know right now, though.

> > Final relfrozenxid values must still be >= FreezeLimit in an aggressive
> > VACUUM (FreezeLimit is still used as an XID-age based backstop there).
> > In non-aggressive VACUUMs (where there is still no strict guarantee that
> > relfrozenxid will be advanced at all), we now advance relfrozenxid by as
> > much as we possibly can. This exploits workload conditions that make it
> > easy to advance relfrozenxid by many more XIDs (for the same amount of
> > freezing/pruning work).
>
> Don't we now always advance relfrozenxid as much as we can, particularly also
> during aggressive vacuums?

I just meant "we hope for the best and accept what we can get". Will fix.

> > * FRM_RETURN_IS_MULTI
> > * The return value is a new MultiXactId to set as new Xmax.
> > * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
> > + *
> > + * "relfrozenxid_out" is an output value; it's used to maintain target new
> > + * relfrozenxid for the relation. It can be ignored unless "flags" contains
> > + * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
> > + * here. This follows the general convention: only track XIDs that will still
> > + * be in the table after the ongoing VACUUM finishes. Note that it's up to
> > + * caller to maintain this when the Xid return value is itself an Xid.
> > + *
> > + * Note that we cannot depend on xmin to maintain relfrozenxid_out.
>
> What does it mean for xmin to maintain something?

Will fix.

> > + * See heap_prepare_freeze_tuple for information about the basic rules for the
> > + * cutoffs used here.
> > + *
> > + * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
> > + * are the current target relfrozenxid and relminmxid for the relation. We
> > + * assume that caller will never want to freeze its tuple, even when the tuple
> > + * "needs freezing" according to our return value.
>
> I don't understand the "will never want to" bit?

I meant "even when it's a non-aggressive VACUUM, which will never want
to wait for a cleanup lock the hard way, and will therefore always
settle for these relfrozenxid_nofreeze_out and
*relminmxid_nofreeze_out values". Note the convention here, which is
relfrozenxid_nofreeze_out is not the same thing as relfrozenxid_out --
the former variable name is used for values in cases where we *don't*
freeze, the latter for values in the cases where we do.

Will try to clear that up.

> > Caller should make temp
> > + * copies of global tracking variables before starting to process a page, so
> > + * that we can only scribble on copies. That way caller can just discard the
> > + * temp copies if it isn't okay with that assumption.
> > + *
> > + * Only aggressive VACUUM callers are expected to really care when a tuple
> > + * "needs freezing" according to us. It follows that non-aggressive VACUUMs
> > + * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
> > + * cases.
>
> Could it make sense to track can_freeze and need_freeze separately?

You mean to change the signature of heap_tuple_needs_freeze, so it
doesn't return a bool anymore? It just has two bool pointers as
arguments, can_freeze and need_freeze?

I suppose that could make sense. Don't feel strongly either way.
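
Something like this, I suppose (purely hypothetical -- nobody has
written it yet; the cutoff arguments are just carried over from the
existing signature):

extern void heap_tuple_needs_freeze(HeapTupleHeader tuple,
                                    TransactionId cutoff_xid,
                                    MultiXactId cutoff_multi,
                                    Buffer buf,
                                    bool *can_freeze,
                                    bool *need_freeze);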

> I may be misreading the diff, but aren't we know continuing to use multi down
> below even if !MultiXactIdIsValid()?

Will investigate.

> Doesn't this mean we unpack the members even if the multi is old enough to
> need freezing? Just to then do it again during freezing? Accessing multis
> isn't cheap...

Will investigate.

> This stanza is repeated a bunch. Perhaps put it in a small static inline
> helper?

Will fix.

> Struct member names starting with an upper case look profoundly ugly to
> me... But this isn't the first one, so I guess... :(

I am in 100% agreement, actually. But you know how it goes...

> I still suspect this will cause a very substantial increase in WAL traffic in
> realistic workloads. It's common to have workloads where tuples are inserted
> once, and deleted once/ partition dropped.

I agree with the principle that this kind of use case should be
accommodated in some way.

> I think we'll have to make this less aggressive or tunable. Random ideas for
> heuristics:

The problem that all of these heuristics have is that they will tend
to make it impossible for future non-aggressive VACUUMs to be able to
advance relfrozenxid. All it takes is a single all-visible page to
make that impossible. As I said upthread, I think that being able
to advance relfrozenxid (and especially relminmxid) by *some* amount
in every VACUUM has non-obvious value.

Maybe you can address that by changing the behavior of non-aggressive
VACUUMs, so that they are directly sensitive to this. Maybe they don't
skip any all-visible pages when there aren't too many, that kind of
thing. That needs to be in scope IMV.
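
For example, something like this (just a hypothetical sketch of such a
heuristic, with an invented threshold -- nothing like this exists in
the patch series today):

#define SKIPPABLE_VISIBLE_FRACTION  0.05    /* invented threshold */

static bool
skip_allvisible_pages(BlockNumber all_visible_skippable,
                      BlockNumber rel_pages)
{
    /*
     * Only give up on this VACUUM's chance to advance relfrozenxid when
     * skipping the all-visible (not all-frozen) pages saves real work.
     */
    return all_visible_skippable >
        (BlockNumber) (rel_pages * SKIPPABLE_VISIBLE_FRACTION);
}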

> I don't know how to parse "thereby making relfrozenxid advancement impossible
> ... will no longer hinder relfrozenxid advancement"?

Will fix.

> > We now consistently avoid leaving behind all-visible (not all-frozen) pages.
> > This (as well as work from commit 44fa84881f) makes relfrozenxid advancement
> > in non-aggressive VACUUMs commonplace.
>
> s/consistently/try to/?

Will fix.

> > The system accumulates freezing debt in proportion to the number of
> > physical heap pages with unfrozen tuples, more or less. Anything based
> > on XID age is likely to be a poor proxy for the eventual cost of
> > freezing (during the inevitable anti-wraparound autovacuum). At a high
> > level, freezing is now treated as one of the costs of storing tuples in
> > physical heap pages -- not a cost of transactions that allocate XIDs.
> > Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
> > influence what we freeze, and when, they effectively become backstops.
> > It may still be necessary to "freeze a page" due to the presence of a
> > particularly old XID, from before VACUUM's FreezeLimit cutoff, though
> > that will be rare in practice -- FreezeLimit is just a backstop now.
>
> I don't really like the "rare in practice" bit. It'll be rare in some
> workloads but others will likely be much less affected.

Maybe. But now the first XID that crosses FreezeLimit will be enough
to trigger freezing the whole page, so it's still very different from today.

I'll change this, though. It's not important.

> I think this means my concern above about increasing mxid creation rate
> substantially may be warranted.

Can you think of an adversarial workload, to get a sense of the extent
of the problem?

> > + * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
> > + * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs
> > + * is encountered, we set *force_freeze to true, making caller freeze the page
> > + * (freezing-eligible XIDs/XMIDs will be frozen, at least). "Backstop
> > + * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
> > + * This shouldn't be necessary very often. VACUUM should prefer to freeze
> > + * when it's cheap (not when it's urgent).
>
> Hm. Does this mean that we might call heap_prepare_freeze_tuple and then
> decide not to freeze?

Yes. And so heap_prepare_freeze_tuple is now a little more like its
sibling function, heap_tuple_needs_freeze.

> Doesn't that mean we might create new multis over and
> over, because we don't end up pulling the trigger on freezing the page?

> Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body
> of the function to figure it out...

Will fix.

> I think this commit message needs a good amount of polishing - it's very
> convoluted. It's late and I didn't sleep well, but I've tried to read it
> several times without really getting a sense of what this precisely does.

It received much less polishing than the others.

Think of 0003 like this:

The logic for skipping a range of blocks using the visibility map
works by deciding the range of skippable blocks (up to
next_unskippable_block) up front. Later, we actually execute the
skipping of this range of blocks (assuming it exceeds
SKIP_PAGES_THRESHOLD). These are two separate steps.

Right now, we do this:

if (skipping_blocks && blkno < nblocks - 1)
{
    /*
     * Tricky, tricky. If this is in aggressive vacuum, the page
     * must have been all-frozen at the time we checked whether it
     * was skippable, but it might not be any more. We must be
     * careful to count it as a skipped all-frozen page in that
     * case, or else we'll think we can't update relfrozenxid and
     * relminmxid. If it's not an aggressive vacuum, we don't
     * know whether it was initially all-frozen, so we have to
     * recheck.
     */
    if (vacrel->aggressive ||
        VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
        vacrel->frozenskipped_pages++;
    continue;
}

The fact that this is conditioned in part on "vacrel->aggressive"
concerns me here. Why should we have a special case for this, where we
condition something on aggressive-ness that isn't actually strictly
related to that? Why not just remember that the range that we're
skipping was all-frozen up-front?

That way non-aggressive VACUUMs are not unnecessarily at a
disadvantage, when it comes to being able to advance relfrozenxid.
What if we end up not incrementing vacrel->frozenskipped_pages when we
easily could have, just because this is a non-aggressive VACUUM? I
think that it's worth avoiding stuff like that whenever possible.
Maybe this particular example isn't the most important one -- it
probably isn't as bad as the one that was fixed by the
lazy_scan_noprune work. But why even take a chance? It seems easier to
just remove the special case -- which is what this really is.
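
In other words, something along these lines (a sketch of the idea
only, not the actual 0003 patch; skipsallfrozen is an invented local
variable):

    /* While establishing the next skippable range, up front: */
    bool        skipsallfrozen = true;

    for (nextblk = blkno; nextblk < next_unskippable_block; nextblk++)
    {
        if (!VM_ALL_FROZEN(vacrel->rel, nextblk, &vmbuffer))
            skipsallfrozen = false;
    }

    /* Later, when actually executing the skip for each page in range: */
    if (skipping_blocks && blkno < nblocks - 1)
    {
        if (skipsallfrozen)
            vacrel->frozenskipped_pages++;
        continue;
    }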

> FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
> causing a lot of time doing IO that we never need, completely trashing all CPU
> caches, while not actually causing decent readaead IO from what I've seen.

I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get
rid of it, we'll need to be sensitive to how that affects relfrozenxid
advancement in non-aggressive VACUUMs IMV.

Thanks again for the review!

--
Peter Geoghegan
