Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2021-11-24 01:01:20
Message-ID: CAH2-Wz=xt-Owmk1FEz5hia5dZgjUT5SFhWFdhrjSgBS0=1C6pA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > For example, we can definitely afford to wait a few more milliseconds
> > to get a cleanup lock just once
>
> We currently have no infrastructure to wait for an lwlock or pincount for a
> limited time. And at least for the former it'd not be easy to add. It may be
> worth adding that at some point, but I'm doubtful this is sufficient reason
> for nontrivial new infrastructure in very performance sensitive areas.

It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time
(not forever) for a cleanup lock in non-aggressive VACUUM, once we
have some of the relfrozenxid stuff we've talked about in place. In a
world where we're smarter about advancing relfrozenxid in
non-aggressive VACUUMs, the choice between waiting for a cleanup lock,
and not waiting (but also not advancing relfrozenxid at all) matters
less -- it's no longer a binary choice.

It's no longer a binary choice because we will have done away with the
current rigid way in which our new relfrozenxid for the relation is
either FreezeLimit, or nothing at all. So far we've only talked about
the case where we can update relfrozenxid with a value that happens to
be much newer than FreezeLimit. If we can do that, that's great. But
what about setting relfrozenxid to an *older* value than FreezeLimit
instead (in a non-aggressive VACUUM)? That's also pretty good! There
is a decent chance that the final "suboptimal" relfrozenxid that we
determine can safely be set in pg_class at the end of our VACUUM will
still be far more recent than the preexisting relfrozenxid --
especially with larger tables.

Advancing relfrozenxid should be thought of as something totally
independent of freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high
level goal for this project.

Remember, FreezeLimit is derived from vacuum_freeze_min_age in the
obvious way: OldestXmin for the VACUUM, minus the vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.
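
To make that concrete, here is a minimal sketch of the derivation
(roughly what vacuum_set_xid_limits() boils down to for FreezeLimit,
ignoring the clamping against autovacuum_freeze_max_age and the
multixact equivalent):

#include "postgres.h"

#include "access/transam.h"

/*
 * Sketch only, not the actual vacuum.c code: FreezeLimit is simply
 * OldestXmin minus the vacuum_freeze_min_age GUC/reloption value.
 */
static TransactionId
sketch_freeze_limit(TransactionId oldest_xmin, int freeze_min_age)
{
    TransactionId limit;

    Assert(TransactionIdIsNormal(oldest_xmin));

    limit = oldest_xmin - (TransactionId) freeze_min_age;

    /* XIDs wrap around -- never go below the first normal XID */
    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;

    return limit;
}

Lowering vacuum_freeze_min_age therefore moves FreezeLimit closer to
OldestXmin (i.e. makes it a newer XID).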

> > All of the autovacuums against the accounts table look similar to this
> > one -- you don't see anything about relfrozenxid being advanced
> > (because it isn't). Does that really make
> > sense, though?
>
> Does what really make sense?

Well, my accounts table example wasn't a particularly good one (it was
a conveniently available example). I am now sure that you got the
point I was trying to make here already, based on what you go on to
say about non-aggressive VACUUMs optionally *not* skipping
all-visible-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response).

On reflection, the simplest way of expressing the same idea is what I
just said about decoupling (decoupling advancing relfrozenxid from
freezing).

> I think pgbench_accounts is just a really poor showcase. Most importantly
> there are no even slightly longer-running transactions that hold down the xid
> horizon. But in real workloads that's incredibly common IME. It's also quite
> uncommon in real workloads to have huge tables in which all records are
> updated. It's more common to have value ranges that are nearly static, and a
> more heavily changing range.

I agree.

> I think the most interesting cases where using the "measured" horizon will be
> advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
> modified tables, including completely static ones, too. Using the "measured"
> horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
> tables, because we'll be able to set a much more recent relfrozenxid.

That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no
reason to emphasize that case in the design itself. The "decoupling"
concept now seems like the key design-level concept -- everything else
follows naturally from that.

> This is becoming more common with the increased use of partitioning.

Also with bulk loading. There could easily be a tiny number of
distinct XIDs that are close together in time, for many many rows --
practically one XID, or even exactly one XID.

> No, not quite. We treat anti-wraparound vacuums as an emergency (including
> logging messages, not cancelling). But the only mechanism we have against
> anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
> not really a "real" mechanism, because it requires an "independent" reason to
> vacuum a table.

Got it.

> I've seen cases where anti-wraparound vacuums weren't a problem / never
> happened for important tables for a long time, because there always was an
> "independent" reason for autovacuum to start doing its thing before the table
> got to be autovacuum_freeze_max_age old. But at some point the important
> tables started to be big enough that autovacuum didn't schedule vacuums that
> got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
> vacuums.

Right. Not just because they were big; also because autovacuum runs at
geometric intervals -- the final reltuples from last time is used to
determine the point at which av runs this time. This might make sense,
or it might not make any sense -- it all depends (mostly on index
stuff).
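
For reference, the trigger condition I mean is roughly the following
(a simplified sketch of autovacuum's relation_needs_vacanalyze() dead
tuple test, ignoring the insert threshold and per-table reloptions) --
the threshold scales with the reltuples left over from the previous
VACUUM, which is what produces the roughly geometric spacing on a
growing table:

#include "postgres.h"

/* Simplified sketch of autovacuum's dead tuple trigger condition */
static bool
sketch_needs_vacuum(double reltuples,        /* pg_class.reltuples */
                    double n_dead_tup,       /* pgstat dead tuple count */
                    int vac_base_thresh,     /* autovacuum_vacuum_threshold */
                    double vac_scale_factor) /* autovacuum_vacuum_scale_factor */
{
    double  vacthresh = vac_base_thresh + vac_scale_factor * reltuples;

    return n_dead_tup > vacthresh;
}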

> Then things started to burn, because of the unpaced anti-wrap vacuums
> clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
> remember the details.

Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause
chaos because they interact badly with automated DDL are something I've
seen several times -- I'm sure you have too. That was what the Manta/Joyent
blogpost I referenced upthread went into.

> Behaviour that lead to a "sudden" falling over, rather than getting gradually
> worse are bad - they somehow tend to happen on Friday evenings :).

These are among our most important challenges IMV.

> Just that autovacuum should have a mechanism to trigger aggressive vacuums
> (i.e. ones that are guaranteed to be able to increase relfrozenxid unless
> cancelled) before getting to the "emergency"-ish anti-wraparound state.

Maybe, but that runs into the problem of needing another GUC that
nobody will ever be able to remember the name of. I consider the idea
of adding a variety of measures that make non-aggressive VACUUM much
more likely to advance relfrozenxid in practice to be far more
promising.

> Or alternatively that we should have a separate threshold for the "harsher"
> anti-wraparound measures.

Or maybe just raise the default of autovacuum_freeze_max_age, which
many people don't change? That might be a lot safer than it once was.
Or will be, once we manage to teach VACUUM to advance relfrozenxid
more often in non-aggressive VACUUMs on Postgres 15. Imagine a world
in which we have that stuff in place, as well as related enhancements
added in earlier releases: autovacuum_vacuum_insert_scale_factor, the
freezemap, and the wraparound failsafe.

These add up to a lot; with all of that in place, the risk we'd be
introducing by increasing the default value of
autovacuum_freeze_max_age would be *far* lower than the risk of making
the same change back in 2006. I bring up 2006 because it was the year
that commit 48188e1621 added autovacuum_freeze_max_age -- the default
hasn't changed since that time.

> I think workloads are a bit more varied than a realistic set of benchmarks
> that one person can run themselves.

No question. I absolutely accept that I only have to miss one
important detail with something like this -- that just goes with the
territory. Just saying that I have yet to see any evidence that the
bypass-indexes behavior really hurt anything. I do take the idea that
I might have missed something very seriously, despite all this.

> I gave you examples of cases that I see as likely being bitten by this,
> e.g. when the skipped index cleanup prevents IOS scans. When both the
> likely-to-be-modified and likely-to-be-queried value ranges are a small subset
> of the entire data, the 2% threshold can prevent vacuum from cleaning up
> LP_DEAD entries for a long time. Or when all index scans are bitmap index
> scans, and nothing ends up cleaning up the dead index entries in certain
> ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
> small rollback / non-HOT update rate can start to be really painful.

That does seem possible. But I consider it very unlikely to appear as
a regression caused by the bypass mechanism itself -- not in any way
that was consistent over time. As far as I can tell, autovacuum
scheduling just doesn't operate at that level of precision, and never
has.

I have personally observed that ANALYZE does a very bad job at
noticing LP_DEAD items in tables/workloads where LP_DEAD items (not
DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE
should count these items as if they were normal tuples seems pretty
bad to me.

Put it this way: imagine you run into trouble with the bypass thing,
and then you opt to disable it on that table (using the INDEX_CLEANUP
reloption). Why should this step solve the problem on its own? In
order for that to work, VACUUM would have to know to be very
aggressive about these LP_DEAD items. But there is good reason to
believe that it just won't ever notice them, as long as ANALYZE is
expected to provide reliable statistics that drive autovacuum --
they're just too concentrated for the block-based approach to truly
work.

I'm not minimizing the risk. Just telling you my thoughts on this.

> I'm a bit doubtful that's as important (which is not to say that it's not
> worth doing). For a heavily updated table the max space usage of the line
> pointer array just isn't as big a factor as ending up with only half the
> usable line pointers.

Agreed; by far the best chance we have of improving the line pointer
bloat situation is preventing it in the first place, by increasing
MaxHeapTuplesPerPage. Once we actually do that, our remaining options
are going to be much less helpful -- then it really is mostly just up
to VACUUM.
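
For context, MaxHeapTuplesPerPage is defined along these lines in
access/htup_details.h (quoting approximately) -- the definition assumes
that every line pointer has a full MAXALIGN'd tuple behind it, which is
exactly why the limit ends up far too low for pages that are full of
bare LP_DEAD line pointers:

#define MaxHeapTuplesPerPage \
    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))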

> And it's harder to diagnose why the
> cleanup isn't happening without knowledge that pages needing cleanup couldn't
> be cleaned up due to pins.
>
> If you want to improve the logic so that we only count pages that would have
> something to clean up, I'd be happy as well. It doesn't have to mean exactly
> what it means today.

It seems like what you really care about here are remaining cases
where our inability to acquire a cleanup lock has real consequences --
you want to hear about it when it happens, however unlikely it may be.
In other words, you want to keep something in log_autovacuum_* that
indicates that "less than the expected amount of work was completed"
due to an inability to acquire a cleanup lock. And so for you, this is
a question of keeping instrumentation that might still be useful, not
a question of how we define things fundamentally, at the design level.

Sound right?

If so, then this proposal might be acceptable to you:

* Remaining DEAD tuples with storage (though not LP_DEAD items from
previous opportunistic pruning) will get counted separately in the
lazy_scan_noprune (no cleanup lock) path. Also count the total number
of distinct pages that were found to contain one or more such DEAD
tuples.

* These two new counters will be reported on their own line in the log
output, though only in the cases where we actually have any such
tuples -- which will presumably be much rarer than simply failing to
get a cleanup lock (that's now no big deal at all, because we now
consistently do certain cleanup steps, and because FreezeLimit isn't
the only viable thing that we can set relfrozenxid to, at least in the
non-aggressive case).

* There is still a limited sense in which the same items get counted
as RECENTLY_DEAD -- though just those aspects that make the overall
design simpler. So the helpful aspects of this are still preserved.

We only need to tell pgstat_report_vacuum() that these items are
"deadtuples" (remaining dead tuples). That can work by having its
caller add a new int64 counter (same new tuple-based counter used for
the new log line) to vacrel->new_dead_tuples. We'd also add the same
new tuple counter in about the same way at the point where we
determine a final vacrel->new_rel_tuples.

So we wouldn't really be treating anything as RECENTLY_DEAD anymore --
pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically
expect anything about RECENTLY_DEAD-ness already.
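
Here is a rough sketch of the shape I have in mind (the struct, field,
and function names are made up for illustration -- only
pgstat_report_vacuum() is real):

#include "postgres.h"

#include "pgstat.h"

/* Hypothetical counters -- names invented for this sketch */
typedef struct LVCountersSketch
{
    int64   new_dead_tuples;    /* existing "remaining dead tuples" total */
    int64   missed_dead_tuples; /* DEAD tuples with storage left behind by
                                 * lazy_scan_noprune (no cleanup lock) */
    int64   missed_dead_pages;  /* distinct pages with >= 1 such tuple */
} LVCountersSketch;

static void
sketch_report_vacuum(Oid relid, bool shared, int64 live_tuples,
                     LVCountersSketch *c)
{
    /* Extra log line only when we actually left DEAD tuples behind */
    if (c->missed_dead_tuples > 0)
        elog(LOG, "%lld dead tuples on %lld pages could not be removed "
             "because a cleanup lock was not acquired",
             (long long) c->missed_dead_tuples,
             (long long) c->missed_dead_pages);

    /* Fold the missed tuples into what pgstat sees as remaining dead */
    pgstat_report_vacuum(relid, shared, live_tuples,
                         c->new_dead_tuples + c->missed_dead_tuples);
}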

> I was thinking of truncations, which I don't think vacuum-reltuples.spec
> tests.

Got it. I'll look into that for v2.

> Maybe. But we've had quite a few bugs because we ended up changing some detail
> of what is excluded in one of the counters, leading to wrong determination
> about whether we scanned everything or not.

Right. But let me just point out that my whole approach is to make
that impossible, by not needing to count pages, except in
scanned_pages (and in frozenskipped_pages + rel_pages). The processing
performed for any page that we actually read during VACUUM should be
uniform (or practically uniform), by definition. With minimal fudging
in the cleanup lock case (because we mostly do the same work there
too).

There should be no reason for any more page counters now, except for
non-critical instrumentation. For example, if you want to get the
total number of pages skipped via the visibility map (not just
all-frozen pages), then you simply subtract scanned_pages from
rel_pages.
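
In code form the derived counter is just this (names as used in this
thread; assumes the usual postgres.h/storage/block.h headers):

/* Pages skipped via the VM fall out of the one real page counter */
static inline BlockNumber
sketch_vm_skipped_pages(BlockNumber rel_pages, BlockNumber scanned_pages)
{
    /* scanned_pages can never exceed rel_pages */
    return rel_pages - scanned_pages;
}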

> > Fundamentally, this will only work if we decide to only skip all-frozen
> > pages, which (by definition) only happens within aggressive VACUUMs.
>
> Hm? Or if there's just no runs of all-visible pages of sufficient length, so
> we don't end up skipping at all.

Of course. But my point was: who knows when that'll happen?

> One reason for my doubt is the following:
>
> We can set all-visible on a page without a FPW image (well, as long as hint
> bits aren't logged). There's a significant difference between needing to WAL
> log FPIs for every heap page or not, and it's not that rare for data to live
> shorter than autovacuum_freeze_max_age or that limit never being reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea. The only essential part is
"opportunistic freezing during vacuum, when the cost is clearly very
low, and the benefit is probably high". And so it now seems you were
making a far more limited statement than I first believed.

Obviously many variations are possible -- there is a spectrum.
Example: a heuristic that makes VACUUM notice when it is going to
freeze at least one tuple on a page that will be marked all-visible in
any case -- when that happens, we should instead freeze every tuple on
the page and mark the page all-frozen, batching the work (we could
account for LP_DEAD items here too, by not counting them, on the
assumption that they'll become LP_UNUSED during the second heap pass
later on).

If we see these conditions, then the likely explanation is that the
tuples on the heap page happen to have XIDs that are "split" by the
not-actually-important FreezeLimit cutoff, despite being essentially
similar in every way that matters.

If you want to make the same heuristic more conservative: only do this
when no existing tuples are frozen, since that could be taken as a
sign of the original heuristic not quite working on the same heap page
at an earlier stage.

I suspect that even very conservative versions of the same basic idea
would still help a lot.
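
Something like the following, say (entirely hypothetical -- the inputs
are just per-page facts that the pruning/freezing pass already knows):

/* Hypothetical sketch of the batched-freezing heuristic */
static bool
sketch_freeze_whole_page(bool will_become_all_visible,
                         int tuples_forced_frozen, /* XIDs < FreezeLimit */
                         int tuples_already_frozen,
                         bool conservative)
{
    /* Only batch up freezing on pages we're marking all-visible anyway */
    if (!will_become_all_visible)
        return false;

    /*
     * At least one tuple must be frozen no matter what, so the page gets
     * dirtied (and likely WAL-logged) regardless of what we decide here
     */
    if (tuples_forced_frozen == 0)
        return false;

    /* Conservative variant: skip pages where freezing already happened */
    if (conservative && tuples_already_frozen > 0)
        return false;

    /* Freeze every remaining tuple, set the page all-frozen in the VM */
    return true;
}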

> Perhaps we can have most of the benefit even without that. If we were to
> freeze whenever it didn't cause an additional FPW, and perhaps didn't skip
> all-visible but not-all-frozen pages if they were less than x% of the
> to-be-scanned data, we should be able to still increase relfrozenxid in a
> lot of cases?

I bet that's true. I like that idea.

If we had this policy, then the number of "extra"
visited-in-non-aggressive-vacuum pages (all-visible but not yet
all-frozen pages) could be managed over time through more
opportunistic freezing. This might make it work even better.

These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever? There won't be
so many of these pages when we do this anyway, by definition -- since
we'd have a heuristic that limited the total number (say to no more
than 10% of the total relation size, something like that).
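
As a sketch (the 10% figure is only the example number from above, and
the names are made up):

/* Hypothetical: should a non-aggressive VACUUM visit the "tenured"
 * all-visible (but not all-frozen) pages rather than skip them? */
static bool
sketch_scan_tenured_pages(BlockNumber rel_pages,
                          BlockNumber all_visible_not_frozen_pages)
{
    /* Cap the extra work at 10% of the table; opportunistic freezing
     * should keep us under the cap most of the time anyway */
    return all_visible_not_frozen_pages <= rel_pages / 10;
}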

This way we're smoothing out the work that currently takes place all
at once during an aggressive VACUUM.

Moreover, there is perhaps a good chance that the total number of
all-visible-but-not-all-frozen heap pages will *stay* low over time, as a
result of this policy actually working -- there may be a virtuous
cycle that totally prevents us from getting an aggressive VACUUM even
once.

> > I have occasionally wondered if the whole idea of reading heap pages
> > with only a pin (and having cleanup locks in VACUUM) is really worth
> > it -- alternative designs seem possible. Obviously that's a BIG
> > discussion, and not one to have right now. But it seems kind of
> > relevant.
>
> With 'reading' do you mean reads-from-os, or just references to buffer
> contents?

The latter.

[1] https://postgr.es/m/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com
--
Peter Geoghegan
