Re: Defining (and possibly skipping) useless VACUUM operations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Chris Travers <chris(dot)travers(at)gmail(dot)com>
Subject: Re: Defining (and possibly skipping) useless VACUUM operations
Date: 2021-12-14 20:38:00
Message-ID: CAH2-WznHQWhzzJzTgXj2JTu2bpaw7toO3KsuK1UCa7mzNcEMRg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 14, 2021 at 10:47 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Well I just don't understand why you insist on using the word
> "skipping." I think what we're talking about - or at least what we
> should be talking about - is whether relation_needs_vacanalyze() sets
> *wraparound = true right after the comment that says /* Force vacuum
> if table is at risk of wraparound */. And adding some kind of
> exception to the logic that's there now.

Actually, I agree. Skipping is the wrong term, especially because the
phrase "VACUUM skips..." is already too overloaded. Not necessarily in
vacuumlazy.c itself, but certainly on the mailing list.
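
For reference, the check Robert describes can be sketched roughly as follows. This is a simplified Python model, not the actual C code in autovacuum.c: the function names are made up for illustration, XIDs are plain integers reduced modulo 2^32, and the parallel multixact check is left out. Any exception of the kind being discussed would presumably be an extra condition on the final return value.

```python
FIRST_NORMAL_XID = 3      # XIDs 0-2 are reserved for special purposes
XID_MODULUS = 2 ** 32     # XIDs are 32-bit counters that wrap around


def xid_precedes(a, b):
    """True if normal XID a is logically older than normal XID b (modulo 2^32)."""
    diff = (a - b) % XID_MODULUS
    return diff >= 2 ** 31  # same as a negative signed 32-bit difference


def needs_wraparound_vacuum(relfrozenxid, recent_xid, freeze_max_age):
    """Hypothetical stand-in for the 'force vacuum' test on a single table."""
    force_limit = (recent_xid - freeze_max_age) % XID_MODULUS
    if force_limit < FIRST_NORMAL_XID:
        # Step back past the reserved XIDs, wrapping around if needed.
        force_limit = (force_limit - FIRST_NORMAL_XID) % XID_MODULUS
    is_normal = relfrozenxid >= FIRST_NORMAL_XID
    return is_normal and xid_precedes(relfrozenxid, force_limit)
```

With freeze_max_age set to 200 million, a table whose relfrozenxid is more than 200 million XIDs behind recent_xid is flagged, and one that is only 150 million behind is not.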

> Yeah, I hadn't thought about it from that perspective, but that does
> seem very good. I think it's inevitable that there will be cases where
> that doesn't work out - e.g. you can always force the bad case by
> holding a table lock until your newborn heads off to college, or just
> by overthrottling autovacuum so that it can't get through the database
> in any reasonable amount of time - but it will be nice when it does
> work out, for sure.

Right. But when the patch doesn't manage to totally prevent
anti-wraparound VACUUMs, things still work out a lot better than they
would now. I would expect that in practice this will usually only
happen when non-aggressive autovacuums keep getting canceled. And
sure, it's still not ideal that things have come to that. But because
we now do freezing earlier (when it's relatively inexpensive), and
because we set all-frozen bits incrementally, the anti-wraparound
autovacuum will at least be able to reuse any freezing that we manage
to do in all those canceled autovacuums.

I think that this tends to make anti-wraparound VACUUMs mostly about
not being cancelable -- not so much about reliably advancing
relfrozenxid. I mean it doesn't change the basic rules (there is no
change to the definition of aggressive VACUUM), but in practice I
think that it'll just work that way. Which makes a great deal of
sense. I hope to be able to totally get rid of
vacuum_freeze_table_age.

The freeze map work in PostgreSQL 9.6 was really great, and very
effective. But I think that it had an undesirable interaction with
vacuum_freeze_min_age: if we set a heap page as all-visible (but not
all-frozen) before some of its tuples reached that age (which is very
likely), then tuples younger than vacuum_freeze_min_age aren't going to
get frozen until whenever we do an aggressive autovacuum. Very often, this
will only happen when we next do an anti-wraparound VACUUM (at least
before Postgres 13). I suspect we risk running into a "debt cliff" in
the eventual anti-wraparound autovacuum. And so while
vacuum_freeze_min_age kinda made sense prior to 9.6, it now seems to
make a lot less sense.
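
The interaction described above can be illustrated with a toy model. Everything here is an assumption made for illustration: the Page class, the helper names, and the simplification that every page a VACUUM visits ends up all-visible. It is not how vacuumlazy.c is structured; it only shows how tuples left unfrozen on an all-visible page stay unfrozen until an aggressive pass.

```python
from dataclasses import dataclass


@dataclass
class Page:
    unfrozen_ages: list        # XID ages of tuples that are not yet frozen
    all_visible: bool = False  # visibility map bit (toy version)


def vacuum(pages, freeze_min_age, aggressive=False):
    """Toy VACUUM: non-aggressive passes skip all-visible pages."""
    for page in pages:
        if page.all_visible and not aggressive:
            continue  # skipped, so its young tuples are never revisited
        # Freeze tuples that have reached freeze_min_age; keep the rest.
        page.unfrozen_ages = [a for a in page.unfrozen_ages if a < freeze_min_age]
        page.all_visible = True  # assumption: no dead tuples remain


def advance_xids(pages, n):
    """Simulate n XIDs being consumed, which ages every unfrozen tuple."""
    for page in pages:
        page.unfrozen_ages = [a + n for a in page.unfrozen_ages]


def unfrozen_tuples(pages):
    """Total number of tuples still waiting to be frozen."""
    return sum(len(p.unfrozen_ages) for p in pages)
```

In this model, tuples that were too young to freeze when their page was first marked all-visible are still unfrozen after any number of later non-aggressive passes, however old they have become. Only a call with aggressive=True clears them, all at once.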

> > I guess that that makes avoiding useless vacuuming seem like less of a
> > priority. ISTM that it should be something that is squarely aimed at
> > keeping things stable in truly pathological cases.
>
> Yes. I think "pathological cases" is a good summary of what's wrong
> with autovacuum.

This is 100% my focus, in general. The main goal of the patch I'm
working on isn't so much improving performance as making it more
predictable over time. Focussing on freezing while costs are low has a
natural tendency to spread the costs out over time. The system should
never "get in over its head" with debt that vacuum is expected to
eventually deal with.

> When there's nothing too crazy happening, it actually
> does pretty well. But, when resources are tight or other corner cases
> occur, really dumb things start to happen. So it's reasonable to think
> about how we can install guard rails that prevent complete insanity.

Another thing that I really want to stamp out is anything involving a
tiny, seemingly-insignificant adverse event that has the potential to
cause disproportionate impact over time. For example, right now a
non-aggressive VACUUM will never be able to advance relfrozenxid when
it cannot get a cleanup lock on one heap page. It's actually extremely
unlikely that that should have much of any impact, at least when you
determine the new relfrozenxid for the table intelligently. Not
acquiring one cleanup lock on one heap page on a huge table should not
have such an extreme impact.
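
One possible reading of "determine the new relfrozenxid for the table intelligently" is sketched below. This is an interpretation, not the patch itself: the function name, the dictionary layout for pages, and the use of plain integers (ignoring XID wraparound) are all assumptions. The idea is that a page whose cleanup lock could not be acquired can still be read, so the oldest XID left on it merely bounds the new relfrozenxid instead of blocking any advance.

```python
def new_relfrozenxid(pages, freeze_cutoff, oldest_xmin):
    """Return the oldest XID that remains unfrozen after a toy VACUUM pass.

    pages: list of {"xids": [...], "cleanup_lock": bool} (hypothetical layout)
    freeze_cutoff: XIDs older than this are frozen when the lock is held
    oldest_xmin: upper bound used when nothing older is left behind
    """
    candidate = oldest_xmin
    for page in pages:
        for xid in page["xids"]:
            if page["cleanup_lock"] and xid < freeze_cutoff:
                continue  # this tuple gets frozen, so it no longer matters
            # Tuple stays unfrozen: it limits how far relfrozenxid can move.
            candidate = min(candidate, xid)
    return candidate
```

In this sketch, a page that could not be cleanup-locked only holds relfrozenxid back if it actually contains an old XID, which is the unlikely case for a single page on a huge table.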

It's even worse when the systemic impact over time is considered.
Let's say you only have a 20% chance of failing to acquire one or more
cleanup locks during a non-aggressive autovacuum for a given large
table, meaning that you'll fail to advance relfrozenxid in at least
20% of all non-aggressive autovacuums. I think that that might be a
lot worse than it sounds, because the impact compounds over time --
I'm not sure that 20% is much better than 60%, or much worse than 5%
(very hard to model it). But if we make the high-level, abstract idea
of "aggressiveness" more of a continuous thing, and not something
that's defined by sharp (and largely meaningless) XID-based cutoffs,
we have every chance of nipping these problems in the bud (without
needing to model much of anything).
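
As a baseline for the percentages above, here is the naive arithmetic, assuming each non-aggressive autovacuum fails to advance relfrozenxid independently with the same probability. That independence assumption and the function names are illustrative only. The point of the paragraph above is that the real impact compounds over time and is very hard to model, so this is at most a starting intuition.

```python
def prob_no_advance_streak(p_fail, k):
    """Probability that k consecutive autovacuums all fail to advance relfrozenxid."""
    return p_fail ** k


def expected_vacuums_until_advance(p_fail):
    """Mean number of autovacuums needed for one successful advance (geometric)."""
    if not 0 <= p_fail < 1:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)


if __name__ == "__main__":
    for p in (0.05, 0.20, 0.60):
        print(p, prob_no_advance_streak(p, 3), expected_vacuums_until_advance(p))
```

Under these assumptions the three failure rates differ only modestly in the expected number of autovacuums per advance, which is one reason the naive model probably understates how the effects accumulate on a real system.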

--
Peter Geoghegan
