Re: decoupling table and index vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: decoupling table and index vacuum
Date: 2022-02-09 19:27:26
Message-ID: CAH2-Wznt2nAemTnoWYU5dpau_x7NMFiXv2LQoZ_LNqso4h069w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 9, 2022 at 6:13 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Just to be clear, when I say that the dead index tuples don't matter
> here, I mean from the point of view of the index. From the point of
> view of the table, the presence of dead index tuples (or even the
> potential presence of dead tuples) pointing to dead line pointers is
> an issue that can drive heap bloat. But from the point of view of the
> index, because we don't ever merge sibling index pages, and because we
> have kill_prior_tuple, there's not much value in freeing up space in
> index pages unless it either prevents a split or lets us free the
> whole page. So I agree with Peter that index growth is what really
> matters.

One small caveat that I'd add is this: heap fragmentation likely makes
it harder to avoid page splits in indexes, to a degree. It is arguably
one cause of the page splits that do happen in the indexes on a table
like pgbench_tellers, with standard pgbench (and lots of throughput,
lots of clients). The tellers (and also branches) primary key tends to
double in size in my recent tests (still way better than a 20x
increase in size, which is what happened in Postgres 11 and maybe even
13). I think that it might be possible to perfectly preserve the
original index size (even with ANALYZE running now and again) by
setting heap fill factor very low, maybe 50 or less.

This is a minor quibble, though. It still makes sense to think of heap
fragmentation as a problem for the heap itself, and not for indexes at
all, since the effect I describe is relatively insignificant, and just
about impossible to model. The problem really is that the heap pages
are failing to hold their original logical rows in place -- the index
size issue is more of a symptom than a problem unto itself.

> However, I have a concern that Peter's idea to use the previous index
> growth to drive future index vacuuming distinction is retrospective
> rather than prospective. If the index is growing more than it should
> based on the data volume, then evidently we didn't do enough vacuuming
> at some point in the past.

That's a very valid concern. As you know, the great advantage of
retrospectively considering what's not going well (and reducing
everything to some highly informative measure like growth in index
size) is that it's reliable -- you don't have to understand precisely
how things got that way, which is just too complicated to get right.
And as you point out, the great disadvantage is that it has already
happened -- which might already be too late.

More on that later...

> It's reasonable to step up our efforts in
> the present to make sure that the problem doesn't continue, but in
> some sense it's already too late. What we would really like is a
> measure that answers the question: is the index going to bloat in the
> relatively near future if we don't vacuum it now? I think that the
> dead tuple count is trying, however imperfectly, to figure that out.
> All other things being equal, the more dead tuples there are in the
> index, the more bloat we're going to have later if we don't clean them
> out now.

One of the key intuitions behind bottom-up index deletion is to treat
the state of an index page as a dynamic thing, not a static thing. The
information that we take from the page that drives our decisions is
very reliable on average, over time, in the aggregate. At the same
time, the information is very noisy, and could be wrong in important
ways at just about any time. The fundamental idea was to take
advantage of the first property, without ever getting killed by the
second property.

To me this seems conceptually similar to how one manages risk when
playing a game of chance while applying the Kelly criterion. The
criterion provides probabilistic certainty on the best strategy to use
in a situation where we have known favorable odds, an infinite series
of bets, and a personal bankroll. I recently came across a great blog
post about it, which gets the idea across well:

https://explore.paulbutler.org/bet/

If you scroll down to the bottom of the page, there are some general
conclusions, some of which are pretty counterintuitive. Especially
"Maximizing expected value can lead to a bad long-run strategy".
Growth over time is much more important than anything else, since you
can play as many individual games as you like -- provided you never go
bankrupt. I think that *a lot* of problems can be usefully analyzed
that way, or at least benefit from a general awareness of certain
paradoxical aspects of probability. Bankruptcy must be recognized as
qualitatively different to any non-zero bankroll. A little bit like
how splitting a leaf page unnecessarily is truly special.

Once we simply avoid ruin, we can get the benefit of playing a game
with favorable odds -- growth over time is what matters, not
necessarily our current bankroll. It sounds like a fairly small
difference, but that's deceptive -- it's a huge difference.
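
To make the "maximizing expected value can lead to a bad long-run
strategy" point concrete, here's a toy C simulation (nothing to do
with PostgreSQL; the 60/40 coin and even-money payoff are arbitrary
assumptions on my part). The all-in strategy maximizes expected value
on every individual bet, but it almost surely ends in ruin; the Kelly
strategy stakes only f = 2p - 1 of the bankroll and simply compounds:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        double  p = 0.6;                        /* chance of winning each bet */
        double  kelly_fraction = 2.0 * p - 1.0; /* Kelly stake for even money */
        double  all_in = 1.0;                   /* bankroll, staking everything */
        double  kelly = 1.0;                    /* bankroll, staking Kelly fraction */
        int     rounds = 200;

        srand(12345);
        for (int i = 0; i < rounds; i++)
        {
            int     win = ((double) rand() / RAND_MAX) < p;

            /* Maximizing expected value: stake the whole bankroll each time */
            all_in = win ? all_in * 2.0 : 0.0;

            /* Kelly: stake a fixed fraction, so one loss can never mean ruin */
            kelly += (win ? 1.0 : -1.0) * kelly * kelly_fraction;
        }

        printf("all-in bankroll after %d rounds: %g\n", rounds, all_in);
        printf("Kelly  bankroll after %d rounds: %g\n", rounds, kelly);
        return 0;
    }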

> The problem is not with that core idea, which IMHO is actually good,
> but that all other things are NOT equal.

...now to get back to talking about VACUUM itself, and to this project.

I couldn't agree more -- all other things are NOT equal. We need a way
to manage the risk that things could change quickly when an index that
we believe has many dead tuples hasn't grown at all just yet. It's
probably also true that we should try to predict the near future, and
not 100% rely on the fact that what we've been doing seems to have
worked so far -- I do accept that.

We should probably dispense with the idea that we'll be making these
decisions about what to do with an index like this (bloated in a way
that bottom-up index deletion just won't help with) in an environment
similar to the one that the current "skip index scan when # heap pages
with one or more LP_DEAD items < 2% of rel_pages" mechanism operates
in. That
mechanism has to be very conservative because we just don't know when
the next opportunity to vacuum indexes will be -- we almost have to
assume that the decision will be static, and made exactly once, so we
better be defensive. But why should that continue to be true with the
conveyor belt stuff in place, and with periodic mini-vacuums that
coordinate over time? I don't think it has to be like that. We can
make it much more dynamic.
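
For context, the current mechanism boils down to a single, static
test. Here's a simplified sketch of it, in the spirit of what
vacuumlazy.c does today (paraphrased from memory -- the real code also
considers how many dead TIDs we've accumulated, which I've left out,
so don't take the details as gospel):

    #include "postgres.h"
    #include "storage/block.h"

    /*
     * Simplified sketch of today's one-shot "bypass index vacuuming" test.
     */
    static bool
    bypass_index_vacuuming(BlockNumber rel_pages, BlockNumber lpdead_item_pages)
    {
        /* Skip index vacuuming when < 2% of heap pages have LP_DEAD items */
        double      threshold = (double) rel_pages * 0.02;

        return (double) lpdead_item_pages < threshold;
    }

The point is that this test runs once per VACUUM, with no opportunity
to revisit the decision later -- which is exactly what the conveyor
belt approach could change.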

I can imagine a two-way dialog between the index AM and
vacuumlazy.c that takes place over time. The index AM might be able to
report something along the lines of:

"While I think that this index has more dead index tuples then it
really should, the fact is that it hasn't grown at all, even by one
single leaf page. And so don't you [vacuumlazy.c] should not make me
vacuum the index right now. But be careful -- check back in again in
another minute or two, because the situation must be assumed to be
pretty volatile for now."
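
Concretely (and purely hypothetically -- no such interface exists
today, and every field name below is invented for illustration), that
report might just be a small struct that the index AM fills in and
hands back to vacuumlazy.c:

    /*
     * Hypothetical sketch only -- nothing like this exists today.
     */
    typedef struct IndexVacuumAdvice
    {
        bool        vacuum_me_now;      /* "I really do need index vacuuming" */
        bool        volatile_situation; /* "I'm fine now, but check back soon" */
        double      est_dead_fraction;  /* AM's own guess at dead index tuples */
        int         recheck_in_seconds; /* suggested delay until the next check */
    } IndexVacuumAdvice;

vacuumlazy.c would then weigh that advice against what it knows about
LP_DEAD items in the heap, instead of treating one up-front decision
as final.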

This is just an example, not a concrete design. My point is that
approaching the problem dynamically makes it *vastly* easier to do the
right thing. It's far easier to manage the consequences of being wrong
than it is to be right all the time. We're going to be wrong anyway,
so better to be wrong on our own terms.

> I'm not hung up on using the # of dead tuples specifically as the
> metric for index vacuuming, and it may be best to pick some other
> measure. But I am a little suspicious that if the only measure is past
> index growth, we will let some situations go too far before we wake up
> and do something about them. My intuition is that it would be a good
> idea to come up with something we could measure, even if it's
> imperfect, that would give us some clue that trouble is brewing before
> pages actually start splitting. Now maybe my intuition is wrong and
> there is nothing better, but I think it's worth a thought.

We will need something like that. I think that LP_DEAD items (or
would-be LP_DEAD items -- tuples with storage that would get pruned
into LP_DEAD items if we were to prune) in the table are much more
interesting than dead heap-only tuples, and also more interesting than
dead index tuples. Especially the distribution of such LP_DEAD items
in the table, and their concentration. That does seem much more likely
to be robust as a quantitative driver of index vacuuming.

In the extreme case where there are a huge number of LP_DEAD items in
the table, we're going to want to make them LP_UNUSED anyway,
which implies that we'll do index vacuuming to make it safe. Since
that's already going to be true, maybe we should try to find a way to
usefully scale the behavior, so that maybe some indexes are vacuumed
sooner when the number of LP_DEAD items is increasing. Not really
sure, but that seems more promising than anything else.
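
Just to illustrate the kind of scaling I have in mind (hypothetical
function, made-up thresholds), vacuumlazy.c could translate the
concentration of heap pages with LP_DEAD items into an urgency score,
and vacuum the most split-prone indexes first as that score climbs:

    #include "postgres.h"
    #include "storage/block.h"

    /*
     * Hypothetical sketch, with made-up thresholds: turn the concentration
     * of LP_DEAD items in the heap into a 0.0 - 1.0 urgency score that
     * could drive how many indexes get vacuumed in this cycle.
     */
    static double
    index_vacuum_urgency(BlockNumber rel_pages, BlockNumber lpdead_item_pages)
    {
        double      lpdead_frac;

        if (rel_pages == 0)
            return 0.0;

        lpdead_frac = (double) lpdead_item_pages / rel_pages;

        if (lpdead_frac < 0.02)
            return 0.0;             /* keep bypassing, as we do today */
        if (lpdead_frac >= 0.10)
            return 1.0;             /* vacuum all indexes now */

        /* In between: vacuum some indexes sooner than others */
        return (lpdead_frac - 0.02) / 0.08;
    }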

--
Peter Geoghegan
