From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New IndexAM API controlling index vacuum strategies
Date: 2021-03-14 03:23:21
Message-ID: CAH2-Wznpywm4qparkQxf2ns3c7BugC4x7VzKjiB8OCYswwu-=g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 12, 2021 at 9:34 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> I agreed that when we're close to overrunning the
> maintenance_work_mem space, the situation changes. If we skip it even
> in that case, the next vacuum will likely use up
> maintenance_work_mem, leading to a second index scan. Which is
> bad.
>
> If this threshold is aimed at avoiding a second index scan due to
> overrunning maintenance_work_mem, using a ratio of
> maintenance_work_mem would be a good idea. On the other hand, if it's
> to avoid accumulating debt affecting the cost of index vacuuming,
> using a ratio of the total heap tuples seems better.

It's both, together. These are two *independent*
considerations/thresholds. At least in the code that decides whether
or not we skip. Either threshold can force a full index scan (index
vacuuming).

What I'm really worried about is falling behind (in terms of the
amount of memory available for TIDs to delete in indexes) without any
natural limit. Suppose we just have the SKIP_VACUUM_PAGES_RATIO
threshold (i.e. no maintenance_work_mem threshold thing). With just
SKIP_VACUUM_PAGES_RATIO there will be lots of tables where index
vacuuming is almost always avoided, which is good. But
SKIP_VACUUM_PAGES_RATIO might be a bit *too* effective. If we have to
do 2 or even 3 scans of the index when we finally get to index
vacuuming then that's not great, it's inefficient -- but it's at least
*survivable*. But what if there are 10, 100, even 1000 bulk delete
calls for each index when it finally happens? That's completely
intolerable.

In other words, I am not worried about debt, exactly. Debt is normal
in moderation. Healthy, even. I am worried about bankruptcy, perhaps
following a rare and extreme event. It's okay to be imprecise, but all
of the problems must be survivable. The important thing to me for a
maintenance_work_mem threshold is that there is *some* limit. At the
same time, it may totally be worth accepting 2 or 3 index scans during
some eventual VACUUM operation if there are many more VACUUM
operations that don't even touch the index -- that's a good deal!
Also, it may actually be inherently necessary to accept a small risk
of having a future VACUUM operation that does multiple scans of each
index -- that is probably a necessary part of skipping index vacuuming
each time.

Think about the cost of index vacuuming (the amount of I/O and the
duration of index vacuuming) as less and less memory is available for
TIDs. It's non-linear. The cost explodes once we're past a certain
point. The truly important thing is to "never get killed by the
explosion".

> The situation we need to deal with here is a very large table
> that has a lot of dead tuples but those fit in fewer heap pages (less
> than 1% of all heap blocks). In this case, it's likely that the number
> of dead tuples also is relatively small compared to the total heap
> tuples, as you mentioned. If dead tuples fit in fewer pages but
> accounted for most of all heap tuples, it would be a more serious
> situation; there would definitely already be other problems.
> So considering those conditions, I agreed to use a ratio of
> maintenance_work_mem as a threshold. Maybe we can increase the
> constant to 70, 80, or so.

You mean 70% of maintenance_work_mem? That seems fine to me. See my
"Why does lazy_vacuum_table_and_indexes() not make one decision for
the entire VACUUM on the first call, and then stick to its decision?"
remarks at the end of this email, though -- maybe it should not be an
explicit threshold at all.

High level philosophical point: In general I think that the algorithm
for deciding whether or not to perform index vacuuming should *not* be
clever. It should also not focus on getting the benefit of skipping
index vacuuming. I think that a truly robust design will be one that
always starts with the assumption that index vacuuming will be
skipped, and then "works backwards" by considering thresholds/reasons
to *not* skip. For example, the SKIP_VACUUM_PAGES_RATIO thing. The
risk of "explosions" or "bankruptcy" can be thought of as a cost here,
too.
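
Here is roughly the shape I have in mind -- a sketch only, with
made-up names and constants (SKIP_VACUUM_PAGES_RATIO is from your
patch; MWM_DEAD_TIDS_RATIO and the function itself are hypothetical):

/*
 * Sketch: default to skipping, then "work backwards" through each
 * reason not to skip.  Either threshold alone forces index vacuuming.
 */
#define SKIP_VACUUM_PAGES_RATIO    0.01    /* from the patch */
#define MWM_DEAD_TIDS_RATIO        0.70    /* hypothetical: 70% of TID space */

static int
skip_index_vacuuming(long rel_blocks, long npages_deadlp,
                     long dead_tids, long max_dead_tids)
{
    /*
     * Reason 1: accumulated debt -- too many heap pages have dead
     * line pointers, relative to the whole table.
     */
    if (npages_deadlp > rel_blocks * SKIP_VACUUM_PAGES_RATIO)
        return 0;

    /*
     * Reason 2: bankruptcy risk -- too close to overrunning the
     * maintenance_work_mem TID space.
     */
    if (dead_tids > max_dead_tids * MWM_DEAD_TIDS_RATIO)
        return 0;

    /* No reason not to skip */
    return 1;
}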

We should simply focus on the costs directly, without even trying to
understand the relationship between each of the costs, and without
really trying to understand the benefit to the user from skipping
index vacuuming.

> > Have you thought more about how the index vacuuming skipping can be
> > configured by users? Maybe a new storage param, that works like the
> > current SKIP_VACUUM_PAGES_RATIO constant?
>
> Since it’s unclear to me yet that adding a new storage parameter or
> GUC parameter for this feature would be useful even for future
> improvements in this area, I haven't thought yet about having users
> control skipping index vacuuming. I’m okay with a constant value for
> the threshold for now.

I agree -- a GUC will become obsolete in only a year or two anyway.
And it will be too hard to tune.

Question about your patch: lazy_vacuum_table_and_indexes() can be
called multiple times (when low on maintenance_work_mem). Each time it
is called we decide what to do for that call and that batch of TIDs.
But...why should it work that way? The whole idea of a
SKIP_VACUUM_PAGES_RATIO style threshold doesn't make sense to me if
the code in lazy_vacuum_table_and_indexes() resets npages_deadlp (sets
it to 0) on each call. I think that npages_deadlp should never be
reset during a single VACUUM operation.

npages_deadlp is supposed to be something that we track for the entire
table. The patch actually compares it to the size of the whole table *
SKIP_VACUUM_PAGES_RATIO inside lazy_vacuum_table_and_indexes():

> + if (*npages_deadlp > RelationGetNumberOfBlocks(onerel) * SKIP_VACUUM_PAGES_RATIO)
> + {

> + }

The code that I have quoted here is actually how I expect
SKIP_VACUUM_PAGES_RATIO to work, but I notice an inconsistency:
lazy_vacuum_table_and_indexes() resets npages_deadlp later on, which
makes either the quoted code or the reset code wrong (at least when
VACUUM needs multiple calls to the lazy_vacuum_table_and_indexes()
function). With multiple calls to lazy_vacuum_table_and_indexes() (due
to low memory), we'll be comparing npages_deadlp to the wrong thing --
because npages_deadlp cannot be treated as a proportion of the blocks
in the *whole table*. Maybe the resetting of npages_deadlp would be
okay if you also used the number of heap blocks that were considered
since the last npages_deadlp reset, and then multiplied that by
SKIP_VACUUM_PAGES_RATIO (instead of
RelationGetNumberOfBlocks(onerel)). But I suspect that the real
solution is to not reset npages_deadlp at all (without changing the
quoted code, which seems basically okay).
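
In code terms, what I'd expect is something like this toy version
(hand-written here, not taken from your patch -- the function and
argument names are made up):

/*
 * Toy model: the page count accumulates across every call within one
 * VACUUM, so comparing it against the whole table stays meaningful.
 */
static long npages_deadlp = 0;      /* never reset during this VACUUM */

static void
vacuum_one_batch(long rel_blocks, long batch_pages_with_dead_lps)
{
    npages_deadlp += batch_pages_with_dead_lps;

    if (npages_deadlp > rel_blocks * 0.01)  /* SKIP_VACUUM_PAGES_RATIO */
    {
        /*
         * Threshold crossed for the table as a whole: do index
         * vacuuming.  Note what's missing: no "npages_deadlp = 0;",
         * here or anywhere else.
         */
    }
}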

With tables/workloads that the patch helps a lot, we expect that the
SKIP_VACUUM_PAGES_RATIO threshold will *eventually* be crossed by one
of these VACUUM operations, which *finally* triggers index vacuuming.
So not only do we expect npages_deadlp to be tracked at the level of
the entire VACUUM operation -- we might even imagine it growing slowly
over multiple VACUUM operations, perhaps over many months. At least
conceptually -- it should only grow across VACUUM operations, until
index vacuuming finally takes place. That's my mental model for
npages_deadlp, at least. It tracks an easy-to-understand cost, which,
as I said, is what the
threshold/algorithm/lazy_vacuum_table_and_indexes() should focus on.

Why does lazy_vacuum_table_and_indexes() not make one decision for the
entire VACUUM on the first call, and then stick to its decision? That
seems much cleaner. Another advantage of that approach is that it
might be enough to handle low maintenance_work_mem risks -- perhaps
those can be covered by simply waiting until the first VACUUM
operation that runs out of memory and so requires multiple
lazy_vacuum_table_and_indexes() calls. If at that point we decide to
do index vacuuming throughout the entire VACUUM operation, then we
will not allow the table to accumulate many more TIDs than we can
expect to fit in an entire maintenance_work_mem space.

Under this scheme, the "maintenance_work_mem threshold" can be thought
of as an implicit thing (it's not a constant/multiplier or anything)
-- it is >= 100% of maintenance_work_mem, in effect.
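
A sketch of what I mean (all names here are hypothetical):

#include <stdbool.h>

/*
 * Decide-once sketch.  A mid-scan call to
 * lazy_vacuum_table_and_indexes() can only mean that the dead TID
 * array filled maintenance_work_mem, so an incomplete heap scan on
 * the first call is itself the implicit ">= 100%" memory threshold.
 */
typedef struct VacDecision
{
    bool    decided;
    bool    do_index_vacuuming;
} VacDecision;

static bool
decide_index_vacuuming(VacDecision *d, bool heap_scan_complete,
                       bool crossed_pages_ratio)
{
    if (!d->decided)
    {
        d->do_index_vacuuming = !heap_scan_complete || crossed_pages_ratio;
        d->decided = true;
    }

    /* Every later call just sticks to the first call's decision */
    return d->do_index_vacuuming;
}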

--
Peter Geoghegan
