Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: xid wraparound danger due to INDEX_CLEANUP false
Date: 2020-11-20 19:37:22
Message-ID: CAH2-Wz=+txMTk8TZ5+ipuvpiQfDVrE4Nt6hx7MjDuA66rE5Brw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 20, 2020 at 3:17 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> I had missed your bottom-up index deletion patch but it's a promising
> improvement. With that patch, the number of dead tuples in individual
> indexes may differ. So it's important that we make index vacuuming a
> per-index thing.

Right, but it's important to remember that even bottom-up index
deletion isn't really special in theory. Bottom-up index deletion is
"just" a reliable version of the existing LP_DEAD index deletion thing
(which has been around since Postgres 8.2). In theory it doesn't
change the fundamental nature of the problem. In practice it does,
because it makes it very obvious to pgsql-hackers that indexes on the
same table can have very different needs from VACUUM. And the actual
differences we see are much bigger now. Even still, the fact that you
had certain big differences across indexes on the same table is not a
new thing. (Actually, you can even see this on the master branch in
Victor's bottom-up deletion benchmark, where the primary key index
actually doesn't grow on the master branch, even after 8 hours.)

The bottom-up index deletion patch (and the enhancements we're talking
about here, for VACUUM itself) are based on "the generational
hypothesis" that underlies generational garbage collection. The
philosophy is the same. See:

https://plumbr.io/handbook/garbage-collection-in-java/generational-hypothesis

In theory, "most garbage comes from new objects" is "just" an
empirical observation, that may or may not be true with each
workload/Java program/whatever. In practice it is important enough to
be a big part of how every modern garbage collector works -- it's
almost always true, and even when it isn't true it doesn't actually
hurt to make the assumption that it is true and then be wrong. I
believe that we have to take a holistic view of the problem to make
real progress.

Andres said something similar in a recent blog post:

https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon

"In most workloads the majority of accesses are to live tuples, and
when encountering non-live tuple versions they are either very old, or
very new."

(This was just a coincidence, but it was good to see that he made the
same observation.)

> Given that patch, it seems to me that it would be better to ask
> individual index AM before calling to bulkdelete about the needs of
> bulkdelete. That is, passing VACUUM options and the number of
> collected dead tuples etc. to index AM, we ask index AM via a new
> index AM API whether it wants to do bulkdelete or not. We call
> bulkdelete for only indexes that answered 'yes'. If we got 'no' from
> any one of the indexes, we cannot have a second heap pass.
> INDEX_CLEANUP is not enforceable. When INDEX_CLEANUP is set to false,
> we expect index AMs to return 'no' unless they have a special reason
> for the needs of bulkdelete.

I don't have a very detailed idea of the interface or anything. There
are a few questions that naturally present themselves, that I don't
have good answers to right now. Obviously vacuumlazy.c will only treat
this feedback from each index as an advisory thing. So what happens
when 50% of the indexes say yes and 50% say no? This is a subproblem
that must be solved as part of this work. Ideally it will be solved by
you. :-)

> One possible benefit of this idea even without bottom-up index
> deleteion patch would be something like
> vacuum_index_cleanup_scale_factor for bulkdelete. For example, in the
> case where the amount of dead tuple is slightly larger than
> maitenance_work_mem the second time calling to bulkdelete will be
> called with a small amount of dead tuples, which is less efficient. If
> an index AM is able to determine not to do bulkdelete by comparing the
> number of dead tuples to a threshold, it can avoid such bulkdelete
> calling.

I agree. Actually, I thought the same thing myself, even before I
realized that bottom-up index deletion was possible.

> Also, as a future extension, once we have retail index deletion
> feature, we might be able to make that API return a ternary value:
> 'no', 'do_bulkdelete', ‘do_indexscandelete, so that index AM can
> choose the appropriate method of index deletion based on the
> statistics.

I agree again!

We may eventually be able to make autovacuum run very frequently
against each table in many important cases, with each VACUUM taking
very little wall-clock time. We don't have to change the fundamental
design to fix most of the current problems. I suspect that the "top
down" nature of VACUUM is sometimes helpful. We just need to
compensate when this design is insufficient. Getting the "best of both
worlds" is possible.

> But for making index vacuuming per-index thing, we need to deal with
> the problem that we cannot know which indexes of the table still has
> index tuples pointing to the collected dead tuple. For example, if an
> index always says 'no' (not need bulkdelete therefore we need to keep
> dead line pointers), the collected dead tuples might already be marked
> as LP_DEAD and there might already not be index tuples pointing to
> them in other index AMs. In that case we don't want to call to
> bulkdelete for other indexes. Probably we can have additional
> statistics like the number of dead tuples in individual indexes so
> that they can determine the needs of bulkdelete. But it’s not a
> comprehensive solution.

Right. Maybe we don't ask the index AMs for discrete yes/no answers.
Maybe we can ask them for a continuous answer, such as a value between
0.0 and 1.0 that represents the urgency/bloat, or something like that.
And so the final yes/no answer that really does have to be made for
the table as a whole (does VACUUM do a second pass over the heap to
make LP_DEAD items into LP_UNUSED items?) can at least consider the
worst case for each index. And maybe the average case, too.

(I am just making random suggestions to stimulate discussion. Don't
take these specific suggestions about the am interface too seriously.)

> > If you are able to pursue this project, in whole or in part, I would
> > definitely be supportive of that. I may be able to commit it. I think
> > that this project has many benefits, not just one or two. It seems
> > strategic.
>
> Thanks, that’s really helpful. I’m going to work on that. Since things
> became complicated by these two features that I proposed I’ll do my
> best to sort out the problem and improve it in PG14.

Excellent! Thank you.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-11-20 19:56:50 Re: enable_incremental_sort changes query behavior
Previous Message James Coleman 2020-11-20 18:51:09 Re: enable_incremental_sort changes query behavior