Re: 64-bit XIDs in deleted nbtree pages

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-26 00:58:30
Message-ID: CAH2-WzmAzuK_832ytOjsktqNXNQuHTroE7Ndnm094a0dqGmOgg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 25, 2021 at 5:42 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> btvacuumcleanup() has been playing two roles: recycling deleted pages
> and collecting index statistics.

Right.

I pushed the VACUUM VERBOSE "index pages newly deleted"
instrumentation patch earlier -- it really isn't complicated or
controversial, so I saw no reason to hold it back.

Attached is v7, which now only has the final patch -- the optimization
that makes it possible for VACUUM to recycle pages that were newly
deleted during the same VACUUM operation. Still no real changes.
Again, I just wanted to keep CFBot happy. I haven't thought about or
improved this final patch recently, and it clearly needs more work to
be ready to commit.

> If we don't want btvacuumcleanup() to collect index statistics, we can
> remove vacuum_cleanup_index_scale_factor (at least from btree
> perspectives), as you mentioned. One thing that may be worth
> mentioning is that the difference between the index statistics taken
> by ANALYZE and btvacuumcleanup() is that the former statistics are
> always an estimate, calculated by compute_index_stats(), whereas the
> latter uses the result of an index scan. If
> btvacuumcleanup() doesn’t scan the index and always returns NULL, it
> would become hard to get accurate index statistics, for example in a
> static table case. I've not checked in which cases index statistics
> calculated by compute_index_stats() are inaccurate, though.

The historical context makes it easier to understand what to do here
-- it makes it clear that an amvacuumcleanup() routine does not (or
should not) do any index scan when the index hasn't been (and won't
be) modified by the current VACUUM operation. The relevant sgml doc
sentence I
quoted to you recently ("It is OK to return NULL if the index was not
changed at all during the VACUUM operation...") was added by commit
e57345975cf in 2006. Much of the relevant 2006 discussion is here,
FWIW:

https://www.postgresql.org/message-id/flat/26433.1146598265%40sss.pgh.pa.us#862ee11c24da63d0282e0025abbad19c

So now we have the formal rules for index AMs, as well as background
information about what various hackers (mostly Tom) were considering
when the rules were written.

> According to the doc, if amvacuumcleanup/btvacuumcleanup returns NULL,
> it means the index is not changed at all. So do_analyze_rel() executed
> by VACUUM ANALYZE also doesn't need to update the index statistics
> even when amvacuumcleanup/btvacuumcleanup returns NULL. No?

Consider hashvacuumcleanup() -- here it is in full (it hasn't really
changed since 2006, when it was updated by that same commit I cited):

/*
 * Post-VACUUM cleanup.
 *
 * Result: a palloc'd struct containing statistical info for VACUUM displays.
 */
IndexBulkDeleteResult *
hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    Relation    rel = info->index;
    BlockNumber num_pages;

    /* If hashbulkdelete wasn't called, return NULL signifying no change */
    /* Note: this covers the analyze_only case too */
    if (stats == NULL)
        return NULL;

    /* update statistics */
    num_pages = RelationGetNumberOfBlocks(rel);
    stats->num_pages = num_pages;

    return stats;
}

Clearly hashvacuumcleanup() was considered by Tom when he revised the
documentation in 2006. Here are some observations about
hashvacuumcleanup() that seem relevant now:

* There is no "analyze_only" handling, just like nbtree.

"analyze_only" is only used by GIN, even now, 15+ years after it was
added. GIN uses it to make autovacuum workers (never VACUUM outside of
an AV worker) do pending list insertions for ANALYZE -- just to make
it happen more often. This is a niche thing -- clearly we don't have
to care about it in nbtree, even if we make btvacuumcleanup() (almost)
always return NULL when there was no btbulkdelete() call.

* num_pages (which will become pg_class.relpages for the index) is not
set when we return NULL -- hashvacuumcleanup() assumes that ANALYZE
will get to it eventually in the case where VACUUM does no real work
(when it just returns NULL).

* We also use RelationGetNumberOfBlocks() to set pg_class.relpages for
index relations during ANALYZE -- it's called when we call
vac_update_relstats() (I quoted this do_analyze_rel() code to you
directly in a recent email; a paraphrased version appears right after
this list).

* In general, pg_class.relpages isn't an estimate (because we use
RelationGetNumberOfBlocks(), both in the VACUUM-updates case and the
ANALYZE-updates case) -- only pg_class.reltuples is truly an estimate
during ANALYZE, and so getting a "true count" seems to have only
limited practical importance.
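
Here is what I mean, paraphrasing the do_analyze_rel() index loop from
memory (so the exact variable names may not match analyze.c line for
line): relpages comes straight from RelationGetNumberOfBlocks(), and
only the reltuples figure is extrapolated from the sample.

for (ind = 0; ind < nindexes; ind++)
{
    AnlIndexData *thisdata = &indexdata[ind];
    double      totalindexrows;

    /* reltuples: extrapolated from the sampled rows */
    totalindexrows = ceil(thisdata->tupleFract * totalrows);

    /* relpages: taken directly from RelationGetNumberOfBlocks() */
    vac_update_relstats(Irel[ind],
                        RelationGetNumberOfBlocks(Irel[ind]),
                        totalindexrows,
                        0,      /* relallvisible */
                        false,  /* hasindex */
                        InvalidTransactionId,
                        InvalidMultiXactId,
                        in_outer_xact);
}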

I think that this sets a precedent in support of my view that we can
simply get rid of vacuum_cleanup_index_scale_factor without any
special effort to maintain pg_class.reltuples. As I said before, we
can safely make btvacuumcleanup() just like hashvacuumcleanup(),
except when there are known deleted-but-not-recycled pages, where a
full index scan really is necessary for reasons that are not related
to statistics at all (of course we still need the *logic* that was
added to nbtree by the vacuum_cleanup_index_scale_factor commit --
that is clearly necessary). My guess is that Tom would have made
btvacuumcleanup() look identical to hashvacuumcleanup() in 2006 if
nbtree didn't have page deletion to consider -- but that had to be
considered.
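
To be concrete about the shape that I have in mind, here is a rough
sketch (not an actual patch -- helper names and details are
approximate, following the current structure of btvacuumcleanup()).
The only real change is that the _bt_vacuum_needs_cleanup() test
becomes purely a question of whether the metapage says that there are
deleted-but-not-recycled pages, with no
vacuum_cleanup_index_scale_factor component at all:

IndexBulkDeleteResult *
btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    /* No-op in ANALYZE ONLY mode, just like hashvacuumcleanup() */
    if (info->analyze_only)
        return stats;

    if (stats == NULL)
    {
        /*
         * btbulkdelete() was never called.  Under the scheme described
         * above we only scan the index when there are known
         * deleted-but-not-recycled pages; otherwise we return NULL,
         * just like hashvacuumcleanup(), and leave the pg_class
         * statistics to ANALYZE.
         */
        if (!_bt_vacuum_needs_cleanup(info))
            return NULL;

        stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
        btvacuumscan(info, stats, NULL, NULL, 0);
    }

    return stats;
}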

My reasoning here is also based on the tendency of the core code to
mostly think of hash indexes as very similar to nbtree indexes.

Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law". Realistically, hash indexes are far
less popular than nbtree indexes, and so even if I am 100% correct in
theory, the real world might not be so convinced by my legalistic
argument. We've already seen the issue with VACUUM ANALYZE (which has
not been truly consistent with the behavior of hashvacuumcleanup() for
many years). There might be more.

I suppose I could ask Tom what he thinks. The hardest question is what
to do in the back branches... I really don't have a strong opinion
right now.

> > BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
> > cases following the deduplication commit until my bug fix commit
> > 48e12913 (which was kind of a hack itself). This meant that the
> > statistics set by btvacuumcleanup (in the case where btbulkdelete
> > doesn't get called, the relevant case for
> > vacuum_cleanup_index_scale_factor) were affected. So it was 100%
> > wrong for months before anybody noticed (or at least until anybody
> > complained).
> >
>
> Maybe we need more regression tests here.

I agree, but my point was that even a 100% broken approach to stats
within btvacuumcleanup() is not that noticeable. This supports the
idea that it just doesn't matter very much if a cleanup-only scan of
the index never takes place (or only takes place when we need to
recycle deleted pages, which is generally rare already and will become
rarer still once I commit the attached patch).

Also, my fix for this bug (commit 48e12913) was actually pretty bad;
there are now cases where a btvacuumcleanup()-only VACUUM will set
pg_class.reltuples to a value that is significantly below what it
should be (it all depends on how effective deduplication is with the
data). I probably should have made btvacuumcleanup()-only VACUUMs set
"stats->estimate_count = true", purely to make sure that the core code
doesn't trust the statistics too much (the value is okay for VACUUM
VERBOSE output, but not for much else). Right now we can get a
pg_class.reltuples that is "exactly wrong" -- it would actually be a
big improvement if it were "approximately correct".

Another new concern for me (one that is unique to Postgres 13) is
autovacuum_vacuum_insert_scale_factor-driven autovacuums.

--
Peter Geoghegan

Attachment: v7-0001-Recycle-pages-deleted-during-same-VACUUM.patch
Content-Type: text/x-patch
Size: 13.5 KB
