Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-16 12:16:32
Message-ID: CAD21AoAtUxOLcmrxW_oU1_7+5s56kypQu3Wz5NOeGn=yC7XqGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 16, 2021 at 3:52 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Actually, there is one more reason why I bring up idea 1 now: I want
> > to hear your thoughts on the index AM API questions now, which idea 1
> > touches on. Ideally all of the details around the index AM VACUUM APIs
> > (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> > btbulkdelete()) won't need to change much in the future. I worry about
> > getting this index AM API stuff right, at least a little.
>
> Speaking of problems like this, I think I spotted an old one: we call
> _bt_update_meta_cleanup_info() in either btbulkdelete() or
> btvacuumcleanup(). I think that we should always call it in
> btvacuumcleanup(), though -- even in cases where there is no call to
> btvacuumscan() inside btvacuumcleanup() (because btvacuumscan()
> happened earlier instead, during the btbulkdelete() call).
>
> This makes the value of IndexVacuumInfo.num_heap_tuples (which is what
> we store in the metapage) much more accurate -- right now it's always
> pg_class.reltuples from *before* the VACUUM started. And so the
> btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often
> kind of inaccurate.
>
> This "estimate during ambulkdelete" issue is documented here (kind of):
>
> /*
> * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
> *
> * num_heap_tuples is accurate only when estimated_count is false;
> * otherwise it's just an estimate (currently, the estimate is the
> * prior value of the relation's pg_class.reltuples field, so it could
> * even be -1). It will always just be an estimate during ambulkdelete.
> */
> typedef struct IndexVacuumInfo
> {
> ...
> }
>
> The name of the metapage field is already
> btm_last_cleanup_num_heap_tuples, which already suggests the approach
> that I propose now. So why don't we do it like that already?
>
> (Thinks some more...)
>
> I wonder: did this detail change at the last minute during the
> development of the feature (just before commit 857f9c36) back in early
> 2018? That change have made it easier to deal with
> oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to
> the patch -- so maybe it's truly an accident that the code doesn't
> work the way that I suggest it should already. (It's annoying to make
> state from btbulkdelete() appear in btvacuumcleanup(), unless it's
> from IndexVacuumInfo or something -- I can imagine this changing at
> the last minute, just for that reason.)
>
> Do you think that this needs to be treated as a bug in the
> backbranches, Masahiko? I'm not sure...

Ugh, yes, I think it's a bug.

When developing this feature, in an old version patch, we used to set
invalid values to both btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these
values. But we decided to set valid values to both even in
btbulkdelete(). I believe that decision was correct in terms of
btm_oldest_btpo_xact because with the old version patch we will do an
unnecessary index scan during btvacuumcleanup(). But it’s wrong in
terms of btm_last_cleanup_num_heap_tuples, as you pointed out.

This bug would make the check of vacuum_cleanup_index_scale_factor
untrust. So I think it’s better to backpatch but I think we need to
note that to fix this issue properly, in a case where a vacuum called
btbulkdelete() earlier, probably we should update only
btm_oldest_btpo_xact in btbulkdelete() and then update
btm_last_cleanup_num_heap_tuples in btvacuumcleanup(). In this case,
we don’t know the oldest btpo.xact among the deleted pages in
btvacuumcleanup(). This means that we would need to update the meta
page twice, leading to WAL logging twice. Since we already could
update the meta page more than once when a vacuum calls btbulkdelete()
multiple times I think it would not be a problem, though.

>
> In any case we should probably make this change as part of Postgres
> 14. Don't you think? It's certainly easy to do it this way now, since
> there will be no need to keep around a oldestBtpoXact value until
> btvacuumcleanup() (in the common case where btbulkdelete() is where we
> call btvacuumscan()). The new btm_last_cleanup_num_delpages field
> (which replaces btm_oldest_btpo_xact) has a value that just comes from
> the bulk stats, which is easy anyway.

Agreed.

As I mentioned above, we might need to consider how btbulkdelete() can
tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a
vacuum called btbulkdelete earlier. During parallel vacuum, two
different processes could do btbulkdelete() and btvacuumcleanup()
respectively. Updating those values separately in those callbacks
would be straightforward.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-02-16 13:18:59 Re: Parallel INSERT (INTO ... SELECT ...)
Previous Message Matthias van de Meent 2021-02-16 11:39:08 Re: progress reporting for partitioned REINDEX