Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-13 04:38:18
Message-ID: CAD21AoBWhbKJjV4tZ0EretU0GJn+OuTJwD-3X_1kaQ3rL_GnAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 11, 2021 at 12:10 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > Thank you for working on this!
>
> I'm glad that I finally found time for it! It seems like it'll make
> things easier elsewhere.
>
> Attached is v3 of the index. I'll describe the changes I made in more
> detail in my response to your points below.
>
> > I agree that btm_oldest_btpo_xact will no longer be necessary in terms
> > of recycling deleted pages.
>
> Cool.
>
> > Given that we can guarantee that deleted pages never be leaked by
> > using 64-bit XID, I also think we don't need this value. We can do
> > amvacuumcleanup only if the table receives enough insertions to update
> > the statistics (i.g., vacuum_cleanup_index_scale_factor check). I
> > think this is a more desirable behavior. Not skipping amvacuumcleanup
> > if there is even one deleted page that we can recycle is very
> > conservative.
> >
> > Considering your idea of keeping newly deleted pages in the meta page,
> > I can see a little value that keeping btm_oldest_btpo_xact and making
> > it 64-bit XID. I described below.
>
> > Interesting.
> >
> > I like this idea that triggers btvacuumscan() if there are many newly
> > deleted pages. I think this would be helpful especially for the case
> > of bulk-deletion on the table. But why we use the number of *newly*
> > deleted pages but not the total number of deleted pages in the index?
>
> I was unclear here -- I should not have said "newly deleted" pages at
> all. What I actually do when calling _bt_vacuum_needs_cleanup() is
> this (from v3, at the end of btvacuumscan()):
>
> - _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
> + Assert(stats->pages_deleted >= stats->pages_free);
> + pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
> + _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
> info->num_heap_tuples);
>
> We're actually passing something I have called
> "pages_deleted_not_free" here, which is derived from the bulk delete
> stats in the obvious way that you see here (subtraction). I'm not
> using pages_newly_deleted at all now. Note also that the behavior
> inside _bt_update_meta_cleanup_info() no longer varies based on
> whether it is called during btvacuumcleanup() or during btbulkdelete()
> -- the same rules apply either way. We want to store
> pages_deleted_not_free in the metapage at the end of btvacuumscan(),
> no matter what.
>
> This same pages_deleted_not_free information is now used by
> _bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
> high (over 2.5%), then that will trigger a call to btbulkdelete() (we
> won't skip scanning the index). Though in practice it probably won't
> come up that often -- there just aren't ever that many deleted pages
> in most indexes.

Thanks for your explanation. That makes sense to me.

>
> > IIUC if several btbulkdelete executions deleted index pages less than
> > 2% of the index and those deleted pages could not be recycled yet,
> > then the number of recyclable pages would exceed 2% of the index in
> > total but amvacuumcleanup() would not trigger btvacuumscan() because
> > the last newly deleted pages are less than the 2% threshold. I might
> > be missing something though.
>
> I think you're right -- my idea of varying the behavior of
> _bt_update_meta_cleanup_info() based on whether it's being called
> during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
> half the problem was that I explained the idea badly to begin with).
> But, as I said, it's fixed in v3: we simply pass
> "pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
> now.
>
> Does that make sense? Does it address this concern?

Yes!

>
> > Also, we need to note that having newly deleted pages doesn't
> > necessarily mean these always are recyclable at that time. If the
> > global xmin is still older than deleted page's btpo.xact values, we
> > still could not recycle them. I think btm_oldest_btpo_xact probably
> > will help this case. That is, we store the oldest btpo.xact among
> > those newly deleted pages to btm_oldest_btpo_xact and we trigger
> > btvacuumscan() if there are many newly deleted pages (more than 2% of
> > index) and the btm_oldest_btpo_xact is older than the global xmin (I
> > suppose each newly deleted pages could have different btpo.xact).
>
> I agree that having no XID in the metapage creates a new small
> problem. Specifically, there are certain narrow cases that can cause
> confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
> exist before my patch (kind of).
>
> The simplest example is easy to run into when debugging the patch on
> your laptop. Because you're using your personal laptop, and not a real
> production server, there will be no concurrent sessions that might
> consume XIDs. You can run VACUUM VERBOSE manually several times, but
> that alone will never be enough to enable VACUUM to recycle any of the
> pages that the first VACUUM manages to delete (many to mark deleted,
> reporting the pages as "newly deleted" via the new instrumentation
> from the second patch). Note that the master branch is *also* unable
> to recycle these deleted pages, simply because the "safe xid" never
> gets old because there are no newly allocated XIDs to make it look old
> (there are no allocated XIDs just because nothing else happens). That
> in itself is not the new problem.
>
> The new problem is that _bt_vacuum_needs_cleanup() will no longer
> notice that the oldest XID among deleted-but-not-yet-recycled pages is
> so old that it will not be able to recycle the pages anyway -- at
> least not the oldest page, though in this specific case that will
> apply to all deleted pages equally. We might as well not bother trying
> yet, which the old code "gets right" -- but it doesn't get it right
> for any good reason. That is, the old code won't have VACUUM scan the
> index at all, so it "wins" in this specific scenario.

I'm on the same page.

>
> I think that's okay, though -- it's not a real problem, and actually
> makes sense and has other advantages. This is why I believe it's okay:
>
> * We really should never VACUUM the same table before even one or two
> XIDs are allocated -- that's what happens in the simple laptop test
> scenario that I described. Surely we should not be too concerned about
> "doing the right thing" under this totally artificial set of
> conditions.

Right.

>
> (BTW, I've been using txid_current() for my own "laptop testing", as a
> way to work around this issue.)
>
> * More generally, if you really can't do recycling of pages that you
> deleted during the last VACUUM during this VACUUM (perhaps because of
> the presence of a long-running xact that holds open a snapshot), then
> you have lots of *huge* problems already, and this is the least of
> your concerns. Besides, at that point an affected VACUUM will be doing
> work for an affected index through a btbulkdelete() call, so the
> behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
>

I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
the table again. But if there is a long-running query a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is DBA problem but it also means this could bring
another new problem in a long-running query case.

> * As you pointed out already, the oldest XID/deleted page from the
> index may be significantly older than the newest. Why should we bucket
> them together?

I agree with this point.

>
> We could easily have a case where most of the deleted pages can be
> recycled -- even when all indexes were originally marked deleted by
> the same VACUUM operation. If there are lots of pages that actually
> can be recycled, it is probably a bad thing to assume that the oldest
> XID is representative of all of them. After all, with the patch we
> only go out of our way to recycle deleted pages when we are almost
> sure that the total number of recyclable pages (pages marked deleted
> during a previous VACUUM) exceeds 2.5% of the total size of the index.
> That broad constraint is important here -- if we do nothing unless
> there are lots of deleted pages anyway, we are highly unlikely to ever
> err on the side of being too eager (not eager enough seems more likely
> to me).
>
> I think that we're justified in making a general assumption inside
> _bt_vacuum_needs_cleanup() (which is documented at the point that we
> call it, inside btvacuumscan()): The assumption that however many
> index pages the metapage says we'll be able to recycle (whatever the
> field says) will in fact turn out to be recyclable if we decide that
> we need to. There are specific cases where that will be kind of wrong,
> as I've gone into, but the assumption/design has many more advantages
> than disadvantages.
>
> I have tried to capture this in v3 of the patch. Can you take a look?
> See the new comments inside _bt_vacuum_needs_cleanup(). Plus the
> comments when we call it inside btvacuumscan().

I basically agreed with the change made in v3 patch. But I think it's
probably worth having a discussion on append-only table cases with
autovacuums triggered by
autovacuum_vacuum_insert_scale_factor/threshold.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-02-13 05:04:45 Re: 64-bit XIDs in deleted nbtree pages
Previous Message Bharath Rupireddy 2021-02-13 04:24:08 Re: Why do we have MakeSingleTupleTableSlot instead of not using MakeTupleTableSlot?