Re: 64-bit XIDs in deleted nbtree pages

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: 64-bit XIDs in deleted nbtree pages
Date: 2021-02-10 10:19:25
Message-ID: CAD21AoCVj0B5jVs3YnqA=hDii=+m+N3=N49dyazuxJbEsjX9ug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 10, 2021 at 10:53 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > The first patch teaches nbtree to use 64-bit transaction IDs here, and
> > so makes it impossible to leak deleted nbtree pages. This patch is the
> > nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> > due to exactly the same set of problems.

Thank you for working on this!

>
> There is an unresolved question for my deleted page XID patch: what
> should it do about the vacuum_cleanup_index_scale_factor feature,
> which added an XID to the metapage (its btm_oldest_btpo_xact field). I
> refer to the work done by commit 857f9c36cda for Postgres 11 by
> Masahiko. It would be good to get your opinion on this as the original
> author of that feature, Masahiko.
>
> To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
> all deleted pages in the index, so clearly it needs to be carefully
> considered in my patch to make the XIDs 64-bit. Even still, v1 of my
> patch from today more or less ignores the issue -- it just gets a
> 32-bit version of the oldest value as determined by the oldestBtpoXact
> XID tracking stuff (which is largely unchanged, except that it works
> with 64-bit Full Transaction Ids now).
>
> Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
> field to wrap around in v1 of my patch. The obvious thing to do here
> is to add a new epoch metapage field, effectively making
> btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
> idea. The only reason that we have the btm_oldest_btpo_xact field in
> the first place is to ameliorate the problem that the patch
> comprehensively solves! We should stop storing *any* XIDs in the
> metapage. (Besides, adding a new "epoch" field to the metapage would
> be relatively messy.)

I agree that btm_oldest_btpo_xact will no longer be necessary in terms
of recycling deleted pages.

The purpose of btm_oldest_btpo_xact is to prevent deleted pages from
being leaked. As you mentioned, it has the oldest btpo.xact in
BTPageOpaqueData among all deleted pages in the index. Looking back to
the time when we develop INDEX_CLEANUP option if we skip index cleanup
(meaning both ambulkdelete and amvaucumcleanup), there was a problem
in btree indexes that deleted pages could never be recycled if XID
wraps around. So the idea behind btm_oldest_btpo_xact is, we remember
the oldest btpo.xact among the all deleted pages and do btvacuumscan()
if this value is older than global xmin (meaning there is at least one
recyclable page). That way, we can recycle the deleted pages without
leaking the pages (of course, unless INDEX_CLEANUP is disabled).

Given that we can guarantee that deleted pages never be leaked by
using 64-bit XID, I also think we don't need this value. We can do
amvacuumcleanup only if the table receives enough insertions to update
the statistics (i.g., vacuum_cleanup_index_scale_factor check). I
think this is a more desirable behavior. Not skipping amvacuumcleanup
if there is even one deleted page that we can recycle is very
conservative.

Considering your idea of keeping newly deleted pages in the meta page,
I can see a little value that keeping btm_oldest_btpo_xact and making
it 64-bit XID. I described below.

>
> Here is a plan that allows us to stop storing any kind of XID in the
> metapage in all cases:
>
> 1. Stop maintaining the oldest XID among all deleted pages in the
> entire nbtree index during VACUUM. So we can remove all of the
> BTVacState.oldestBtpoXact XID tracking stuff, which is currently
> something that even _bt_pagedel() needs special handling for.
>
> 2. Stop considering the btm_oldest_btpo_xact metapage field in
> _bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
> cares about maintaining reasonably accurate statistics for the index.
> Which is really how the vacuum_cleanup_index_scale_factor feature was
> intended to work all along, anyway -- ISTM that the oldestBtpoXact
> stuff was always just an afterthought to paper-over this annoying
> 32-bit XID issue.
>
> 3. We cannot actually remove the btm_oldest_btpo_xact XID field from
> the metapage, because of course that would change the BTMetaPageData
> struct layout, which breaks on-disk compatibility. But why not use it
> for something useful instead? _bt_update_meta_cleanup_info() can use
> the same field to store the number of "newly deleted" pages from the
> last btbulkdelete() instead. (See my email from earlier for the
> definition of "newly deleted".)
>
> 4. Now _bt_vacuum_needs_cleanup() can once again consider the
> btm_oldest_btpo_xact metapage field -- except in a totally different
> way, because now it means something totally different: "newly deleted
> pages during last btbulkdelete() call" (per item 3). If this # pages
> is very high then we probably should do a full call to btvacuumscan()
> -- _bt_vacuum_needs_cleanup() will return true to make that happen.
>
> It's unlikely but still possible that a high number of "newly deleted
> pages during the last btbulkdelete() call" is in itself a good enough
> reason to do a full btvacuumscan() call when the question of calling
> btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4
> here conservatively covers that. Maybe the 32-bit-XID-in-metapage
> triggering condition had some non-obvious value due to a natural
> tendency for it to limit the number of deleted pages that go
> unrecycled for a long time. (Or maybe there never really was any such
> natural tendency -- still seems like a good idea to make the change
> described by item 4.)
>
> Even though we are conservative (at least in this sense I just
> described), we nevertheless don't actually care about very old deleted
> pages that we have not yet recycled -- provided there are not very
> many of them. I'm thinking of "~2% of index" as the new "newly deleted
> during last btbulkdelete() call" threshold applied within
> _bt_vacuum_needs_cleanup(). There is no good reason why older
> deleted-but-not-yet-recycled pages should be considered more valuable
> than any other page that can be used when there is a page split.

Interesting.

I like this idea that triggers btvacuumscan() if there are many newly
deleted pages. I think this would be helpful especially for the case
of bulk-deletion on the table. But why we use the number of *newly*
deleted pages but not the total number of deleted pages in the index?
IIUC if several btbulkdelete executions deleted index pages less than
2% of the index and those deleted pages could not be recycled yet,
then the number of recyclable pages would exceed 2% of the index in
total but amvacuumcleanup() would not trigger btvacuumscan() because
the last newly deleted pages are less than the 2% threshold. I might
be missing something though.

Also, we need to note that having newly deleted pages doesn't
necessarily mean these always are recyclable at that time. If the
global xmin is still older than deleted page's btpo.xact values, we
still could not recycle them. I think btm_oldest_btpo_xact probably
will help this case. That is, we store the oldest btpo.xact among
those newly deleted pages to btm_oldest_btpo_xact and we trigger
btvacuumscan() if there are many newly deleted pages (more than 2% of
index) and the btm_oldest_btpo_xact is older than the global xmin (I
suppose each newly deleted pages could have different btpo.xact).

>
> Observations about on-disk compatibility with my patch + this 4 point scheme:
>
> A. It doesn't matter that pg_upgrade'd indexes will have an XID value
> in btm_oldest_btpo_xact that now gets incorrectly interpreted as
> "newly deleted pages during last btbulkdelete() call" under the 4
> point scheme I just outlined.
>
> The spurious value will get cleaned up on the next VACUUM anyway
> (whether VACUUM goes through btbulkdelete() or through
> btvacuumcleanup()). Besides, most indexes always have a
> btm_oldest_btpo_xact value of 0.
>
> B. The patch I posted earlier doesn't actually care about the
> BTREE_VERSION of the index at all. And neither does any of the stuff I
> just described for a future v2 of my patch.
>
> All indexes can use the new format for deleted pages. On-disk
> compatibility is easy here because the contents of deleted pages only
> need to work as a tombstone. We can safely assume that old-format
> deleted pages (pre-Postgres 14 format deleted pages) must be safe to
> recycle, because the pg_upgrade itself restarts Postgres. There can be
> no backends that have dangling references to the old-format deleted
> page.
>
> C. All supported nbtree versions (all nbtree versions
> BTREE_MIN_VERSION+) get the same benefits under this scheme.
>
> Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to
> BTREE_NOVAC_VERSION/version 3 indexes via a call to
> _bt_upgrademetapage() -- that has been the case since BTREE_VERSION
> was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's
> vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will
> have the btm_oldest_btpo_xact metapage field that I now propose to
> reuse to track "newly deleted pages during last btbulkdelete() call",
> per point 4.
>
> In summary: There are no special cases here. No BTREE_VERSION related
> difficulties. That seems like a huge advantage to me.

Great! I'll look at the v2 patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hou, Zhijie 2021-02-10 10:22:23 RE: Parallel INSERT (INTO ... SELECT ...)
Previous Message Amit Langote 2021-02-10 10:13:01 Re: Parallel INSERT (INTO ... SELECT ...)