Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Kirk Wolak <wolakk(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Salma El-Sayed <salmasayed182003(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Date: 2026-07-03 19:14:36
Message-ID: A7BEE70F-ED6B-4998-9ACC-26DA81983DA4@yandex-team.ru
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello everyone,

First, thank you for the v2 writeup. Reframing BTP_MERGED_AWAY as a
tombstone that carries merge_xid "the same way deleted pages hold
safexid" is seems the right thing to me, and I think it can be pushed
further than the current draft does. Two observations, both aimed at
shrinking the design rather than adding to it.

1. Forward dedup needs one watermark, not a saved TID set
---------------------------------------------------------

The current forward-scan recovery keeps a per-page TID array:

> The following fields are added to BTScanOpaqueData:
> ItemPointerData savedMergeTids[MaxTIDsPerBTreePage];
> ...
> - Loop through currPos.items for R and remove any item that
> exists in savedMergeTids.

and Matthias already flagged the cost of carrying that around:

> I think parallel btree scans need the [skip, need]MergeRecovery
> flags, and possibly the item pointers themselves, too? I think that
> could be a rather large amount of data for the parallel state.

I don't think a set is needed at all for the forward direction. Since
v12 the leaf level is totally ordered by (key, heapTID) -- heap TID is
the tiebreaker in the suffix. So everything a forward scan has already
returned is <= the last (key, heapTID) it emitted. When it steps onto
a page that physically holds its already-seen tuples (post-merge R =
old-L + old-R), it can just resume from the first item strictly
greater than that single last-returned (key, heapTID). One item
pointer plus its key, not MaxTIDsPerBTreePage of them. (Or even just
find border by heapTID if it's not frequent operation.)

Two things fall out of this that I find attractive:

- It degrades to a no-op in the normal case. A normal right sibling
begins above the left page's high key, i.e. above the watermark, so
"skip <= watermark" skips nothing. The skip only fires on a merged
right-step, which is exactly the overlap we want to suppress. So
this can be a uniform rule ("resume the next page after the last
returned (key, heapTID)") rather than a merge special case.
If we know that page is not merged - we can optimize out this skipping.

- It handles "start the scan on a MERGED page", which was left open:

> How do you detect it is the first page in the merge group? What if
> you land on an M page after descending the tree?

A scan that descended to R via the parent carries no watermark and
simply reads R whole -- correct, because it never returned old-L.
The discriminator for "dedup or not" is how you arrived at R (a
right step carrying a watermark vs a downlink descent), which the
scan already knows. It is not a property of the page, so the M flag
is not even required for forward correctness.

Backward direction still reduces to a single watermark (the low end),
but the hard part there is navigation/left-links, not the size of the
dedup state, so it needs separate care. And the whole thing assumes an
ordered MVCC scan; unique checks / SnapshotDirty paths are a different
conversation.

2. MERGED_AWAY looks like the pre-existing half-dead/deleted lifecycle
----------------------------------------------------------------------

You already say:

> * Resetting a BTP_MERGED_AWAY will make it HALF_DEAD.

and

> ... hold the merge_xid in the same way deleted pages hold safexid.

I'd take that literally. L is empty the instant it becomes
MERGED_AWAY; its values are gone; the only reason it must stay
non-recyclable is that a scan holding a pointer to it can still step
right through it -- which is precisely the deleted-page tombstone
invariant Peter described:

> Using XIDs for BTPageGetDeleteXid is just a convenient (though very
> conservative) way to implement "the drain technique". [...] We must
> always ensure that such a backend at least lands on a page marked
> deleted/a tombstone page and then recovers by moving right.

So MERGED_AWAY == a deleted/tombstone page whose safexid is the merge
time. If we treat it that way, the genuinely new state is only
BTP_MERGED on the surviving page during the drain window; L reuses the
existing HALF_DEAD -> DELETED(safexid) -> recyclable machinery
verbatim.

That also lets us avoid stacking two horizons. The current plan is,
as Kirk put it:

> the visibility horizon on a future delete must be after the
> visibility horizon on the cleanup process.

i.e. one horizon to clear the M flags, then a second (safexid) horizon
to recycle L -- which is the "one extra vacuum horizon" reclamation
penalty Matthias called out. But both horizons protect the same
population: scans that predate the merge. If L is unlinked at merge
time with safexid = merge time, clearing R's MERGED flag and recycling
L are gated by one and the same horizon, and the penalty goes away.

Net: I think the forward path can drop savedMergeTids entirely, and
the L-side lifecycle can be the page-deletion lifecycle we already
have, leaving BTP_MERGED on R as the one genuinely new concept to
reason about. That feels much closer to Peter's "confine the
complexity to the page deletion code" bar. I'm not certain about
the backward direction under this framing.

WDYT?

Thank you! Fascinating project!

Best regards, Andrey Borodin.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zsolt Parragi 2026-07-03 19:54:14 Re: uuidv7 improperly accepts dates before 1970-01-01
Previous Message Sami Imseih 2026-07-03 18:35:53 Re: explain plans for foreign servers