Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

From: Salma El-Sayed <salmasayed182003(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kirk Wolak <wolakk(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Date: 2026-06-30 17:36:54
Message-ID: CANBEAPFnx4eTOjNWnmtzoEYkZimSYC-7A2ouk76VcpzcfTrtNA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

We want to share a new design we are working on for handling page merges.

Instead of using BTP_MERGED_AWAY simply as a cache, this design
repurposes it to direct scans for merge recovery and to hold the
merge_xid in the same way deleted pages hold safexid.

This design is based on the fact that the currPos.items array is only
overwritten by _bt_readpage, meaning the data remains intact until the
next _bt_readpage call for the subsequent page.

The core changes are centralized in _bt_readnextpage.

The following fields are added to BTScanOpaqueData:
bool skipMergeRecovery;
bool needMergeRecovery;
ItemPointerData savedMergeTids[MaxTIDsPerBTreePage];
int nSavedMergeTids; /* number of TIDs in the array */

Here is how the scan logic works across a sequence of pages:

*FORWARD SCAN*

Scenario: L(MA) <--> R(M) <--> A(M) <--> B(MA) <--> C(HD) <--> D(M)
(MA = MERGED_AWAY, M = MERGED, HD = HALF_DEAD)

subsequent MERGED pages happen due to the split of page R.

1) The scan steps into L and finds the MA flag. It skips L and sets
skipMergeRecovery = true.

2) The scan steps into R and finds the M flag. It checks skipMergeRecovery:
a) If true: It reads R as a normal page. Page A is treated the same way.
b) If false:
- currPos.items still holds the tuples read from L. We save
these items into savedMergeTids (we only do this step for the first M
page in the merge group).
- Read R by calling _bt_readpage.
- Loop through currPos.items for R and remove any item that
exists in savedMergeTids.
- Return tuples to the caller if any remain after filtering.
Otherwise, continue to the next M page in the group.

3) The scan steps into B (MA) and does the same as L.
4) The scan steps into C (HD) and skips it.
5) The scan steps into D (M) and does the same as R.

skipMergeRecovery is reset to false as soon as the scan steps onto a
normal (non-M, non-MA) page, making the scan ready for the next merge
group.

*BACKWARD SCAN*

Scenario: L(MA) <--> R(M) <--> A(M) <--> B(M) <--> C(normal page)

Definitions:
- skipMergeRecovery: The scan read a M page after a merge occurred,
requiring no recovery.
- needMergeRecovery: The scan read a M page before the merge, then
stepped into an MA page. The flag is set to true to start recovery.
Stepping into an MA page again while true means recovery is finished.

1) The scan steps into L:
a) If skipMergeRecovery OR needMergeRecovery is true: The scan read
merged pages and will skip this MA page. Reset both booleans to be
ready for the next merged group.
b) If neither is true: The scan read R before the merge, and
currPos.items still has the data read from it.
- Save these items into savedMergeTids.
- Walk forward until reaching C (the first page with no M flag).
- Set blkno = B, and lastcurrblkno = C (needed to recover from
splits in the backward scan).
- The scan continues walking backward from B.

2) The scan steps into B, A, or R:
a) If needMergeRecovery is false: The simplest case. Read the page
normally and set skipMergeRecovery = true, so that when the scan steps
left onto the MA tombstone, it knows to skip it cleanly.
b) If needMergeRecovery is true: The scan is in merge recovery.
- Call _bt_readpage to read the page.
- Loop through currPos.items and remove any item that exists in
savedMergeTids.
- Return tuples to the caller if any remain after filtering.
Otherwise, continue to the next M page in the group.

*VACUUM AND FLAG LIFECYCLE*

MERGED pages cannot be merged again until the flag is reset.

The MERGED_AWAY flag serves as a separator between merged groups and
directs scans to recover from the merge operation. Therefore,
MERGED_AWAY must stick around until all MERGED flags in its merge
group are reset.

Any MERGED page can be deleted at any point if VACUUM decides to
delete it, without breaking the algorithm. It is impossible to have a
normal page between MA and M pages. Having an HD page in between is
not an issue as long as the MERGED flag is never reset in that HD
page.

We can use the MERGED_AWAY page to save merge_xid as the page content,
exactly the same way safexid works for deleted pages.

When VACUUM steps into a MERGED_AWAY page and the transaction horizon
has passed, it MUST reset all MERGED flags first. VACUUM must locate
the end of the MERGED group, then walk backward to reset all pages
(similar to how the backward scan recovers). This guarantees a normal
page will never exist between MA and M pages.

* Resetting a BTP_MERGED flag means clearing it from btpo_flags,
leaving the page as a standard BTP_LEAF page again.
* Resetting a BTP_MERGED_AWAY will make it HALF_DEAD.

Best,
Salma El-Sayed

On Fri, Jun 26, 2026 at 6:50 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, Jun 26, 2026 at 10:54 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > From a certain point of view, the suspicion that the algorithm may be
> > incorrect is not at all surprising. The fact that we never merge
> > nbtree pages to improve space utilization is a limitation we've had
> > since I first got involved with PostgreSQL which was, at the risk of
> > dating myself, a long time ago. The fact that nobody's fixed yet means
> > it's probably really hard.
>
> Page deletion is easily the most complicated part of the nbtree
> design. That is a natural consequence of the absurdly permissive rules
> for searchers under the Lehman & Yao design. Searchers never need to
> lock more than one page at a time, on any level. But they can never
> have an irredeemably bad picture of the tree structure, even with
> concurrent page deletions.
>
> In short, the eye-watering complexity of page deletion is the exact
> cost you pay to completely avoid lock coupling with the Lehman & Yao
> design. If you completely ignore page deletion (as in the original L&Y
> paper), the concurrency rules are simple and intuitive. Most
> individual nbtree code paths effectively do just that: they pretend
> page deletion does not exist. Searchers can just move right (as with a
> concurrent page split) when they land on a deleted page. Everything
> works out because the real complexity is confined to the page
> deletion/page recycling code.
>
> Confining the complexity to the page deletion code is a very useful
> property, and one that I *really* don't want to give up. I'm not
> prepared to say that it could never make sense, but the bar for
> accepting such a change is *very* high.
>
> > I suspect everyone here would love to see a solution to this problem,
> > but I would personally be beyond impressed if someone managed to come
> > up with even a working prototype for this in one summer. Even a
> > respectable amount of progress on figuring out an algorithm would be a
> > good result, IMHO.
> >
> > I don't know what the rules for GSoC are, but if it's still possible
> > to switch to a different project, that might be worth considering.
>
> +1
>
> --
> Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2026-06-30 17:37:57 Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE
Previous Message Noah Misch 2026-06-30 17:30:53 Wrong query result w/ propgraph single lateral col reference