Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Salma El-Sayed <salmasayed182003(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Date: 2026-06-11 19:18:43
Message-ID: CA+TgmobHTwvswhMx=fabNSHQ47Jgzhaafn1yOXPUHYksiuoORg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 8, 2026 at 10:40 AM Salma El-Sayed
<salmasayed182003(at)gmail(dot)com> wrote:
> A forward scanner that arrives at L after the merge sees BTP_MERGED_AWAY and follows through to R.
> A backward scanner that arrives at R after the merge sees BTP_MERGED, reads R (which now contains L's data), and skips L entirely.

This seems OK for the first merge, but I think you need to be a lot
more explicit about what's going to happen after that. For instance,
what if you want to perform a merge on a page that is already marked
BTP_MERGED?
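To make the single-merge rules quoted above concrete, here is a toy
model in C. Everything in it is hypothetical and far simpler than
nbtree's real pages: ToyLeaf, ToyScan, and the merged / merged_away
fields, which stand in for BTP_MERGED and BTP_MERGED_AWAY. It assumes
keys are unique, as they effectively are once the heap TID acts as a
tiebreaker. It also assumes that the forward-scan "skip tuples we may
have already read" logic works by dropping any key at or below the
last key the scan returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_KEYS 16

/* Hypothetical stand-in for an nbtree leaf page; not the real layout. */
typedef struct ToyLeaf
{
    bool merged_away;           /* stands in for BTP_MERGED_AWAY */
    bool merged;                /* stands in for BTP_MERGED */
    int nkeys;
    int keys[TOY_MAX_KEYS];     /* ascending and unique */
    struct ToyLeaf *left;
    struct ToyLeaf *right;
} ToyLeaf;

/* Per-scan state: the keys returned so far, in the order returned. */
typedef struct ToyScan
{
    int nreturned;
    int returned[64];
} ToyScan;

static void
toy_emit(ToyScan *scan, int key)
{
    assert(scan->nreturned < 64);
    scan->returned[scan->nreturned++] = key;
}

/* Merge L into its right sibling R: L's keys go in front of R's. */
static void
toy_merge_left_into_right(ToyLeaf *l, ToyLeaf *r)
{
    assert(l->right == r && l->nkeys + r->nkeys <= TOY_MAX_KEYS);
    memmove(r->keys + l->nkeys, r->keys, (size_t) r->nkeys * sizeof(int));
    memcpy(r->keys, l->keys, (size_t) l->nkeys * sizeof(int));
    r->nkeys += l->nkeys;
    l->nkeys = 0;
    l->merged_away = true;
    r->merged = true;
}

/*
 * Forward scan: a merged-away page is empty, so just follow through to
 * its right sibling.  On a merged page, skip keys at or below the last
 * key this scan returned, since they may have been read from L already.
 */
static ToyLeaf *
toy_step_forward(ToyScan *scan, ToyLeaf *page)
{
    if (!page->merged_away)
    {
        for (int i = 0; i < page->nkeys; i++)
        {
            int key = page->keys[i];

            if (page->merged && scan->nreturned > 0 &&
                key <= scan->returned[scan->nreturned - 1])
                continue;
            toy_emit(scan, key);
        }
    }
    return page->right;
}

/*
 * Backward scan: read the whole page in descending order.  After a
 * merged page, hop over a merged-away left sibling, because its keys
 * were just read from this page.
 */
static ToyLeaf *
toy_step_backward(ToyScan *scan, ToyLeaf *page)
{
    ToyLeaf *next = page->left;

    for (int i = page->nkeys - 1; i >= 0; i--)
        toy_emit(scan, page->keys[i]);
    if (page->merged && next != NULL && next->merged_away)
        next = next->left;
    return next;
}
```

In this model, a forward scan that read L just before the merge sees
each key exactly once when it reaches R. A backward scan that arrives
at R after the merge reads L's keys there and never stops on L. This
is the first-merge case only.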

Or, for example, what happens if more splits happen after the merge?
Say we have page A followed by page B, and the merge marks A
BTP_MERGED_AWAY and B BTP_MERGED. Suppose that at the time this
happens, a scan is positioned on page A. Before the scan advances to
page B, B gets split, so now we have A (BTP_MERGED_AWAY), B0 (???),
and B1 (???). Since we've already read page A, we need to apply the
logic that skips tuples we may have already read to both B0 and B1:
each holds some of B's tuples, and B's tuples now include everything
we already read from A. So presumably, to make that work, we need to
mark both B0 and B1 as BTP_MERGED.
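The split-after-merge case can be modeled with a toy leaf page that
carries only a BTP_MERGED stand-in. ToyLeaf, toy_split, and
toy_read_forward are hypothetical names, not nbtree code. The sketch
assumes keys are unique and that the skip logic drops any key at or
below the last key the scan returned. Splitting B at a point inside
A's old key range shows why the new right half needs the flag as well
as the original page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_KEYS 16

/* Hypothetical leaf page; 'merged' stands in for BTP_MERGED. */
typedef struct ToyLeaf
{
    bool merged;
    int nkeys;
    int keys[TOY_MAX_KEYS];     /* ascending and unique */
} ToyLeaf;

/* Per-scan state: the keys returned so far, in the order returned. */
typedef struct ToyScan
{
    int nreturned;
    int returned[64];
} ToyScan;

/*
 * Split 'page' so that keys[at..] move to the new right half.  The left
 * half keeps whatever flag 'page' had; whether the right half inherits
 * the merged flag is controlled by 'propagate_merged'.
 */
static void
toy_split(ToyLeaf *page, ToyLeaf *right, int at, bool propagate_merged)
{
    assert(at >= 0 && at <= page->nkeys);
    right->nkeys = page->nkeys - at;
    memcpy(right->keys, page->keys + at, (size_t) right->nkeys * sizeof(int));
    page->nkeys = at;
    right->merged = propagate_merged && page->merged;
}

/* Forward read that skips already-returned keys, but only on merged pages. */
static void
toy_read_forward(ToyScan *scan, const ToyLeaf *page)
{
    for (int i = 0; i < page->nkeys; i++)
    {
        int key = page->keys[i];

        if (page->merged && scan->nreturned > 0 &&
            key <= scan->returned[scan->nreturned - 1])
            continue;
        assert(scan->nreturned < 64);
        scan->returned[scan->nreturned++] = key;
    }
}
```

Suppose B held keys 1..6 after absorbing A's 1..3, and it is split
between 2 and 3. A scan that already returned 1..3 from A then gets
4, 5, 6 from B0 and B1 when the flag is propagated. Without
propagation, it returns key 3 a second time from B1.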

But if we do that, then there's no longer a 1:1 relationship between
BTP_MERGED_AWAY pages and BTP_MERGED pages. When we come to a
BTP_MERGED page, we don't know if it corresponds to some
BTP_MERGED_AWAY page we previously encountered, or some other
BTP_MERGED_AWAY page from long ago.

I'm not certain, but I am suspicious that using flag bits for this is
not going to work out. Maybe a flag bit is OK for the page that is
going away, because then it eventually transitions to half-dead like
you said. But for the surviving page, if that's just indicated by it
being marked BTP_MERGED, then eventually we can just end up with tons
of BTP_MERGED pages in the index and nothing to unset those
bits. That's probably going to break something; if it doesn't, then it
seems unclear that BTP_MERGED needs to exist in the first place. I
feel like we might need to mark the surviving page using some kind of
indicator that "times out," like an XID or something, so that we don't
have to go back and clear BTP_MERGED flags later. But I don't really
know.
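Here is one possible reading of the "times out" idea. Instead of
setting a permanent flag, stamp the surviving page with the next XID
at merge time. Scans then ignore the stamp once every snapshot that
could predate the merge is gone. This is similar in spirit to how
nbtree waits before a deleted page can be recycled. However, the stamp
field, the oldest_xmin argument, and the helper names below are all
hypothetical. The sketch also assumes that any scan which could have
been positioned on the merged-away page holds a snapshot whose xmin
counts toward that horizon.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit XID, in the spirit of FullTransactionId (no wraparound). */
typedef uint64_t ToyFullXid;

/* Hypothetical per-page merge stamp kept on the surviving page. */
typedef struct ToyMergeStamp
{
    bool valid;
    ToyFullXid merge_xid;       /* next XID as of the most recent merge */
} ToyMergeStamp;

/*
 * Record a merge.  Because the next XID only moves forward, a later merge
 * just raises the stamp, which still covers scans that began before the
 * earlier merge.
 */
static void
toy_stamp_merge(ToyMergeStamp *stamp, ToyFullXid next_xid)
{
    if (!stamp->valid || next_xid > stamp->merge_xid)
        stamp->merge_xid = next_xid;
    stamp->valid = true;
}

/* A split leaves the stamp on the left half and copies it to the new right half. */
static void
toy_split_stamp(const ToyMergeStamp *page, ToyMergeStamp *right)
{
    *right = *page;
}

/*
 * Must scans still apply the skip-already-read logic on this page?
 * A snapshot taken before the merge has xmin <= merge_xid, so once the
 * oldest xmin among running snapshots is past merge_xid, no such scan
 * remains and the stamp can be ignored; nothing ever has to clear it.
 */
static bool
toy_stamp_is_live(const ToyMergeStamp *stamp, ToyFullXid oldest_xmin)
{
    return stamp->valid && stamp->merge_xid >= oldest_xmin;
}
```

With this scheme, merging into a page that is already stamped just
moves the stamp forward, and a later split copies it to both halves.
There is no longer any need for a 1:1 pairing with a specific
BTP_MERGED_AWAY page.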

> How should the merge process be triggered?

This seems really tricky. I think if the user has to manually run a
"try to merge pages" command or function, this functionality won't get
used very much. Ideally it would happen either automatically during
foreground operation, or as part of VACUUM. But that seems complicated
to make work, because there's a risk of merging pages too
aggressively, which could not only waste work but result in them being
split again soon afterward.
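To illustrate the "too aggressively" concern, a merge check run from
VACUUM or from foreground code could compare sibling fill levels
against two thresholds with a gap between them. That gap gives the
merged page headroom before it would need to split again. The policy
struct, both thresholds, and the function below are hypothetical, and
nothing like them exists in PostgreSQL today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical merge policy.  Keeping target_fill well below 1.0 leaves
 * room for new insertions, so a freshly merged page is not immediately
 * split again.
 */
typedef struct ToyMergePolicy
{
    double low_water;           /* a sibling below this fill is a candidate */
    double target_fill;         /* combined fill must not exceed this */
} ToyMergePolicy;

static bool
toy_should_merge(const ToyMergePolicy *pol, size_t capacity,
                 size_t left_used, size_t right_used)
{
    assert(capacity > 0);

    double left_fill = (double) left_used / (double) capacity;
    double right_fill = (double) right_used / (double) capacity;
    double combined = (double) (left_used + right_used) / (double) capacity;

    /* Only bother when at least one sibling is nearly empty ... */
    if (left_fill >= pol->low_water && right_fill >= pol->low_water)
        return false;
    /* ... and the merged page would still have plenty of free space. */
    return combined <= pol->target_fill;
}
```

With low_water = 0.2 and target_fill = 0.6 on an 8000-byte page, a
pair using 800 and 3000 bytes qualifies. A pair using 800 and 4400
does not, and neither does a pair using 3000 and 3000. Picking those
values well is still the hard part.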

--
Robert Haas
EDB: http://www.enterprisedb.com
