Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

From: Kirk Wolak <wolakk(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Salma El-Sayed <salmasayed182003(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Date: 2026-07-01 15:17:57
Message-ID: CACLU5mR2Rt+LUFe9-UFK2GUGid-tpoSknRbHCstkRd837Uef6w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 26, 2026 at 10:54 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Jun 23, 2026 at 4:46 PM Kirk Wolak <wolakk(at)gmail(dot)com> wrote:
> >> (Today's lesson for her, is how to defend your position, while being
> prepared for the public beat down (lol) where you might be wrong. Be wrong
> out loud! You learn faster!)
> >
> ... The fact that they care
> enough to respond is a good sign, not a bad one.
>
> ...
>
> I suspect everyone here would love to see a solution to this problem,
> but I would personally be beyond impressed if someone managed to come
> up with even a working prototype for this in one summer. Even a
> respectable amount of progress on figuring out an algorithm would be a
> good result, IMHO.
>

Okay, first, thanks for the thoughtful response(s) from everyone!

As for changing the GSoC project: nah... We don't mind hard. Our
fallback position has always been that the project will be successful
if it changes the discussion, makes this "SEEM" possible, and leaves
behind the "currently unstated requirements".

Also, we have almost 7 full weeks until the mid-terms. At this point we
have a rough prototype, and we will have the Vacuum Cleanup process
before the mid-term (it's a 22-week project, ending in November). We are
crossing our Ts in this process, but will provide some PoC code to garner
feedback once the vacuum and the complete life cycle are handled.

To be clear, our goal is the 80/20 level implementation: something that
is provably correct, has very little impact on normal scans, etc.
We only work on LEAF nodes, because that's where the 80/20 lies for
now. Honestly, with all of our restrictions, it will take MANY passes to
remove all of the bloat we can. The upside of this approach was
ACCIDENTAL: it probably fits better INSIDE of autovacuum! (Again,
beyond our current scope.)

We have a lot of hurdles to clear, many of which we probably do not even
know about yet. But the feedback so far has helped us to rethink, rewrite,
and improve the logic (no ghost records on the L page, no extra data
required on the BTP_MERGED pages).

Also, we are assuming that we will not delete BTP_MERGED[_AWAY] pages
until the "Family" of leaf nodes has been reset back to normal. TBH, the
visibility horizon on a future delete must be after the visibility horizon
on the cleanup process. (Currently, the existence of a half_dead page in
the middle would break assumptions.) But since it is vacuum's job to reset
these to normal pages, as it does so, it can also start marking them
half_dead and ultimately remove them. (We are still studying Vacuum, so if
I sound like I am speaking in tongues to those who KNOW better... forgive
me. It's a high-level statement.)

Suffice it to say, we are quite happy with the progress so far.

Kirk
