Re: [WiP] B-tree page merge during vacuum to reduce index bloat

From: Kirk Wolak <wolakk(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>
Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat
Date: 2025-08-26 14:11:43
Message-ID: CACLU5mRude0L5psEj5WS0DVDv=AHN0McfZBKV5eBoW0JqwwZDA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 26, 2025 at 6:33 AM Matthias van de Meent <
boekewurm+postgres(at)gmail(dot)com> wrote:

> On Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
> >
> > Hi hackers,
> >
> > Together with Kirk and Nik we spent several online hacking sessions to
> tackle index bloat problems [0,1,2]. As a result we concluded that B-tree
> index page merge should help to keep an index from pressuring
> shared_buffers.
> >
> > We are proposing a feature to automatically merge nearly-empty B-tree
> leaf pages during VACUUM operations to reduce index bloat. This addresses a
> common issue where deleted tuples leave behind sparsely populated pages
> that traditional page deletion cannot handle because they're not completely
> empty.
> >
> ...
> I'm fairly sure there is a correctness issue here; I don't think you
> correctly detect the two following cases:
>
> 1.) a page (P0) is scanned by a scan, finishes processing the results,
> and releases its pin. It prepares to scan the next page of the scan
> (P1).
> 2.) a page (A) is split, with new right sibling page B,
> 3.) and the newly created page B is merged into its right sibling C,
> 4.) the scan continues on to the next page
>
> For backward scans, if page A is the same page as the one identified
> with P1, the scan won't notice that tuples from P1 (aka A) have been
> moved through B to P0 (C), causing the scan to skip processing for
> those tuples.
> For forward scans, if page A is the same page as the one identified
> with P0, the scan won't notice that tuples from P0 (A) have been moved
> through B to P1 (C), causing the scan to process those tuples twice in
> the same scan, potentially duplicating results.
>
> NB: Currently, the only way for "merge" to happen is when the index
> page is completely empty. This guarantees that there is no movement of
> scan-visible tuples to pages we've already visited/are about to visit.
> This invariant is used extensively to limit lock and pin coupling (and
> thus: improve performance) in index scans; see e.g. in 1bd4bc85. This
> patch will invalidate that invariant, and therefore it will require
> (significantly) more work in the scan code (incl. nbtsearch.c) to
> guarantee exactly-once results + no false negatives.
>
> Kind regards,
>
> Matthias van de Meent
> Databricks
>

This was one of our concerns. We will review the patch mentioned.
I do have a question, one of the IDEAS we discussed was to ADD a new page
that combined the 2 pages.
Making the deletion "feel" like a page split.

This has the advantage of leaving the original 2 pages alone for anyone who
is currently traversing.
And like the page split, updating the links around while marking the pages
for the new path.

The downside to this approach is that we are "adding 1 page to then mark 2
pages as unused".

Could you comment on this secondary approach?

Thanks in advance!

Kirk

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ken Marshall 2025-08-26 14:15:13 Re: Non-reproducible AIO failure
Previous Message Tom Lane 2025-08-26 14:09:56 Re: Feature request: A method to configure client-side TLS ciphers for streaming replication