Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

From: Kirk Wolak <wolakk(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Salma El-Sayed <salmasayed182003(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Date: 2026-06-23 15:57:20
Message-ID: CACLU5mRfdcNLruKbb+sqmripVLE=tQ8_OJRhNGbZZJyRXY=-QA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 12, 2026 at 11:07 AM Matthias van de Meent <
boekewurm+postgres(at)gmail(dot)com> wrote:

> On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003(at)gmail(dot)com>
> wrote:
> ...
> > VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
> > The entries inside are not live data, they are ghost copies cached for
> > exactly one case:
> > a backward scan that was positioned between R and L at the moment of the
> merge.
> > No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
> > skip via the right link.
> > New backward scans already read R (which now holds L's data) before
> > arriving at L, so they skip L too.
> > TID safety is guaranteed by MergeXID. The only scanner that can reach
> > L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
> > guarantees that the heap rows those TIDs point to cannot be recycled
> > while any transaction predating MergeXID is still active, because
> > those rows are still visible to that transaction.
>
> I don't think that's accurate. The index is not guaranteed to just
> contain live or recently-dead rows: it is almost certainly going to
> contain references to rows which are already dead to every session.
> This is what VACUUM removes when it goes through index cleanup with
> its calls into index_bulk_delete.
>
> It is important to also note that it's not guaranteed that all current
> dead rows are being cleaned up by a current index_bulk_delete() call -
> if insufficient maintenance_work_mem was allocated, then only a subset
> of currently dead rows will be included in the current
> index_bulk_delete() call, and it's likely that cleanup will happen
> again very soon.
>
> So, if you merge a page that contains references to dead rows (which
> you can't necessarily know when merging), the BTP_MERGED_AWAY page
> might therefore contain references to rows which are (soon) ALL_DEAD
> and would be cleaned up in the next VACUUM scan. So, if the page is
> BTP_MERGED_AWAY with dead tuples that have been removed everywhere
> else but whose tuples aren't cleaned up by vacuum by principle, then
> that's an issue for scanners that took very long to continue on to
> that page.
>

Allow me to step in on this part of the conversation. BTP_MERGED_AWAY is
mostly a FLAG page.
That is really like a pre- BTP_HALF_DEAD page. NOBODY should be interested
in the contents.
Especially Vacuum. No NORMAL scans should read those contents.

We are leaving the contents on this "L" page (of L,R, and L+R notation),
for the 2 VERY specific Edge cases.
1) FWD Scans that already read "L", and finds the "L+R" instead of R. (it
was caught in the middle due to locks, etc).
It "knows" there are ghost copies left behind on "L" (BTP_MERGED_AWAY), and
it goes back, reads them,
and filters them out of the L+R results it just processed so the L records
are NOT duplicated.
2) BWD Scans that read "R" before the merge, then finds "L"
(BTP_MERGED_AWAY). It knows it did NOT
read the L+R page. So, like the previous logic, it can safely read the "L"
records.

outside of those two cases, which are clearly edge cases. NOBODY should
see those "ghost" records.
We are using them just to make the algorithm clean and efficient without
having to add something elsewhere
to handle the fixups.

Back to the question of VACUUM. And where your insight might be trying to
help us from making a terrible mistake.
My understanding is that there can be MANY "extra" references in any index
to records that have been LONG deleted.
And when the Table is read, those are removed from the result set (I
believe it's why index-only scans are limited to
mostly clean tables, freshly vacuumed, etc).

So, our assumption is that, if we envision a RACE condition, where VACUUM
is trying to remove the INDEX references.
BUT our scan is already past that BLOCK. Our scan does not restart, know
or care that vacuum wants to remove them
from the index (there is no "fixup" process required for a scan). And
that's our argument for keeping VACUUM ignorant
of anything on BTP_MERGED_AWAY. Because NO "normal" scan will read those
contents. Only one that was likely
blocked by our activity, and needs to "fix itself" (to avoid dupes or
skips). Any scan that runs after our lock doesn't care.
If they ran before our lock, and finished, they don't care. The 2 edge
cases care... Because it causes Skips (FWD scan),
and Dupes (BWD scan), otherwise.

We even avoid using BTP_HALF_DEAD in our case, to "keep this flag" for the
edge cases.
Technically if we could KNOW there are no more edge cases alive, we could
have VACUUM always force this
BTP_MERGED_AWAY into BTP_HALF_DEAD status. That was our original thought
process. But you see that
we cannot know we have an edge case... Until... No queries could possibly
see past our transaction marker.

Now, it's at this point that I wonder if we can actually just DELETE the
BTP_MERGED_AWAY page, by VACUUM.
But what we prefer, is that it gets marked half dead, and the
TXNID_Block_No is used to flip the BTP_MERGED page(s)
that follow (because there could have been page splits). And all of those
pages that are past that horizon, WILL BE
flipped back to NORMAL pages (from BTP_MERGED).

For now, for testing and validation... We want to get it working first,
and have it take a few steps so we
can watch the table "self-recover" from the merge with the Autovacuum.

This is why you will hear Salma defend vacuum "not looking" at the old "L"
records, certainly not changing them.
They are REQUIRED to be the snapshot for what was merged... Only for the 2
edge cases.

Thanks for the comments. And for the record, I am one of the mentors on
this GSoC for Salma.
(Today's lesson for her, is how to defend your position, while being
prepared for the public beat down (lol) where you might be wrong. Be wrong
out loud! You learn faster!)

Kirk

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Baji Shaik 2026-06-23 16:02:57 Re: [PATCH] Warn when io_min_workers exceeds io_max_workers
Previous Message Diego 2026-06-23 15:57:03 libpq: decouple the .pgpass lookup port from the connection port