Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Kirk Wolak <wolakk(at)gmail(dot)com>
Cc: Salma El-Sayed <salmasayed182003(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions
Date: 2026-06-23 17:52:57
Message-ID: CAEze2WjjtHM-c3-Ax9De-r2D9NAHmsU-3dhFp5rsiJTszf7ptA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 23 Jun 2026 at 17:57, Kirk Wolak <wolakk(at)gmail(dot)com> wrote:
>
> On Fri, Jun 12, 2026 at 11:07 AM Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> wrote:
>>
>> On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003(at)gmail(dot)com> wrote:
>> ...
>> > VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
>> > The entries inside are not live data, they are ghost copies cached for
>> > exactly one case:
>> > a backward scan that was positioned between R and L at the moment of the merge.
>> > No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
>> > skip via the right link.
>> > New backward scans already read R (which now holds L's data) before
>> > arriving at L, so they skip L too.
>> > TID safety is guaranteed by MergeXID. The only scanner that can reach
>> > L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
>> > guarantees that the heap rows those TIDs point to cannot be recycled
>> > while any transaction predating MergeXID is still active, because
>> > those rows are still visible to that transaction.
>>
>> I don't think that's accurate. The index is not guaranteed to just
>> contain live or recently-dead rows: it is almost certainly going to
>> contain references to rows which are already dead to every session.
>> This is what VACUUM removes when it goes through index cleanup with
>> its calls into index_bulk_delete.
>>
>> It is important to also note that it's not guaranteed that all current
>> dead rows are being cleaned up by a current index_bulk_delete() call -
>> if insufficient maintenance_work_mem was allocated, then only a subset
>> of currently dead rows will be included in the current
>> index_bulk_delete() call, and it's likely that cleanup will happen
>> again very soon.
>>
>> So, if you merge a page that contains references to dead rows (which
>> you can't necessarily know when merging), the BTP_MERGED_AWAY page
>> might therefore contain references to rows which are (soon) ALL_DEAD
>> and would be cleaned up in the next VACUUM scan. So, if the page is
>> BTP_MERGED_AWAY with dead tuples that have been removed everywhere
>> else but whose tuples aren't cleaned up by vacuum by principle, then
>> that's an issue for scanners that took very long to continue on to
>> that page.
>
>
> Allow me to step in on this part of the conversation. BTP_MERGED_AWAY is mostly a FLAG page.
> That is really like a pre- BTP_HALF_DEAD page. NOBODY should be interested in the contents.
> Especially Vacuum. No NORMAL scans should read those contents.
>
> We are leaving the contents on this "L" page (of L,R, and L+R notation), for the 2 VERY specific Edge cases.
> 1) FWD Scans that already read "L", and finds the "L+R" instead of R. (it was caught in the middle due to locks, etc).
> It "knows" there are ghost copies left behind on "L" (BTP_MERGED_AWAY), and it goes back, reads them,
> and filters them out of the L+R results it just processed so the L records are NOT duplicated.
> 2) BWD Scans that read "R" before the merge, then finds "L" (BTP_MERGED_AWAY). It knows it did NOT
> read the L+R page. So, like the previous logic, it can safely read the "L" records.
>
> outside of those two cases, which are clearly edge cases. NOBODY should see those "ghost" records.

My point is this: It is *exactly* a problem if BWD scans read L's data
if L's data is ignored by vacuum.

Consider this situation: L contains TIDs 1, 2, 3; R contains TIDs 4, 5, 6.

1. Session 1: Backward Index-Only scan finishes processing page R
2. Session 2: Deduplicates L + R, into R. L is marked BTP_MERGED_AWAY,
and contains remnants for (1, 2, 3). R now contains (1, 2, 3, 4, 5,
6), and is marked BTP_MERGED
3. Session 3: Vacuums the backing table, finds that TIDs 2, 3, 4 are
dead, and starts cleaning up indexes.
4. Session 3: As part of vacuum, cleans up page R, which now contains
only (1, 5, 6). L is untouched; contains (1, 2, 3).
5. Session 3: After finishing index cleanup; the LP_DEAD items are
removed from the heap, and all pages are marked all-visible;
6. Session 1: Continues the index-only scan. Finds page L, and returns
the remnants on that page L (1, 2, 3) to the IOS.
7. Session 1: IOS internally checks the visibility of the returned
tuples, finds the pages of those tids to be all-visible (see (5)), and
returns the TIDs and values associated with TIDs 1, 2, and 3, to the
higher scan node.

Note that in step (3), TIDs 2 and 3 are dead, and in step 5 they were
completely removed -- there shouldn't be any index scans that still
return those tuples (*). For all that the table knows, TIDs (2, 3, 4)
don't exist anymore. In step 7, however, they're suddenly visible,
because they weren't cleaned up from L by VACUUM.

To fix this; VACUUM *must* process page L, and remove the references
to TIDs 2 and 3, or otherwise mark those TIDs as "not to be returned
by scans", so that in step 6 the scan doesn't return TIDs 2 and 3 to
the IOS layer above.

(*) For Index-Only scans the original statement is accurate, but
normal Index scans (not IOS) will be able to return tuples after that
cleanup happened. However, instead of using the visibilitymap, those
scans will check the visibility of the rows on the heap page directly,
and will ignore (skip) invisible/missing line pointers.

> We are using them just to make the algorithm clean and efficient without having to add something elsewhere
> to handle the fixups.
>
> Back to the question of VACUUM. And where your insight might be trying to help us from making a terrible mistake.
> My understanding is that there can be MANY "extra" references in any index to records that have been LONG deleted.
> And when the Table is read, those are removed from the result set (I believe it's why index-only scans are limited to
> mostly clean tables, freshly vacuumed, etc).

Strictly speaking that's right, but indexes like btree, gist, spgist,
and hash, all have a guarantee that any tuple only has a single
location in the index at any point in time.
Scans may see the same TID pointing to a row (or the same TID pointing
to different versions of a row, or TIDs pointing to different versions
of different rows in case of tid reclamation through vacuum) several
times, but only one of those pointed-to rows can be visible to the
scan. The reason why it can see the same tid value several times is
because it takes time to traverse the index, and because the scan
doesn't have a snapshot view of the whole index. A scan only looks at
parts of the index at once, and the same tid value may be present at
different locations at different times, but only ever at one place in
the index at a time.

> So, our assumption is that, if we envision a RACE condition, where VACUUM is trying to remove the INDEX references.
> BUT our scan is already past that BLOCK. Our scan does not restart, know or care that vacuum wants to remove them
> from the index (there is no "fixup" process required for a scan). And that's our argument for keeping VACUUM ignorant
> of anything on BTP_MERGED_AWAY. Because NO "normal" scan will read those contents. Only one that was likely
> blocked by our activity, and needs to "fix itself" (to avoid dupes or skips).

I think this argument is faulty; if only because scans (and queries
more generally) are not guaranteed to immediately run to completion
without stopping. The perfect example for that is an index scan that's
producing data for a cursor: Cursors may be paused for any amount of
time between any two amgettuple() calls.

> Any scan that runs after our lock doesn't care.

Any scan that *starts* after the lock. Scans that started before the
lock but continue running after the lock *will* care.

And I'm not so sure about your claim that *any* scan doesn't care.
With scan boundary keys I think you may be right, but I'm yet to be
convinced that boundary keys provide sufficient performance to get
this into the tree.

> If they ran before our lock, and finished, they don't care. The 2 edge cases care... Because it causes Skips (FWD scan),
> and Dupes (BWD scan), otherwise.

Indeed.

> We even avoid using BTP_HALF_DEAD in our case, to "keep this flag" for the edge cases.
> Technically if we could KNOW there are no more edge cases alive, we could have VACUUM always force this
> BTP_MERGED_AWAY into BTP_HALF_DEAD status. That was our original thought process. But you see that
> we cannot know we have an edge case... Until... No queries could possibly see past our transaction marker.

Why do you need to read the TIDs from the MERGED_AWAY page? Can't you
keep just (and only just) the MERGED_AWAY status + its original key
range around, and use that to extract the moved tuples from the
_MERGED page (... and its successors in case of page splits of that
page)? It is probably a bit (lot) more effort, but it should avoid the
TID recycling issue mentioned above, in cases of backward scans.

> For now, for testing and validation... We want to get it working first, and have it take a few steps so we
> can watch the table "self-recover" from the merge with the Autovacuum.

What do you mean by "self-recover"? Recycle index pages and reduce in size?

> This is why you will hear Salma defend vacuum "not looking" at the old "L" records, certainly not changing them.
> They are REQUIRED to be the snapshot for what was merged... Only for the 2 edge cases.

I'm not convinced about that requirement being the right direction; we
can't afford to create a new known visibility bug just to fix known
scan edge cases for page merging.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2026-06-23 18:01:49 Re: Add a hook for handling logical decoding messages on subscribers.
Previous Message Peter Geoghegan 2026-06-23 16:24:52 Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions