| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Salma El-Sayed <salmasayed182003(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions |
| Date: | 2026-06-12 15:06:58 |
| Message-ID: | CAEze2Wj5NU8bOwOYdWJ8K+ebCFuykdxKGOzVBdj=mpGnrJsH6g@mail.gmail.com |
| Lists: | pgsql-hackers |
On Thu, 11 Jun 2026 at 19:25, Salma El-Sayed <salmasayed182003(at)gmail(dot)com> wrote:
>
> Hi Matthias,
>
> Thanks for your email and these detailed answers and questions.
> Apologies for the delayed response, I am right in the middle of my
> university final exams right now.
>
> I am actively using your feedback to shape the design plan, but I
> wanted to go ahead and address a couple of your specific questions
> regarding BTP_MERGED_AWAY pages:
>
> > b. Deleting a BTP_MERGED_AWAY page
> > BTP_MERGED_AWAY pages keep a copy of their old tuples around,
> > which you mention are used by backward scans. This means its contents
> > must also be cleaned up by subsequent VACUUM runs, as the backwards
> > IOS may otherwise return TIDs that have been recycled and received new
> > indexed values. This cleanup can result in an empty page - which can
> > happen earlier than the XID horizon in the MergeID. Does this design
> > allow those pages to be reclaimed?
>
> VACUUM will be taught to ignore the contents of BTP_MERGED_AWAY pages.
> The entries inside are not live data, they are ghost copies cached for
> exactly one case:
> a backward scan that was positioned between R and L at the moment of the merge.
> No other reader ever sees them. Forward scans see BTP_MERGED_AWAY and
> skip via the right link.
> New backward scans already read R (which now holds L's data) before
> arriving at L, so they skip L too.
> TID safety is guaranteed by MergeXID. The only scanner that can reach
> L's "ghost copies" is one whose snapshot predates MergeXID. MVCC
> guarantees that the heap rows those TIDs point to cannot be recycled
> while any transaction predating MergeXID is still active, because
> those rows are still visible to that transaction.

I don't think that's accurate. The index is not guaranteed to contain
only live or recently-dead rows: it almost certainly also contains
references to rows which are already dead to every session. These are
what VACUUM removes when it goes through index cleanup with its calls
into index_bulk_delete().

It is also important to note that a single index_bulk_delete() call is
not guaranteed to clean up all currently dead rows: if insufficient
maintenance_work_mem was allocated, then only a subset of the currently
dead rows will be included in that call, and it's likely that cleanup
will happen again very soon.
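
To make the batching point concrete, here is a minimal toy sketch in
Python. It is not PostgreSQL code: dead_tid_batches(), toy_bulk_delete()
and the max_tids limit are hypothetical stand-ins (max_tids plays the
role of the memory limit), and the index is modelled as a plain list of
TIDs. It only illustrates that after the first pass the index can still
hold a reference to a row that is already dead.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# Simplified stand-in for a heap TID: (block number, offset number).
TID = Tuple[int, int]


def dead_tid_batches(dead_tids: Iterable[TID], max_tids: int) -> Iterator[Set[TID]]:
    """Yield dead TIDs in batches of at most max_tids entries.

    max_tids is an assumed stand-in for the memory limit on the
    dead-TID store; each batch corresponds to one index cleanup pass.
    """
    batch: Set[TID] = set()
    for tid in dead_tids:
        batch.add(tid)
        if len(batch) == max_tids:
            yield batch
            batch = set()
    if batch:
        yield batch


def toy_bulk_delete(index: List[TID], batch: Set[TID]) -> List[TID]:
    """Toy cleanup pass: drop only the TIDs that are in this batch."""
    return [tid for tid in index if tid not in batch]


index: List[TID] = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
dead: List[TID] = [(1, 1), (2, 1), (3, 1)]

# Record the index contents after every cleanup pass.
snapshots: List[List[TID]] = []
for batch in dead_tid_batches(dead, max_tids=2):
    index = toy_bulk_delete(index, batch)
    snapshots.append(list(index))
```

With room for only two dead TIDs per pass, the first pass leaves (3, 1)
in the index even though that row is already dead; only the second pass
removes it.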
So, if you merge a page that contains references to dead rows (which
you can't necessarily know when merging), the BTP_MERGED_AWAY page
might contain references to rows which are (soon) ALL_DEAD and which
would be cleaned up in the next VACUUM scan. If a BTP_MERGED_AWAY page
holds dead tuples that have been removed everywhere else, but that
vacuum by design never cleans up on that page, then that's an issue
for scanners that take very long to continue on to that page.
> Once MergeXID is no
> longer visible to any active transaction, the page transitions to
> HALF_DEAD normally. So VACUUM never needs to touch L's entries. The
> page header does all the work.
If a scan can return the items on a page (including BTP_MERGED_AWAY),
then vacuum *must* be able to clean up those items. If the index's
vacuum process doesn't clean up those items when requested, then the
index breaks the API contract of never returning TIDs that were
supposed to be removed in a completed ambulkdelete() call. In that
case, an index-only scan could encounter TIDs in the index that have
since been marked UNUSED by vacuum, and whose heap pages have since been
marked ALL_VISIBLE, resulting in the dead entries being returned to the user.
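
The contract problem can be sketched with a small toy model in Python.
This is not PostgreSQL code: ToyPage, toy_bulkdelete() and the
skip_merged_away flag are hypothetical names introduced only for
illustration, and the "ghost copies" on L are modelled as ordinary
items on a page flagged merged_away. The sketch compares a cleanup pass
that skips merged-away pages with one that does not.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

# Simplified stand-in for a heap TID: (block number, offset number).
TID = Tuple[int, int]


@dataclass
class ToyPage:
    """Toy leaf page: (key, TID) items plus a merged-away flag."""
    items: List[Tuple[str, TID]]
    merged_away: bool = False


def toy_bulkdelete(pages: List[ToyPage], dead: Set[TID], skip_merged_away: bool) -> None:
    """Toy cleanup pass that removes items pointing at dead TIDs.

    When skip_merged_away is True, merged-away pages are left alone,
    which models the "VACUUM ignores BTP_MERGED_AWAY contents" proposal.
    """
    for page in pages:
        if page.merged_away and skip_merged_away:
            continue
        page.items = [(key, tid) for key, tid in page.items if tid not in dead]


def make_pages() -> Tuple[ToyPage, ToyPage]:
    """L holds ghost copies of its old items; R holds the merged data."""
    left = ToyPage(items=[("a", (1, 1)), ("b", (1, 2))], merged_away=True)
    right = ToyPage(items=[("a", (1, 1)), ("b", (1, 2)), ("c", (2, 1))])
    return left, right


dead: Set[TID] = {(1, 1)}

# Variant 1: cleanup skips the merged-away page.
left_skipped, right_skipped = make_pages()
toy_bulkdelete([left_skipped, right_skipped], dead, skip_merged_away=True)
stale = [tid for _, tid in left_skipped.items if tid in dead]

# Variant 2: cleanup also processes the merged-away page.
left_cleaned, right_cleaned = make_pages()
toy_bulkdelete([left_cleaned, right_cleaned], dead, skip_merged_away=False)
clean = [tid for _, tid in left_cleaned.items if tid in dead]
```

In the first variant a slow backward scan that still reads L would be
handed TID (1, 1) after the cleanup pass that was supposed to remove it
has completed; in the second variant L no longer contains it.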
> > d. Are BTP_MERGED_AWAY pages still part of the data structure?
> > So, is L still pointed to by both L's left sibling and R, or is it
> > immediately removed from the structure (or at least as immediate as a
> > HALF_DEAD page would)?
> > If it's kept in the structure for extended periods: Why?
>
> L is unlinked from the parent as soon as it becomes BTP_MERGED_AWAY
> and its key space will be assigned to R.
> However, it does remain part of the leaf-level data structure (still
> pointed to by L's left sibling and R).
> This is necessary because backward scans that were positioned between
> L and R during the merge still need to traverse left into L to read
> the original data.
> As soon as it is safe for L to become HALF_DEAD (the MergeXID horizon
> is passed), it will be treated as a normal page deletion.
Could you expand on this "treated as" a bit more? Do you mean that
once the horizon has passed, the next time maintenance comes around
this page will be deleted like a normal empty page would during
vacuum? Or is it immediately considered dead?
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)