| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Salma El-Sayed <salmasayed182003(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions |
| Date: | 2026-05-12 11:47:23 |
| Message-ID: | CAEze2Wjs8juPjXUTzYbCKOfv7rre8eNF7LEoq6EJX5j5UYC_MA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, 8 May 2026 at 16:40, Salma El-Sayed <salmasayed182003(at)gmail(dot)com> wrote:
>
> Hello Hackers,
>
> My name is Salma El-Sayed, and I am working on B-tree Index Bloat Reduction as part of GSoC 2026. This is my introduction email[1]. I would like to share my current approach for page merging and get your feedback.
>
> *The Core Idea*
> Two sibling B-tree pages L (left) and R (right) are candidates for merging when both pages are below a size threshold (less than 10% full for example) and their combined size fits within the fill factor.
>
> The merge copies all data from L into R, then:
> 1- R is marked BTP_MERGED --> it now contains L's data.
> 2- L is marked BTP_MERGED_AWAY --> it is logically gone but kept around temporarily for any active scan that still holds a reference to it.
> 3- L is unlinked from its parent. VACUUM will handle updating the right-link of L's left sibling as part of normal cleanup.
The merging of pages is the simple part. Making scans not skip or
return duplicated tuples is the hard part.
I think it's useful and important to have an up-to-date design with
all such considerations included, so that reviewers also are able to
have the whole picture in mind when providing guidance.
> *Questions for the Community*
>
> 1- Triggering the merge: how aggressive should we be?
>
> How should the merge process be triggered? I see two broad options:
> a- Full-index scan: aggressively find all merge candidates in one pass.
> b- Bounded scan: a separate dedicated scan that stops after a controlled number of pages, giving the user control over how much of the index to process at a time.
I suggest keeping track of candidate pages (i.e., leaf pages with more
than X% free space) during the btree vacuum scans. Whenever you hit a
page that is a merge candidate (or would become one), check whether
any of its sibling pages is already present in the list of candidate
merge pages. If one is, try to merge the two pages; if no sibling is
tracked (or the merge failed), add the current page to that candidate
list.
Presumably, the number of active candidates would remain small enough
to fit in a reasonable amount of memory, as candidates can be removed
from the list once they've been merged.
> A possible interface for the candidate scanner might look something like:
> bt_merge_pages(index regclass, min_fillfactor float8, max_fillfactor float8, max_pages int4)
I don't think a separate scan method for this makes much sense.
> 2- Flag naming and free bits
>
> I am proposing two new flags in btpo_flags:
> - BTP_MERGED: set on R to indicate it absorbed L's data.
> - BTP_MERGED_AWAY: set on L to indicate it has been merged away.
>
> ** Are there free bits available in btpo_flags suitable for this purpose? And does the community see any objection to consuming two bits in this field for the merge mechanism?
btpo_flags has 16 bits, of which 7 don't yet have a meaning.
You might even be able to combine one new bit with BTP_HALF_DEAD, to
limit the need for new bits to just one: presumably, pages cannot be
both HALF_DEAD and MERGED/MERGED_AWAY at the same time.
Kind regards,
Matthias van de Meent
PS.
Some food for thought and questions in return:
1.) Page space usage
The idea that you shared in [0] described a page-level "Merge ID".
Adding this to the page would affect the amount of space available on
btree pages (for at least merged btree pages), and that seems to imply
a possibly backward-incompatible change to currently indexed tables
(less space on a page available -> smaller max tuple size -> current
indexed data may not be reindexable with the new checks in place).
Have you thought about this, and/or figured out whether the decrease
in page space is relevant or can safely be worked around?
2.) changes to page lifecycle operations
There are various concurrent operations that may impact scan
consistency. You've covered how to recover from merges in the
documents linked from your earlier mail, but it's not clear to me
whether these things work without causing issues w.r.t. scan output:
a. Deleting a BTP_MERGED page
When vacuum runs a second time all items on the target page may
now be removable; can that page be deleted safely?
b. Deleting a BTP_MERGED_AWAY page
BTP_MERGED_AWAY pages keep a copy of their old tuples around,
which you mention are used by backward scans. This means its contents
must also be cleaned up by subsequent VACUUM runs, as a backwards
index-only scan may otherwise return TIDs that have been recycled and
received new indexed values. This cleanup can result in an empty page, which can
happen earlier than the XID horizon in the MergeID. Does this design
allow those pages to be reclaimed?
c. A BTP_MERGED page must be split
What happens? And what happens if it keeps splitting, before any
further vacuum (maintenance) operations can affect the merged
contents?
d. Are BTP_MERGED_AWAY pages still part of the data structure?
So, is L still pointed to by both L's left sibling and R, or is it
immediately removed from the structure (or at least as immediate as a
HALF_DEAD page would)?
If it's kept in the structure for extended periods: Why?
e. When/how does VACUUM determine to recycle the BTP_MERGED_AWAY
page (mark it BTP_HALF_DEAD or BTP_DELETED)?
The docs [1] linked from [0] indicate MergeID is present only on
BTP_MERGED pages, so it can't be used to locally determine to start
the recycling process for a BTP_MERGED_AWAY page. What process is used
to clean up this index page?
3.) General performance
I'm worried about the cost of maintaining the scan boundary;
especially for index-only scans. Though it's likely you'll only have
to keep track of this at page boundaries, it's still a new and
possibly significant overhead; index key boundaries are not always as
small as int4/int8 + TID.
4.) Workload
Do you have an example workload that you expect to show improvements?
5.) Inner pages
Are inner pages ever considered for merging, or are you only
targeting leaf pages?
[0]: https://postgr.es/m/CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com
[1]: https://docs.google.com/document/d/1PIMG0N__4BIB0uDWOfcK-AruatNL8SCTlfTlBTCaoCo/edit?tab=t.0