[WiP] B-tree page merge during vacuum to reduce index bloat

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Kirk Wolak <wolakk(at)gmail(dot)com>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>
Subject: [WiP] B-tree page merge during vacuum to reduce index bloat
Date: 2025-08-26 09:40:03
Message-ID: 11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

Together with Kirk and Nik we spent several online hacking sessions to tackle index bloat problems [0,1,2]. As a result we concluded that B-tree index page merge should help to keep an index from pressuring shared_buffers.

We are proposing a feature to automatically merge nearly-empty B-tree leaf pages during VACUUM operations to reduce index bloat. This addresses a common issue where deleted tuples leave behind sparsely populated pages that traditional page deletion cannot handle because they're not completely empty.

*** Implementation Overview:

The patch adds a new `mergefactor` reloption (default 5%, range 0-50%) that defines when a page becomes a candidate for merging. During vacuum, when a page exceeds this threshold (e.g., 95% empty with default settings), we attempt to move the remaining tuples to the right sibling page and then delete the source page using existing page deletion mechanisms.

Key changes:
- New `mergefactor` reloption for configurable merge thresholds
- Detection logic in `btvacuumpage()` to identify merge candidates
- Tuple movement implementation in `_bt_unlink_halfdead_page()`
- WAL logging enhancements to handle cross-page dependencies during replay

The last part needs further improvements (it's simply REGBUF_FORCE_IMAGE for now), but I want to start a discussion and ask for known problems of the approach.

*** Correctness:

The implementation reuses existing locking, critical sections and WAL logging infrastructure. To handle cross-page dependencies during WAL replay, when tuples are merged, the right sibling buffer is registered with `REGBUF_FORCE_IMAGE`, this is a temporary workaround.

*** Current Status & Questions:

The patch successfully reduces index bloat and handles basic scenarios, but we've identified some areas that need community input:

1. Backward Scan Correctness: The primary concern is ensuring backward scans work correctly when pages are being merged concurrently. Since we maintain the same locking protocol as existing page deletion, I believe this should be safe, but would appreciate expert review of the approach.
2. Performance Impact: The additional checks during vacuum have minimal overhead, but broader testing would be valuable. Worst case would be the index with leaf pattern (5%,96%,5%,96%,5%,96%...). We will attempt to merge it every time spending time on acquiring locks.
3. WAL Consistency: There are still some edge cases with WAL consistency checking that need refinement. I think I can handle it, just need to spend enough time on debugging real redo instead of imaging right page.

*** Usage:
CREATE INDEX ON table (col) WITH (mergefactor=10); -- 10% threshold
I don't know if it would be a good idea to enable mergefactor for existing indexes.

The feature is particularly beneficial for workloads with high update/delete rates that create sparse index pages without triggering complete page deletion.

I'm attaching the patch for review and would welcome feedback on the approach, especially regarding backward scan safety and any other correctness concerns we may have missed.

Thank you!

Best regards,
Andrey, Nik, Kirk.

[0] https://www.youtube.com/watch?v=3MleDtXZUlM
[1] https://www.youtube.com/watch?v=Ib3SXSFt8mE
[2] https://www.youtube.com/watch?v=D1PEdDcvZTw

Attachment Content-Type Size
v1-0001-btree-Add-page-merge-during-vacuum-to-reduce-inde.patch application/octet-stream 24.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2025-08-26 09:44:08 Re: Per backend relation statistics tracking
Previous Message shveta malik 2025-08-26 09:32:40 Re: POC: enable logical decoding when wal_level = 'replica' without a server restart