| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Salma El-Sayed <salmasayed182003(at)gmail(dot)com> |
| Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kirk Wolak <wolakk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: [GSoC 2026] - B-tree Index Bloat Reduction - Approach & Questions |
| Date: | 2026-06-30 23:04:46 |
| Message-ID: | CAEze2Wi9pfMG7n-CsPT0XuC5n6Qi=6b1PvQA_vSLpPTWE3kQcg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, 30 Jun 2026 at 19:37, Salma El-Sayed <salmasayed182003(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> We want to share a new design we are working on for handling page merges.
>
> Instead of using BTP_MERGED_AWAY simply as a cache, this design
> repurposes it to direct scans for merge recovery and to hold the
> merge_xid in the same way deleted pages hold safexid.
[... scan mechanism stuff, which I'll comment on below]
> *VACUUM AND FLAG LIFECYCLE*
>
> MERGED pages cannot be merged again until the flag is reset.
>
> The MERGED_AWAY flag serves as a separator between merged groups and
> directs scans to recover from the merge operation. Therefore,
> MERGED_AWAY must stick around until all MERGED flags in its merge
> group are reset.
This seems like it's the right idea.
> Any MERGED page can be deleted at any point if VACUUM decides to
> delete it, without breaking the algorithm. It is impossible to have a
> normal page between MA and M pages. Having an HD page in between is
> not an issue as long as the MERGED flag is never reset in that HD
> page.
What happens when MERGED_AWAY's only MERGED sibling is deleted?
Presumably MERGED_AWAY can be cleaned up then, too, because its tuples
were removed, too?
> We can use the MERGED_AWAY page to save merge_xid as the page content,
> exactly the same way safexid works for deleted pages.
Yep.
> When VACUUM steps into a MERGED_AWAY page and the transaction horizon
> has passed, it MUST reset all MERGED flags first. VACUUM must locate
> the end of the MERGED group, then walk backward to reset all pages
> (similar to how the backward scan recovers). This guarantees a normal
> page will never exist between MA and M pages.
Yes.
> * Resetting a BTP_MERGED flag means clearing it from btpo_flags,
> leaving the page as a standard BTP_LEAF page again.
> * Resetting a BTP_MERGED_AWAY will make it HALF_DEAD.
In general I agree with the page lifecycle mechanism; I think it's the
only way to make this work.
---------
Now, on to the scan component:
> This design is based on the fact that the currPos.items array is only
> overwritten by _bt_readpage, meaning the data remains intact until the
> next _bt_readpage call for the subsequent page.
>
> The core changes are centralized in _bt_readnextpage.
>
> The following fields are added to BTScanOpaqueData:
> bool skipMergeRecovery;
> bool needMergeRecovery;
> ItemPointerData savedMergeTids[MaxTIDsPerBTreePage];
> int nSavedMergeTids; /* number of TIDs in the array */
I think parallel btree scans need the [skip, need]MergeRecovery flags,
and possibly the item pointers themselves, too? I think that could be
a rather large amount of data for the parallel state.
Below some comments on the scan logic:
> Here is how the scan logic works across a sequence of pages:
>
> *FORWARD SCAN*
>
> Scenario: L(MA) <--> R(M) <--> A(M) <--> B(MA) <--> C(HD) <--> D(M)
> (MA = MERGED_AWAY, M = MERGED, HD = HALF_DEAD)
>
> subsequent MERGED pages happen due to the split of page R.
>
> 1) The scan steps into L and finds the MA flag. It skips L and sets
> skipMergeRecovery = true.
>
> 2) The scan steps into R and finds the M flag. It checks skipMergeRecovery:
> a) If true: It reads R as a normal page. Page A is treated the same way.
This can be unsafe if you don't check that L is still the left-link of
R: If L got deleted (its fxid horizon expired and R got a new left
sibling) and the new left sibling got merged into R before we scan it,
then the data on L could contain items that we'd already scanned when
we accessed the now newly minted MERGED_AWAY page that's now R's left
sibling.
This can be detected by detecting that L is no longer the left sibling
(and "merge group separator") of R. Note that R (in this specific case
of R being the immediate sibling of L) cannot get a new left sibling
other than through concurrent deletion of L, which can allow the
former sibling of L to be merged into R.
> b) If false:
> - currPos.items still holds the tuples read from L. We save
> these items into savedMergeTids (we only do this step for the first M
> page in the merge group).
How do you detect it is the first page in the merge group? What if you
land on an M page after descending the tree?
> - Read R by calling _bt_readpage.
> - Loop through currPos.items for R and remove any item that
> exists in savedMergeTids.
> - Return tuples to the caller if any remain after filtering.
> Otherwise, continue to the next M page in the group.
> 3) The scan steps into B (MA) and does the same as L.
> 4) The scan steps into C (HD) and skips it.
Shouldn't this page C be HD+M, given that it is a merged page in the
process of being deleted? Or did I miss something?
> 5) The scan steps into D (M) and does the same as R.
>
> skipMergeRecovery is reset to false as soon as the scan steps onto a
> normal (non-M, non-MA) page, making the scan ready for the next merge
> group.
(see my comment above for the race condition that should be addressed)
> *BACKWARD SCAN*
>
> Scenario: L(MA) <--> R(M) <--> A(M) <--> B(M) <--> C(normal page)
>
> Definitions:
> - skipMergeRecovery: The scan read a M page after a merge occurred,
> requiring no recovery.
> - needMergeRecovery: The scan read a M page before the merge, then
> stepped into an MA page. The flag is set to true to start recovery.
> Stepping into an MA page again while true means recovery is finished.
>
> 1) The scan steps into L:
> a) If skipMergeRecovery OR needMergeRecovery is true: The scan read
> merged pages and will skip this MA page. Reset both booleans to be
> ready for the next merged group.
What if L and R were M pages when the scan accessed R, but
subsequently their MA page (left of L) was deleted and L got merged
into R?
Then we'd skip the keyspace that was merged from L into R, right? I
think it's the same (but inverse) issue as I'd mentioned above, but
more difficult to correctly detect because the siblings pointers of R
don't change, nor do L's.
> b) If neither is true: The scan read R before the merge, and
> currPos.items still has the data read from it.
> - Save these items into savedMergeTids.
> - Walk forward until reaching C (the first page with no M flag).
> - Set blkno = B, and lastcurrblkno = C (needed to recover from
> splits in the backward scan).
> - The scan continues walking backward from B.
This (b) confused me a bit.
IIUC, the order of operations for this condition to be hit is:
1. R was scanned;
2. L got merged into R; [L=MA, R=M]
3. R split into R, A, and B; [R=M, A=M, B=M]
4. the scan accesses L (and condition B is met, and the described
branch is executed).
Is that right?
> 2) The scan steps into B, A, or R:
> a) If needMergeRecovery is false: The simplest case. Read the page
> normally and set skipMergeRecovery = true, so that when the scan steps
> left onto the MA tombstone, it knows to skip it cleanly.
> b) If needMergeRecovery is true: The scan is in merge recovery.
> - Call _bt_readpage to read the page.
> - Loop through currPos.items and remove any item that exists in
> savedMergeTids.
> - Return tuples to the caller if any remain after filtering.
> Otherwise, continue to the next M page in the group.
... and this is safe, because all M pages must be split _after_ the
merge happened, so they all contain the keyspace of [L+R], and the
tuples of R we'd already scanned are excluded. Am I understanding this
correctly?
-------
I have thought up an alternative approach with an incomplete scan
approach (more about "partial" below) that doesn't need to backtrack
and shouldn't have concurrent merge problems when we're already
scanning a merged range:
When merging pages, we attach the MergeID (MA block number, but
optionally with its NextFullTransactionId recycling horizon) to each
of the tuples we move into the right page (as "MERGED_AWAY_TUPLE"
tuples), in a way similar to how we have posting list tuples
(INDEX_ALT_TID_MASK+BT_IS_POSTING) and how we add TIDs to pivot tuples
(INDEX_ALT_TID_MASK+BT_PIVOT_HEAP_TID_ATTR). Vacuum can clean these
tuples away as normal, the merge id is used only by scans to determine
which page the merged tuple came from and decide whether to include it
based on that information.
This can then be used by forward scans to determine that it needs to
skip these tuples, when it did not see the MERGED_AWAY page _after_ it
was merged.
For backward concurrent-merge-recovery, we would keep a copy of the
old tuples around on the MERGED_AWAY page; whilst allowing VACUUM to
reclaim those tuples. These tuples will be accessed only by backward
scans who were stepping from R to L whilst the merge was going on.
Vacuum cleanup of the tuples can turn the page HALF_DEAD once the last
tuple is deleted, which can get the page removed from the tree
structure before the full expiration of the FXID horizon - a minor but
important optimization: An MA page without this would remain "in use"
for two vacuum horizons' timespans, whereas a normal page deletion
"only" takes one. Merging *could* therefore cause a page to take about
one more vacuum horizon in time before it is free for reuse, and so it
could hinder (not help) reclamation of pages.
Cleaning up the MERGED flag also means cleaning/removing the
MERGED_AWAY_TUPLE status off of any index tuples on those MERGED
pages. All MERGED_AWAY_TUPLE tuples must have the block id of the
associated MERGED_AWAY page.
So, we'd have the following:
BTScanPosData {
+ BlockNumber last_merged_away; /* FWD scans */
+ BlockNumber last_known_merged; /* BWD scans */
}
For FWD scans, we'd add the following ahead of processing any data on a page:
IF page is MERGED_AWAY
scan->last_merged_away = currentBlockNo;
(ignore the page)
ELSEIF page is MERGED
scan->last_known_merged = currentBlockNo;
read page data
ELSE /* page neither MERGED_AWAY nor MERGED: normal, or HALF_DEAD */
/* reset merge-related scan state */
scan->last_merged_away = InvalidBlockNo.
scan->last_known_merged = InvalidBlockNo;
read page data
ENDIF
Adjust _bt_readpage with
[...]
IF page_item is MERGED_AWAY_TUPLE and page_item->merged_away_block
!= scan->last_merged_away:
skip page_item; [^1]
ENDIF
[...]
[^1]: This allows us to avoid returning duplicates when the following
pattern occurs
(start): L1 (MA) <-> R1(M) <-> R2 (M),
s1: Scan L1
s1: Scan R1
[A]
s1: Scan R2
[B]
s2: Cleanup L1;
s2: Merge R1+R2;
(state: R1 (MA) <-> R2 (M))
There is no externally apparent distinction for s1 between when the
steps for s2 happen at [A] vs [B]; R2's sibling pointers would be the
same.
R1's tuples having been tagged with its block number allows us to
remove them from the set, as the scan knew about L1's merge, but not
R1's, and thus can skip R1's merged items (any tuples from that page
were either already returned or recently concurrently inserted).
For full BWD scans, you'd add the following to _bt_readpage:
IF page is MERGED
scan->last_known_merged = currentBlockNo;
_bt_readpage() as normal, if not deleted
ELSEIF page is MERGED_AWAY
scan->last_merged_away = currentBlockNo;
IF scan->last_known_merged != block->right_sibling
Scan page->remaining_items as if they were live /* recovers from
the concurrent merge */
ENDIF
ELSE /* page neither MERGED_AWAY nor MERGED: normal, or HALF_DEAD */
/* reset merge-related scan state */
scan->last_merged_away = InvalidBlockNo.
scan->last_known_merged = InvalidBlockNo;
_bt_readpage() as normal
ENDIF
Note: There is one major item in this algo that I haven't yet
considered and that would definitely cause issues: What happens when
you change scan directions somewhere in between pages in a merge
range? As written, it'd cause the tuples from the merge range's
MERGED_AWAY page to be skipped when going FWD, whilst they'd be
included in the BWD portion of the scan, and we really can't have
that. Something that's related to this, is determining what should
happen when you start your scan on a MERGED page, which I've also not
yet figured out.
I think there is an elegant solution to be found to these issues, but
I have not figured the details in the hours I've spent thinking about
this approach. The main issue (I think) is figuring out how to
correctly invert the direction of the scan; I suspect it needs a bit
more state tracking but I'm yet not sure what and where.
Overall, it's really great to see work being done on this hard
problem. Good job so far!
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Michael Paquier | 2026-06-30 23:35:12 | Re: Handle concurrent drop when doing whole database vacuum |
| Previous Message | surya poondla | 2026-06-30 23:03:14 | Re: Handle concurrent drop when doing whole database vacuum |