Randomize B-Tree page split location to avoid oscillating patterns

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Randomize B-Tree page split location to avoid oscillating patterns
Date: 2026-04-27 16:24:13
Message-ID: d6do2mtjcsagn37jf6pjywzhlzyokqja6jlnvcs4ypkvnnuu32@llwuuyxxomc3
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

TL;DR There seems to be a known phenomenon, where a data ingestion into a
B-Tree produces page splits following an oscillating pattern, which in turn
affects IO and buffer contention, impacting the performance. It turns out that
PostgreSQL is not an exception, but it should be possible to randomize a split
location a bit to mitigate the issue.

Hi,

Some time ago while working on models for PostgreSQL performance [1] I've
stumbled upon an interesting oscillating patters around various B-Tree metrics
(number of page splits, index size, etc). This turned out to be a known thing
described in a catchy way as "Waves of misery" [2], and boils down to the fact
that a fixed split location is usually chosen for a page split -- in this case
probabilities of page split lead to oscillating solutions under certain
workloads. Looks like the only pre-condition is that the data to be ingested
has the same distribution as the data already existing in the tree, in
particular UUIDs are in a bad position for this. And of course it's possible to
reproduce this with PostgreSQL.

Such an oscillation can lead to variability in IO and buffer contention,
negatively impacting performance. The fillfactor only shifts the waves, but do
not cancel them. One of the proposed remediation for this is to do suffix
truncation, which will "spread" split locations across some range. While we do
column suffix truncation, it turns out to be not enough for many workloads.
Another option is to randomize the split location, e.g. pick the actual
location from a range of 20% around the best one. In our case it's easy to do
randomization based on the split state, as all the possible split locations are
sorted by delta -- and all what's needed is to add a shift to the lowsplit from
a range, based on the number of split locations.

I've done few experiments with this, here is how it looks like:

* The first one is synthetic: a single column table with integer values, a
B-Tree index over it with the fillfactor=100, inserting new values one by one
from a uniform distribution via PGBench. In the graph "split.png" you can see
the number of page splits over time for the main branch and the patch (named
"Main" and "Rand" correspondingly), and the oscillating is clear for the
former one.

* Another one is following a data schema from one real projects out there, with
UUIDs as index values and very large records, with the default fillfactor=90.
The data ingestion is happening in large batches. The graph "split_batch.png"
represent the data for this case, with the main branch oscillating much more
than the randomized.

The unfortunate part is that I couldn't get clear numbers for the performance
impact. Turns out the disk in my experimental setup is not good enough to get a
sufficient number of inserts to trigger the issue, and to get nice graphs I was
running everything either on a RAM disk or on an unlogged table -- in both
cases it's easy to observe oscillations of page splits, but their impact is not
large enough since only so much IO is happening.

But anyway, any thoughts / commentaries on that?

[1]: https://zenodo.org/records/15786156
[2]: Glombiewski N., Seeger B., Graefe G. (2019). Waves of Misery After Index
Creation. BTW 2019. Gesellschaft für Informatik. doi:10.18420/btw2019-06

Attachment Content-Type Size
v1-0001-Add-a-USDT-for-tracing-nbtree-page-splits.patch text/plain 1.5 KB
v1-0002-Randomize-nbtree-split-location-to-avoid-oscillat.patch text/plain 3.2 KB
image/png 47.0 KB
image/png 55.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2026-04-27 16:24:14 Re: repack: fix a bug to reject deferrable primary key fallback for concurrent mode
Previous Message Dilip Kumar 2026-04-27 16:21:00 Re: Proposal: Conflict log history table for Logical Replication