Quick Links

Re: Randomize B-Tree page split location to avoid oscillating patterns

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Randomize B-Tree page split location to avoid oscillating patterns
Date:	2026-04-27 18:07:14
Message-ID:	CAH2-Wz=Sy7=kAHrtNrDSoo+vj_QYAmNc8_GEYPh6r+0r-BbQ5w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Apr 27, 2026 at 12:24 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Some time ago while working on models for PostgreSQL performance [1] I've
> stumbled upon an interesting oscillating patters around various B-Tree metrics
> (number of page splits, index size, etc). This turned out to be a known thing
> described in a catchy way as "Waves of misery" [2], and boils down to the fact
> that a fixed split location is usually chosen for a page split -- in this case
> probabilities of page split lead to oscillating solutions under certain
> workloads.

I'm quite familiar with the paper, and find its argument convincing.
I'm not sure of the overall importance of the issues that they
describe, but the phenomenon is obviously real.

> Such an oscillation can lead to variability in IO and buffer contention,
> negatively impacting performance. The fillfactor only shifts the waves, but do
> not cancel them. One of the proposed remediation for this is to do suffix
> truncation, which will "spread" split locations across some range. While we do
> column suffix truncation, it turns out to be not enough for many workloads.
> Another option is to randomize the split location, e.g. pick the actual
> location from a range of 20% around the best one. In our case it's easy to do
> randomization based on the split state, as all the possible split locations are
> sorted by delta -- and all what's needed is to add a shift to the lowsplit from
> a range, based on the number of split locations.

My interpretation is that randomization can be combined with the usual
suffix truncation split point choosing heuristics; the paper even
recommends this at one point. That is, you randomly pick from a small
number of candidate split points that are all approximately equally
good according to the traditional criteria. It looks like your patch
doesn't account for suffix truncation at all, though.

Your patch should initially look for several split points that are
approximately equal in quality according to the current criteria -- a
separate, initial pass. You'd then randomly pick a final split point
from those gathered during this initial pass -- not from the original
fillfactormult-sorted list of split points. What you have in v1 will
make suffix truncation much less effective, which seems unacceptable.

Another issue is that nbtsort.c doesn't have any ability to pick among
split points; it focuses entirely on keeping free space balanced. To
get much benefit from this, I think you'd have to teach nbtsort.c to
also pick split points using approximately the same algorithm as
nbtsplitloc.c would (assuming retail inserts in ascending key space
order). I've been meaning to add proper suffix truncation to the
CREATE INDEX case.

> I've done few experiments with this, here is how it looks like:
>
> * The first one is synthetic: a single column table with integer values, a
> B-Tree index over it with the fillfactor=100, inserting new values one by one
> from a uniform distribution via PGBench. In the graph "split.png" you can see
> the number of page splits over time for the main branch and the patch (named
> "Main" and "Rand" correspondingly), and the oscillating is clear for the
> former one.

I don't think that fillfactor=100 is very interesting in general.

> The unfortunate part is that I couldn't get clear numbers for the performance
> impact. Turns out the disk in my experimental setup is not good enough to get a
> sufficient number of inserts to trigger the issue, and to get nice graphs I was
> running everything either on a RAM disk or on an unlogged table -- in both
> cases it's easy to observe oscillations of page splits, but their impact is not
> large enough since only so much IO is happening.
>
> But anyway, any thoughts / commentaries on that?

I wouldn't expect the patch to increase absolute throughput
significantly, if at all. Its value comes from making the *rate* of
splits over time more consistent, for a given fixed workload. You
might notice a more interesting effect if you look at latency,
particularly worst case latency.

--
Peter Geoghegan

In response to

Randomize B-Tree page split location to avoid oscillating patterns at 2026-04-27 16:24:13 from Dmitry Dolgov

Responses

Re: Randomize B-Tree page split location to avoid oscillating patterns at 2026-04-27 19:49:57 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Nathan Bossart	2026-04-27 19:43:20	Re: [BUG?] macOS (Intel) build warnings: "ranlib: file … has no symbols" for aarch64 objects
Previous Message	Alexander Lakhin	2026-04-27 18:00:00	Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process