Re: Randomize B-Tree page split location to avoid oscillating patterns

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Randomize B-Tree page split location to avoid oscillating patterns
Date: 2026-06-16 10:13:47
Message-ID: jotfal6422zpqvux4add7xcyftaujjiksscfcc5uuibaldw4sv@pzriktpx4gyl
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Thu, May 07, 2026 at 08:10:30PM +0200, Dmitry Dolgov wrote:
>
> To summarize, in comparison with the second approach the third one seems to
> have less overhead, the same guarantees regarding penalty, and more complex
> implementation. I'm fine going with the second approach, but first would like
> to discuss the alternatives and make sure we're on the same page.

Here is a sketch of what I had in mind. First two patches are metrics
only (adding USDT, I think 0001 makes sense to have in general, 0002 is
purely for benchmarking). The third patch implements strategy of
collecting equally good split points and choosing from them, and an
alternative version does the same via shifting the search interval
(attached with the txt suffix to not confuse the CF bot, whatever it
does nowadays).

I've managed to perform the same test as before (a best case scenario,
single column unique index) with a bit more throughput, and both
versions showed the same improvement over the main branch (see the
splits.png, where the patch versions are marked as "Rand (best loc)" and
"Rand (shifted)" for searching equally good split points and shifted
search interval correspondingly). It was also enough to produce visible
improvement for latency stddev as reported by pgbench (see stddev.png).

Measuring latency between newly introduced USDT inside _bt_findsplitloc
in this test shows that the latter version (shifted interval) is marginally
faster, probably due to the former needing to remember new data (equally good
split points):

// Time spent in _bt_findsplitloc for "shifted" version

@nsecs:
[4, 8) 2 | |
[8, 16) 9433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 11365 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64) 3940 |@@@@@@@@@@@@@@@@@@ |
[64, 128) 1255 |@@@@@ |
[128, 256) 595 |@@ |
[256, 512) 72 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 1 | |

// Time spent in _bt_findsplitloc for "equally good points" version

@nsecs:
[4, 8) 1 | |
[8, 16) 8265 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 10391 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64) 4982 |@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 1155 |@@@@@ |
[128, 256) 698 |@@@ |
[256, 512) 79 | |
[512, 1K) 2 | |
[1K, 2K) 0 | |
[2K, 4K) 10 | |

Attachment Content-Type Size
v2-0001-Add-USDT-for-nbtree-page-splits-and-suffix-trunca.patch text/plain 2.5 KB
v2-0002-Add-USDT-for-split-location-benchmarking.patch text/plain 1.6 KB
v2-0003-Randomize-nbtree-split-location-to-avoid-oscillat.patch text/plain 4.4 KB
v2-0003-Randomize-nbtree-split-location-to-avoid-oscillat.patch.txt text/plain 4.5 KB
image/png 59.0 KB
image/png 47.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ewan Young 2026-06-16 10:27:19 Re: GRAPH_TABLE: aggregates/window/set-returning functions in COLUMNS crash the backend
Previous Message Bertrand Drouvot 2026-06-16 10:09:48 Re: Avoid orphaned objects dependencies, take 3