Abbreviated keys in nbtree internal pages

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: Abbreviated keys in nbtree internal pages
Date: 2017-06-27 19:20:10
Message-ID: CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Over on the "memory layouts for binary search in nbtree" thread, I
described a plausible way of implementing abbreviated keys for nbtree
internal pages [1]. I've been going on about this technique for a long
time, but the insight that we can do it by reusing an itemId's lp_len
field is a new one. I'm encouraged by the fact that I got a POC patch
to work quite easily.

This optimization is all about reducing the number of cache misses
within _bt_binsrch(). We'd do this for internal pages only. The big
idea is that binary searches may only have to access the ItemId array
on internal pages, which is often less than 1/8 of the size of the
page, and sometimes only 1/16 (consider internal page fillfactor).

Areas of concern/interest include:

* How to implement an encoding scheme to make the best of the 15 bits
we'd have for this. That could involve hard trade-offs for something
like numeric. Note that I'm not interested in making any encoding
scheme that is adaptive -- simplicity demands that we implement
totally generic encoding schemes, I think. We'd also make the
comparisons totally generic unsigned integer comparisons, that are
hard coded within _bt_compare().

15 bits is more than it sounds. It would be useless for sorting, but
within B-Trees things are quite different. You can amortize the cost
of generating the key over a very long period of time, and abbreviated
keys are only needed for internal pages (typically perhaps 1 in 50 -
200 leaf tuples for text). You only need to generate a new abbreviated
key during a leaf page split (page splits of internal pages just reuse
the existing one). And, most importantly, internal pages naturally
represent dispersed values. You've an excellent chance at resolving
comparisons with only one byte abbreviated key comparisons within the
root page. 15 bits gets you surprisingly far.

* What other code this optimization might break.

* Workloads helped by the optimization, that should be tested.

* Whether it's okay that some workloads will not be helped at all. The
hope here is that branch prediction and the fact that we're not
accessing stuff not already in a cache line means those cases don't
hurt either.

* We'll want to do this for text's default B-Tree opclass, with a real
collation, since that is where it would really matter. I think there
are concerns about the stability of something like a strxfrm() blob
over and above the concerns we have for sort support. You don't just
have to worry about the stability of the behavior of collations -- you
also have to worry about the stability of actual blobs on disk, and
that a technical enhancement to the collator algorithm could break
things that wouldn't break sort support.

ICU has facilities for this, and the ICU talks about storing strxfrm()
blobs on disk for a long time, but the details would need to be worked
out.

[1] postgr.es/m/CAH2-WzmW9LjTfzp7uY4CHsF3NH0YM-_pow3LXRh18ORLgeu48A@mail.gmail.com
--
Peter Geoghegan

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-06-27 20:05:44 Re: Reducing pg_ctl's reaction time
Previous Message Tom Lane 2017-06-27 18:59:18 Re: Reducing pg_ctl's reaction time