Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
Date: 2020-01-27 23:01:51
Message-ID: CAH2-WzkwFL1kz40on+7Rzmi1b2Gf8g+KXNXjwch=_GKWkDO+mA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 27, 2020 at 9:14 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The main idea here is to make _bt_compare() delay
> > calling BTreeTupleGetNAtts() until the point after the first attribute
> > turns out to be equal to the _bt_compare() caller's insertion scankey.
> > Many or most calls to _bt_compare() won't even need to call
> > BTreeTupleGetNAtts().
>
> FWIW, I still think it might be better to *continue* loading ntupatts
> where it's currently done, but keep the change to the loop
> termination the way you have it. That way we don't add a branch to check
> for ntupatts, and because we don't depend on the result to enter the
> loop, we don't have a stall. I'd even keep the Min() bit. I'm a bit
> afraid that delaying the load will add one (smaller) stall after the key
> comparison, and that the additional branches will be noticeable too.

I can do it that way. I am not attached to the current approach in
0001-* at all.
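
To make sure we're talking about the same two shapes, here is a toy
standalone model of both (nothing here is the real _bt_compare() or
the actual patch; ToyTuple and plain int comparisons just stand in for
index tuples and scan key comparators, and both variants assume at
least one attribute to compare, as in the common case):

typedef struct
{
    int     natts;          /* stands in for BTreeTupleGetNAtts() */
    int     keys[4];
} ToyTuple;

/* 0001-* approach: don't read natts until attribute one ties */
static int
compare_delayed(const ToyTuple *tup, const int *skey, int keysz)
{
    int     result = (skey[0] > tup->keys[0]) - (skey[0] < tup->keys[0]);

    if (result != 0)
        return result;      /* decided without ever touching natts */

    /* only now pay for the natts access (and its possible stall) */
    int     ncmp = keysz < tup->natts ? keysz : tup->natts;

    for (int i = 1; i < ncmp; i++)
    {
        result = (skey[i] > tup->keys[i]) - (skey[i] < tup->keys[i]);
        if (result != 0)
            return result;
    }
    return 0;
}

/*
 * Andres' suggestion: issue the natts load up front, with no new
 * branch, but test loop termination only after each comparison, so
 * entering the loop never waits on the loaded value.
 */
static int
compare_eager_load(const ToyTuple *tup, const int *skey, int keysz)
{
    int     ncmp = keysz < tup->natts ? keysz : tup->natts; /* the Min() bit */
    int     i = 0;

    for (;;)
    {
        int     result = (skey[i] > tup->keys[i]) - (skey[i] < tup->keys[i]);

        if (result != 0)
            return result;
        if (++i >= ncmp)
            return 0;
    }
}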

> My intuition is that a binary search optimized layout (next para) will
> be a bigger win, and probably easier. There are pretty clear profiling
> indicators that even the access to the ItemId array in the binary search
> is most of the time a cache miss and causes a stall - and it makes sense
> too.
>
> I.e. instead of a plain sorted order, store the n ItemIds on a page
> in the order of
> [1/2 n, 1/4 n, 3/4 n, 1/8 n, 3/8 n, 5/8 n, 7/8 n, ...]
> as binary search looks first at 1/2, then at either 1/4 or 3/4, then
> either (1/8 or 3/8) or (5/8, 7/8), and so on, this layout means that
> commonly the first four levels of the ItemId array are on a *single*
> cacheline.

You don't really have to convince me of anything here. I see these as
essentially the same project already. I am only really emphasizing the
abbreviated keys thing because it's obviously unbeatable with the
right workload.
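
To make the layout concrete, here is a toy standalone sketch of that
breadth-first ("Eytzinger") ordering, with plain ints standing in for
a page's ItemId array -- a model of the idea, not page code:

#include <stdio.h>

#define NITEMS 15           /* 2^k - 1, so the BFS tree is perfect */

/* Recursively place sorted[lo..hi] into eytz[] in BFS order. */
static void
build_eytzinger(const int *sorted, int *eytz, int lo, int hi, int root)
{
    if (lo > hi)
        return;

    int     mid = lo + (hi - lo) / 2;

    eytz[root] = sorted[mid];
    build_eytzinger(sorted, eytz, lo, mid - 1, 2 * root + 1);
    build_eytzinger(sorted, eytz, mid + 1, hi, 2 * root + 2);
}

/* Binary search: children of slot i live at slots 2i+1 and 2i+2. */
static int
search_eytzinger(const int *eytz, int n, int key)
{
    int     i = 0;

    while (i < n)
    {
        if (key == eytz[i])
            return i;
        i = (key < eytz[i]) ? 2 * i + 1 : 2 * i + 2;
    }
    return -1;
}

int
main(void)
{
    int     sorted[NITEMS],
            eytz[NITEMS];

    for (int i = 0; i < NITEMS; i++)
        sorted[i] = i * 10;
    build_eytzinger(sorted, eytz, 0, NITEMS - 1, 0);

    /*
     * eytz[] is now [70, 30, 110, 10, 50, 90, 130, 0, 20, ...]: the
     * first four levels are slots 0..14, i.e. fifteen 4-byte ItemIds,
     * which fit in one 64-byte cacheline.
     */
    printf("found 90 at slot %d\n", search_eytzinger(eytz, NITEMS, 90));
    return 0;
}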

Working on B-Tree stuff for v12 really convinced me of the value of an
integrated approach, at least in this area. Everything affects
everything else, so expanding the scope of a project can actually be
really helpful. It's very normal for these optimizations to be worth a
lot more when combined than they are worth individually. I know that
you have had similar experiences in other areas of the code.

> I think this works particularly well for inner btree pages, because we
> don't rewrite their itemid lists all that often, so the somewhat higher
> cost of that doesn't matter much; similarly, the higher cost of
> sequentially iterating isn't significant either.

Obviously all of these techniques are only practical because of the
asymmetry between leaf pages and internal pages. Internal pages are
where the large majority of comparisons are performed in most OLTP
workloads, and yet their tuples are often only about one third of one
percent of the total number of tuples in the B-Tree. That is the
specific ratio within the pgbench indexes, IIRC. Having more than one
percent of all tuples come from internal pages is fairly exceptional
-- you only really see it in indexes that are on very long text
strings.
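
To spell out that arithmetic with round numbers: every internal tuple
has exactly one downlink, so the number of internal tuples is about
the number of leaf pages, which makes the internal-to-leaf tuple ratio
roughly 1/leaf-fanout. An int4 index leaf page holds ~367 tuples at
the default fillfactor, and 1/367 is ~0.27% -- about one third of one
percent.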

> I do completely agree that having a small high-entropy abbreviated key
> inside the ItemId would be an awesome improvement, as it can entirely
> remove many of the hard to predict accesses. My gut feeling is just that
> a) it's a pretty darn hard project.
> b) it'll be a smaller win as long as there's an unpredictable access to
> the abbreviated key

It will be relatively straightforward to come up with a basic
abbreviated keys prototype that targets one particular data
distribution and index type, though. For example, I can focus on your
pgbench SELECT workload. That way, I won't have to do any of the hard
work of making abbreviated keys work with a variety of workloads,
while still getting a good idea of the benefits in one specific case.
For this prototype, I can either not do prefix compression to get a
high entropy abbreviated key, or do the prefix compression in a way
that is totally inflexible, but still works well enough for this
initial test workload.
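
As a strawman of the shape I have in mind (the structures are invented
-- real ItemIdData has no spare field like this -- and the prefix
compression really is the totally inflexible kind: subtract the page's
low key, then drop bits until the page's key range fits in 16 bits):

#include <stdint.h>

typedef struct
{
    uint16_t    abbrev;     /* abbreviated key */
    uint16_t    index;      /* stands in for lp_off */
} AbbrevItemId;

typedef struct
{
    int64_t     low;        /* lowest key on the "page" */
    int         shift;      /* bits dropped so the range fits 16 bits */
    AbbrevItemId items[256];
    int64_t     keys[256];  /* stands in for the tuples themselves */
    int         nitems;
} ToyPage;

/* Monotonic for key >= page->low, which page searches can assume here. */
static uint16_t
abbreviate(const ToyPage *page, int64_t key)
{
    return (uint16_t) ((key - page->low) >> page->shift);
}

/* Build from a sorted key array; assumes n <= 256. */
static void
toy_page_init(ToyPage *page, const int64_t *sorted, int n)
{
    page->low = sorted[0];
    page->shift = 0;
    while (((sorted[n - 1] - page->low) >> page->shift) > UINT16_MAX)
        page->shift++;
    page->nitems = n;
    for (int i = 0; i < n; i++)
    {
        page->keys[i] = sorted[i];
        page->items[i].index = (uint16_t) i;
        page->items[i].abbrev = abbreviate(page, sorted[i]);
    }
}

/*
 * Comparison used by the binary search: usually decided by the
 * abbreviation alone, touching only the cache-friendly item array.
 * Only on a tie do we take the unpredictable access to the tuple --
 * the residual stall Andres worries about in (b).
 */
static int
compare_item(const ToyPage *page, const AbbrevItemId *item, int64_t want)
{
    uint16_t    want_ab = abbreviate(page, want);

    if (want_ab != item->abbrev)
        return want_ab < item->abbrev ? -1 : 1;

    int64_t     key = page->keys[item->index];

    if (want == key)
        return 0;
    return want < key ? -1 : 1;
}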

My estimate is that it would take me 4-6 weeks to write a prototype
along those lines. That isn't so bad.

--
Peter Geoghegan
