Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()
Date: 2020-01-27 17:14:05
Message-ID: 20200127171405.q64kz63hg6cr4uqy@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-01-26 14:49:06 -0800, Peter Geoghegan wrote:
> Andres and I discussed bottlenecks in the nbtree code during the
> recent PgDay SF community event. Andres observed that the call to
> BTreeTupleGetNAtts() in _bt_compare() can become somewhat of a
> bottleneck. I came up with the very simple attached POC-quality
> patches, which Andres tested and profiled with his original complaint
> in mind yesterday. He reported increased throughput on a memory
> bound simple pgbench SELECT-only workload.

Yea - it shows up as a pipeline stall, because the loop depends on
having loaded the tuple, which basically requires two
unlikely-to-be-cached memory loads to complete. Whereas with the
patch a good bit of that latency can be hidden by out-of-order
execution, as e.g. the tupledesc and scankey accesses are not dependent
on the memory load for the tuple having finished.

> The main idea here is to make _bt_compare() delay
> calling BTreeTupleGetNAtts() until the point after the first attribute
> turns out to be equal to the _bt_compare() caller's insertion scankey.
> Many or most calls to _bt_compare() won't even need to call
> BTreeTupleGetNAtts().

FWIW, I still think it might be better to *continue* loading ntupatts
where it's currently done, but keep the change to the loop termination
the way you have it. That way we don't add a branch to check for
ntupatts, and because we don't depend on the result to enter the loop,
we don't have a stall. I'd even keep the Min() bit. I'm a bit afraid
that delaying the load will add one (smaller) stall after the key
comparison, and that the additional branches will be noticeable too.

> Andres and I both agree that there is a lot more work to be done in
> this area, but that will be a major undertaking. I am quite keen on
> the idea of repurposing the redundant-to-nbtree ItemId.lp_len field to
> store an abbreviated key. Making that work well is a considerable
> undertaking, since you need to use prefix compression to get a high
> entropy abbreviated key. It would probably take me the best part of a
> whole release cycle to write such a patch. The attached patches get
> us a relatively easy win in the short term, though.

My intuition is that a binary-search-optimized layout (next paragraph)
will be a bigger win, and probably easier. There are pretty clear
profiling indicators that even the access to the ItemId array in the
binary search is a cache miss most of the time and causes a stall - and
that makes sense too.

I.e. instead of a plain sorted order, store the n ItemIds on a page
in the order
[1/2 n, 1/4 n, 3/4 n, 1/8 n, 3/8 n, 5/8 n, 7/8 n, ...]
As binary search looks first at 1/2, then at either 1/4 or 3/4, then
at either (1/8 or 3/8) or (5/8 or 7/8), and so on, this layout means
that commonly the first four levels of the ItemId array are on a
*single* cacheline. Whereas in contrast, using the normal layout, that
is *never* the case for a page with more than a few entries. And even
beyond the first few levels, the "sub"-trees the binary search descends
into are concentrated onto fewer cachelines. It's not just the reduced
number of cachelines touched; the layout is also very prefetchable,
because the tree levels are basically laid out sequentially left to
right, which many CPU prefetchers can recognize.
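
For reference, a generic sketch of that layout (often called an
"Eytzinger" or BFS layout) on a plain int array - standalone example
code, not nbtree's ItemId handling, and it assumes GCC/Clang for
__builtin_ffsll():

    #include <stddef.h>

    /* Copy sorted src[0..n) into dst[] in BFS order: dst[0] is the
     * midpoint, dst[1]/dst[2] the quarter points, and so on, so the
     * first few levels of the implicit tree share a cacheline.
     * Call as: eytzinger(src, dst, 0, 0, n). */
    static size_t
    eytzinger(const int *src, int *dst, size_t i, size_t k, size_t n)
    {
        if (k < n)
        {
            i = eytzinger(src, dst, i, 2 * k + 1, n);   /* left subtree */
            dst[k] = src[i++];                          /* this node */
            i = eytzinger(src, dst, i, 2 * k + 2, n);   /* right subtree */
        }
        return i;
    }

    /* Search visits dst[0], then dst[1] or dst[2], ...; each level is
     * laid out left to right, which is prefetcher-friendly.  Returns
     * the dst[] index of the first element >= key, or -1 if none. */
    static int
    eytzinger_search(const int *dst, size_t n, int key)
    {
        size_t k = 0;

        while (k < n)
            k = 2 * k + 1 + (dst[k] < key);
        /* k + 1 encodes the descent path (low bit set = went right);
         * shifting off the trailing right turns plus the last left
         * turn leaves the 1-based index of the lower bound, or 0 */
        k = (k + 1) >> __builtin_ffsll((long long) ~(k + 1));
        return k == 0 ? -1 : (int) (k - 1);
    }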

I think this works particularly well for inner btree pages, because we
don't rewrite their ItemId lists all that often, so the somewhat higher
cost of maintaining that layout doesn't matter much; similarly, the
higher cost of iterating sequentially isn't significant either.

Now that's only the ItemId array - whereas a larger share of the cache
misses comes from the index tuple accesses. The nice bit there is that
we can optimize the order of the index tuples on the page without any
format changes (and even the read access code won't change). I.e. we
can lay out the tuples in an *approximately* binary-search-optimized
order, without needing to change anything but the layout "writing"
code, as the ItemId.lp_off indirection will hide that.

I do completely agree that having a small high-entropy abbreviated key
inside the ItemId would be an awesome improvement, as it can entirely
remove many of the hard-to-predict accesses. My gut feeling is just that
a) it's a pretty darn hard project.
b) it'll be a smaller win as long as there's an unpredictable access to
the abbreviated key.

Greetings,

Andres Freund
