From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-04 21:16:49
Message-ID: CAH2-Wz=Dhkbguk31kCgs62Uj6cqsQn3GA3SYBDp-FWidKncwYA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 3, 2019 at 5:41 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Great, looks much simpler now, indeed! Now I finally understand the
> algorithm.

Glad to hear it. Thanks for the additional review!

Attached is v14, which has changes based on your feedback. This
includes changes based on your more recent feedback on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch, though
I'll respond to those points directly in a later email.

v14 also changes the logic that decides whether an alternative strategy
should be used: the decision is now based on the leftmost and rightmost
splits for the entire page, rather than on accessing the page directly.
We always handle the newitem-at-end edge case correctly now, since the
new "top down" approach has all legal splits close at hand. This is
more elegant, more obviously correct, and even more effective, at least
in some cases -- it's another example of why "top down" is the superior
approach for nbtsplitloc.c. This change made my "UK land registry data"
index have about 2.5% fewer leaf pages than with v13, which is small
but significant.

Separately, I made most of the new nbtsplitloc.c functions use a
FindSplitData argument in v14, which simplifies their signatures quite
a bit.
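
To give an idea of the shape of this, here is an abbreviated sketch
from my working copy (field and helper names could still change):

    typedef struct
    {
        /* context for evaluating candidate splits (abbreviated) */
        Relation    rel;            /* index relation */
        Page        page;           /* page undergoing split */
        IndexTuple  newitem;        /* new item (cause of page split) */
        Size        newitemsz;      /* size of newitem */
        bool        is_leaf;        /* T if splitting a leaf page */
        bool        is_rightmost;   /* T if splitting rightmost page */
        OffsetNumber newitemoff;    /* where new item is to be inserted */
        int         leftspace;      /* space available on left page */
        int         rightspace;     /* space available on right page */

        /* candidate split point bookkeeping */
        int         maxsplits;      /* maximum number of splits */
        int         nsplits;        /* current number of splits */
        SplitPoint *splits;         /* all candidate split points */
    } FindSplitData;

    static void _bt_recsplitloc(FindSplitData *state,
                                OffsetNumber firstoldonright,
                                bool newitemonleft,
                                int olddataitemstoleft,
                                Size firstoldonrightsz);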

> What would be the worst case scenario for this? Splitting a page that
> has as many tuples as possible, I guess, so maybe inserting into a table
> with a single-column index, with 32k BLCKSZ. Have you done performance
> testing on something like that?

I'll test that (added to my project TODO list), though it's not
obvious that that's the worst case: page splits will be less frequent
there, and will have better choices about where to split.

> Rounding up the allocation to BLCKSZ seems like a premature
> optimization. Even if it saved some cycles, I don't think it's worth the
> trouble of having to explain all that in the comment.

Removed that optimization.

> I think you could change the curdelta, leftfree, and rightfree fields in
> SplitPoint to int16, to make the array smaller.

Added this alternative optimization to replace the BLCKSZ allocation
thing. I even found a way to get the array elements down to 8 bytes
(using bitfields, plus the same pragmas that we use to make sure that
item pointers are packed), but that made the code noticeably slower
with "many duplicates" splits, so I didn't end up doing it.
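
For reference, the slimmed-down struct now looks roughly like this --
about 10 bytes after padding, rather than 16 with plain ints (a sketch
of my working copy; the exact layout may still change):

    typedef struct
    {
        /* details of free space left by split */
        int16       curdelta;       /* current leftfree/rightfree delta */
        int16       leftfree;       /* space left on left page post-split */
        int16       rightfree;      /* space left on right page post-split */

        /* split point identifying fields */
        OffsetNumber firstoldonright;   /* first old item on right page */
        bool        newitemonleft;      /* new item goes on left, or right? */
    } SplitPoint;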

> > * You seemed to refactor _bt_checksplitloc() in passing, making it not
> > do the newitemisfirstonright thing. I changed that back. Did I miss
> > something that you intended here?
>
> My patch treated the new item the same as all the old items, in
> _bt_checksplitloc(), so it didn't need newitemisonright. You still need
> it with your approach.

I would feel better about it if we stuck to the same method as before
for calculating whether a split point is legal (the only difference
being that we pessimistically add heap TID overhead to the new high
key on the leaf level). That seems safer to me.
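
Concretely, the only change to the old space accounting is roughly the
following (a sketch of my working copy; the surrounding bookkeeping is
carried over from the existing _bt_checksplitloc() approach):

    /*
     * The first right item becomes the left page's new high key.  On
     * the leaf level, pessimistically assume that suffix truncation
     * cannot shrink it, and that a heap TID will have to be appended.
     */
    if (state->is_leaf)
        leftfree -= (int16) (firstoldonrightsz +
                             MAXALIGN(sizeof(ItemPointerData)));
    else
        leftfree -= (int16) firstoldonrightsz;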

> > Changes to my own code since v12:
> >
> > * Simplified "Add "split after new tuple" optimization" commit, and
> > made it more consistent with associated code. This is something that
> > was made a lot easier by the new approach. It would be great to hear
> > what you think of this.
>
> I looked at it very briefly. Yeah, it's pretty simple now. Nice!

I can understand why it might be difficult to express an opinion on
the heuristics themselves. The specific cut-off points (e.g. details
of what "heap TID adjacency" actually means) are not that easy to
defend with a theoretical justification, though they have been
carefully tested. I think it's worth comparing the "split after new
tuple" optimization to the traditional leaf fillfactor of 90, which is
a very similar situation. Why should it be 90? Why not 85, or 95? Why
is it okay to assume that the rightmost page shouldn't be split 50/50?

The answers to all of these questions about the well established idea
of a leaf fillfactor boil down to this: it's very likely to be correct
on average, and when it isn't correct the problem is self-limiting,
and has an infinitesimally small chance of continually recurring
(unless you imagine an *adversarial* case). Similarly, it doesn't
matter if these new heuristics get it wrong once every 1000 splits (a
very pessimistic estimate), because even then those will cancel each
other out in the long run. It is necessary to take a holistic view of
things. We're talking about an optimization that makes the two largest
TPC-C indexes over 40% smaller -- I can hold my nose if I must in
order to get that benefit. There were also a couple of indexes in the
real-world mouse genome database that this made much smaller, so this
will clearly help in the real world.

Besides all this, the "split after new tuple" optimization fixes an
existing worst case, rather than being an optimization, at least in my
mind. It's not supposed to be possible to have leaf pages that are all
only 50% full without any deletes, and yet we allow it to happen in
this one weird case. Even completely random insertions result in 65% -
70% average space utilization, so the existing worst case really is
exceptional. We are forced to take a holistic view, and infer
something about the pattern of insertions over time, even though
holistic is a dirty word.

> > (New header comment block over _bt_findsplitloc())
>
> This is pretty good, but I think some copy-editing can make this even
> better

I've restored the old structure of the _bt_findsplitloc() header comments.

> The explanation of how the high key for the left page is formed (5th
> paragraph), seems out-of-place here, because the high key is not formed
> here.

Moved that to _bt_split(), which seems like a good compromise.

> Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
> the function effectively uses a different fillfactor in some other
> scenarios too, not only when it's the rightmost page.

Done.

> > state.maxsplits = maxoff + 2;
> > state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
> > state.nsplits = 0;
>
> I wouldn't be that paranoid. The code that populates the array is pretty
> straightforward.

Done that way. But are you sure? Some of the attempts to create a new
split point are bound to fail, because they try to put everything
(including the new item) on one side of the split. I'll leave the
assertion there.
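
So the allocation is now simply sized to the number of conceivable
split points, with the assertion kept as cheap insurance (sketch):

    /*
     * Size the array for every conceivable split point: at most one
     * per gap between items, counting the incoming new item as if it
     * were already on the page.  Not all of them will turn out to be
     * legal.
     */
    state.maxsplits = maxoff + 2;
    state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
    state.nsplits = 0;

    /* ...later, in _bt_recsplitloc(), when recording a candidate: */
    Assert(state->nsplits < state->maxsplits);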

> > * Still, it's typical for almost all calls to _bt_recordsplit to
> > * determine that the split is legal, and therefore enter it into the
> > * candidate split point array for later consideration.
> > */
>
> Suggestion: Remove the "Still" word. The observation that typically all
> split points are legal is valid, but it seems unrelated to the first
> paragraph. (Do we need to mention it at all?)

Removed the second paragraph entirely.

> > /*
> > * If the new item goes as the last item, record the split point that
> > * leaves all the old items on the left page, and the new item on the
> > * right page. This is required because a split that leaves the new item
> > * as the firstoldonright won't have been reached within the loop. We
> > * always record every possible split point.
> > */
>
> Suggestion: Remove the last sentence.

Agreed. Removed.

> ISTM that figuring out which "mode" we want to operate in is actually
> the *primary* purpose of _bt_perfect_penalty(). We only really use the
> penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
> suggest changing the function name to something like _bt_choose_mode(),
> and have secondmode be the primary return value from it, with
> perfectpenalty being the secondary result through a pointer.

I renamed _bt_perfect_penalty() to _bt_strategy(), since I agree that
its primary purpose is to decide on a strategy (the new name for what
v13 called a mode, per your request a bit further down). It still
returns perfectpenalty, though, since that seemed more natural to me,
probably because that style matches the style of its caller,
_bt_findsplitloc(). perfectpenalty isn't a mere optimization -- it is
important for preventing the many duplicates strategy from going
overboard with suffix truncation. It does more than just save
_bt_bestsplitloc() cycles, which I've tried to make clearer in v14.
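
The resulting interface looks roughly like this in v14 (names could
still change):

    typedef enum
    {
        SPLIT_DEFAULT,          /* give some weight to truncation */
        SPLIT_MANY_DUPLICATES,  /* find minimally distinguishing point */
        SPLIT_SINGLE_VALUE      /* leave left page almost full */
    } FindSplitStrat;

    /*
     * Decides on the strategy (returned via *strategy), and returns
     * the perfect penalty, which _bt_bestsplitloc() treats as "good
     * enough to stop searching" -- and which also caps how aggressive
     * SPLIT_MANY_DUPLICATES can be about suffix truncation.
     */
    static int _bt_strategy(FindSplitData *state,
                            SplitPoint *leftpage,
                            SplitPoint *rightpage,
                            FindSplitStrat *strategy);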

> It doesn't really choose the mode, either, though. At least after the
> next "Add split after new tuple optimization" patch. The caller has a
> big part in choosing what to do. So maybe split _bt_perfect_penalty into
> two functions: _bt_perfect_penalty(), which just computes the perfect
> penalty, and _bt_analyze_split_interval(), which determines how many
> duplicates there are in the top-N split points.

Hmm. I didn't create a _bt_analyze_split_interval(), because now
_bt_perfect_penalty()/_bt_strategy() is responsible for setting the
perfect penalty in all cases. It was a mistake for me to move some
perfect penalty stuff for alternative modes/strategies out to the
caller in v13. In v14, we never make _bt_findsplitloc() change its
perfect penalty -- it only changes its split interval, based on the
strategy/mode, possibly after sorting. Let me know what you think of
this.

> BTW, I like the word "strategy", like you called it in the comment on
> SplitPoint struct, better than "mode".

I've adopted that terminology in v14 -- it's always "strategy", never "mode".

> How about removing the "usemult" variable, and just check if
> fillfactormult == 0.5?

I need to use "usemult" to determine if the "split after new tuple"
optimization should apply the leaf fillfactor, or should instead split
at the exact point after the newly inserted tuple -- it's very natural
to have a single bool flag for that. It seems simpler to continue to
use "usemult" for everything, and not to distinguish "split after new
tuple" as a special case later on. (Besides, the master branch already
uses a bool for this, even though it handles far fewer things.)
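
To make that concrete, "usemult" just selects between a weighted and
an unweighted free space delta when the candidate split points are
sorted, regardless of which case got us here -- roughly (a sketch;
variable names approximate):

    int     delta;

    if (usemult)
        /* weight delta using fillfactormult (e.g. 0.90 for rightmost) */
        delta = fillfactormult * split->leftfree -
                (1.0 - fillfactormult) * split->rightfree;
    else
        /* simple delta: aim for an even 50:50 free space split */
        delta = split->leftfree - split->rightfree;

    if (delta < 0)
        delta = -delta;
    split->curdelta = delta;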

> > /*
> > * There are a much smaller number of candidate split points when
> > * splitting an internal page, so we can afford to be exhaustive. Only
> > * give up when pivot that will be inserted into parent is as small as
> > * possible.
> > */
> > if (!state->is_leaf)
> > return MAXALIGN(sizeof(IndexTupleData) + 1);
>
> Why are there fewer candidate split points on an internal page?

The comment should say that there is typically a much smaller split
interval (this used to be controlled by limiting the size of the array
initially -- should have updated this for v13 of the patch). I believe
that you understand that, and are interested in why the split interval
itself is different on internal pages. Or why we are more conservative
with internal pages in general. Assuming that's what you meant, here
is my answer:

The "Prefix B-Tree" paper establishes the idea that there are
different split intervals for leaf pages and internal pages (which it
calls branch pages). We care about different things in each case. With
leaf pages, we care about choosing the split point that allows us to
create the smallest possible pivot tuple as our secondary goal
(primary goal is balancing space). With internal pages, we care about
choosing the smallest tuple to insert into the parent of the internal
page (often the root) as our secondary goal, but don't care about
truncation, because _bt_split() won't truncate the new pivot. That's
why the definition of "penalty" varies according to whether we're
splitting an internal page or a leaf page. Clearly the idea of having
separate split intervals is well established, and makes sense.
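
To make the difference concrete, here is a simplified sketch of the
penalty function (the split_* helpers are stand-ins for the real
accessors in my patch; _bt_keep_natts_fast() is the truncation-aware
comparison):

    static int
    _bt_split_penalty(FindSplitData *state, SplitPoint *split)
    {
        if (!state->is_leaf)
        {
            /*
             * Internal page: firstright becomes the new pivot that is
             * inserted into the parent as-is, so its size is the
             * penalty.
             */
            return split_firstright_size(state, split);
        }

        /*
         * Leaf page: the penalty is the number of attributes that
         * suffix truncation must keep in the new pivot, given the
         * tuples on either side of the split point.
         */
        return _bt_keep_natts_fast(state->rel,
                                   split_lastleft(state, split),
                                   split_firstright(state, split));
    }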

It's fair to ask if I'm being too conservative (or not conservative
enough) with the split interval in either case. Unfortunately, the
Prefix B-Tree paper never seems to give practical advice about how to
come up with an interval. The authors say:

"We have not analyzed the influence of sigma L [leaf interval] or
sigma B [branch/internal interval] on the performance of the trees. We
expect such an analysis to be quite involved and difficult. We are
quite confident, however, that small split intervals improve
performance considerably. Sets of keys that arise in practical
applications are often far from random, and clusters of similar keys
differing only in the last few letters (e.g. plural forms) are quite
common."

I am aware of another, not-very-notable paper that tries to impose
some theory here, but it doesn't really help much [1]. Anyway, I've
found that I was too conservative with the split interval for internal
pages. It pays to make the internal interval higher than the leaf
interval, because internal pages cover a much bigger portion of the
key space than leaf pages, and will tend to get filled up one way or
another. Leaf pages cover a tight part of the key space, in contrast.
In v14, I've increased the internal page split interval to 18, a big
increase from 3, and twice what it is for leaf splits (still 9 -- no
change there). This usually doesn't behave much differently than 3,
since the pivot tuples are usually all the same size anyway. However,
in cases where suffix truncation makes pivot tuples a lot smaller
(e.g. the UK land registry test case), this makes the items in the
root a lot smaller on average, and can even make the whole index
smaller. My entire test suite has a few cases that are noticeably
improved by this change, and no cases that are hurt.
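
Concretely, the intervals are just hard-coded caps in v14 (a sketch;
the strategy can further narrow the interval that's actually applied):

    /* limits on split interval (number of candidate split points) */
    #define MAX_LEAF_INTERVAL       9
    #define MAX_INTERNAL_INTERVAL   18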

I'm going to have to revalidate the performance of long-running
benchmarks with this change, so this should be considered provisional.
I think that it will probably be kept, though. Not expecting it to
noticeably impact either BenchmarkSQL or pgbench benchmarks.

[1] https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1
--
Peter Geoghegan

Attachment Content-Type Size
v14-0001-Refactor-nbtree-insertion-scankeys.patch application/octet-stream 55.3 KB
v14-0003-Consider-secondary-factors-during-nbtree-splits.patch application/octet-stream 49.6 KB
v14-0004-Add-split-after-new-tuple-optimization.patch application/octet-stream 12.1 KB
v14-0005-Add-high-key-continuescan-optimization.patch application/octet-stream 8.6 KB
v14-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch application/octet-stream 158.7 KB
v14-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch application/octet-stream 16.9 KB
v14-0007-DEBUG-Add-pageinspect-instrumentation.patch application/octet-stream 7.8 KB
