Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-03 13:41:29
Message-ID: 5467deea-53f1-65ac-1828-40af68b6d940@iki.fi
Lists: pgsql-hackers

On 26/02/2019 12:31, Peter Geoghegan wrote:
> On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I spent some time first trying to understand the current algorithm, and
>> then rewriting it in a way that I find easier to understand. I came up
>> with the attached. I think it optimizes for the same goals as your
>> patch, but the approach is quite different.
>
> Attached is v13 of the patch series, which significantly refactors
> nbtsplitloc.c to implement the algorithm using the approach from your
> prototype posted on January 28 -- I now take a "top down" approach
> that materializes all legal split points up-front, as opposed to the
> initial "bottom up" approach that used recursion, and weighed
> everything (balance of free space, suffix truncation, etc) all at
> once.

Great, looks much simpler now, indeed! Now I finally understand the
algorithm.

> I'm using qsort() to sort the candidate split points array. I'm not
> trying to do something clever to avoid the up-front effort of sorting
> everything, even though we could probably get away with that much of
> the time (e.g. by doing a top-N sort in default mode). Testing has
> shown that using an inlined qsort() routine in the style of
> tuplesort.c would make my serial test cases/microbenchmarks faster,
> without adding much complexity. We're already competitive with the
> master branch even without this microoptimization, so I've put that
> off for now. What do you think of the idea of specializing an
> inlineable qsort() for nbtsplitloc.c?

If the performance is acceptable without it, let's not bother. We can
optimize later.

What would be the worst case scenario for this? Splitting a page that
has as many tuples as possible, I guess, so maybe inserting into a table
with a single-column index, with 32k BLCKSZ. Have you done performance
testing on something like that?

> Unlike in your prototype, v13 makes the array for holding candidate
> split points into a single big allocation that is always exactly
> BLCKSZ. The idea is that palloc() can thereby recycle the big
> _bt_findsplitloc() allocation within _bt_split(). palloc() considers
> 8KiB to be the upper limit on the size of individual blocks it
> manages, and we'll always go on to palloc(BLCKSZ) through the
> _bt_split() call to PageGetTempPage(). In a sense, we're not even
> allocating memory that we weren't allocating already. (Not sure that
> this really matters, but it is easy to do it that way.)

Rounding up the allocation to BLCKSZ seems like a premature
optimization. Even if it saved some cycles, I don't think it's worth the
trouble of having to explain all that in the comment.

I think you could change the curdelta, leftfree, and rightfree fields in
SplitPoint to int16, to make the array smaller.
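
Something like this, I mean (just a sketch; the identifying fields as in
your patch):

typedef struct
{
	/* details of free space left by split */
	int16		curdelta;	/* current leftfree/rightfree delta */
	int16		leftfree;	/* space left on left page, post-split */
	int16		rightfree;	/* space left on right page, post-split */

	/* split point identifying fields */
	OffsetNumber firstoldonright;	/* first old item on right page */
	bool		newitemonleft;	/* new item goes on left, or right? */
} SplitPoint;

Even with 32k BLCKSZ, the free space left on either half can never
exceed 32767, because the page header is always deducted, so int16 is
wide enough.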

> Other changes from your prototype:
>
> * I found a more efficient representation than a pair of raw
> IndexTuple pointers for each candidate split. Actually, I use the same
> old representation (firstoldonright + newitemonleft) in each split,
> and provide routines to work backwards from that to get the lastleft
> and firstright tuples. This approach is far more space efficient, and
> space efficiency matters when you're allocating space for hundreds of
> items in a critical path like this.

Ok.
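
Just to spell out my understanding of that: working backwards from
(firstoldonright, newitemonleft) to the lastleft tuple would look
something like this (a sketch, assuming the state struct carries the
page, newitemoff and newitem):

static inline IndexTuple
_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
{
	ItemId		itemid;

	/* is the new item the last tuple on the left half? */
	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
		return state->newitem;

	itemid = PageGetItemId(state->page,
						   OffsetNumberPrev(split->firstoldonright));
	return (IndexTuple) PageGetItem(state->page, itemid);
}

and correspondingly for the firstright tuple, returning the new item
when it's not on the left and newitemoff == firstoldonright. An
OffsetNumber and a bool per candidate, instead of two tuple pointers,
is indeed a nice saving.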

> * You seemed to refactor _bt_checksplitloc() in passing, making it not
> do the newitemisfirstonright thing. I changed that back. Did I miss
> something that you intended here?

My patch treated the new item the same as all the old items, in
_bt_checksplitloc(), so it didn't need newitemisonright. You still need
it with your approach.

> Changes to my own code since v12:
>
> * Simplified "Add "split after new tuple" optimization" commit, and
> made it more consistent with associated code. This is something that
> was made a lot easier by the new approach. It would be great to hear
> what you think of this.

I looked at it very briefly. Yeah, it's pretty simple now. Nice!

About this comment on _bt_findsplitloc():

>/*
> * _bt_findsplitloc() -- find an appropriate place to split a page.
> *
> * The main goal here is to equalize the free space that will be on each
> * split page, *after accounting for the inserted tuple*. (If we fail to
> * account for it, we might find ourselves with too little room on the page
> * that it needs to go into!)
> *
> * If the page is the rightmost page on its level, we instead try to arrange
> * to leave the left split page fillfactor% full. In this way, when we are
> * inserting successively increasing keys (consider sequences, timestamps,
> * etc) we will end up with a tree whose pages are about fillfactor% full,
> * instead of the 50% full result that we'd get without this special case.
> * This is the same as nbtsort.c produces for a newly-created tree. Note
> * that leaf and nonleaf pages use different fillfactors.
> *
> * We are passed the intended insert position of the new tuple, expressed as
> * the offsetnumber of the tuple it must go in front of (this could be
> * maxoff+1 if the tuple is to go at the end). The new tuple itself is also
> * passed, since it's needed to give some weight to how effective suffix
> * truncation will be. The implementation picks the split point that
> * maximizes the effectiveness of suffix truncation from a small list of
> * alternative candidate split points that leave each side of the split with
> * about the same share of free space. Suffix truncation is secondary to
> * equalizing free space, except in cases with large numbers of duplicates.
> * Note that it is always assumed that caller goes on to perform truncation,
> * even with pg_upgrade'd indexes where that isn't actually the case
> * (!heapkeyspace indexes). See nbtree/README for more information about
> * suffix truncation.
> *
> * We return the index of the first existing tuple that should go on the
> * righthand page, plus a boolean indicating whether the new tuple goes on
> * the left or right page. The bool is necessary to disambiguate the case
> * where firstright == newitemoff.
> *
> * The high key for the left page is formed using the first item on the
> * right page, which may seem to be contrary to Lehman & Yao's approach of
> * using the left page's last item as its new high key on the leaf level.
> * It isn't, though: suffix truncation will leave the left page's high key
> * fully equal to the last item on the left page when two tuples with equal
> * key values (excluding heap TID) enclose the split point. It isn't
> * necessary for a new leaf high key to be equal to the last item on the
> * left for the L&Y "subtree" invariant to hold. It's sufficient to make
> * sure that the new leaf high key is strictly less than the first item on
> * the right leaf page, and greater than the last item on the left page.
> * When suffix truncation isn't possible, L&Y's exact approach to leaf
> * splits is taken (actually, a tuple with all the keys from firstright but
> * the heap TID from lastleft is formed, so as to not introduce a special
> * case).
> *
> * Starting with the first right item minimizes the divergence between leaf
> * and internal splits when checking if a candidate split point is legal.
> * It is also inherently necessary for suffix truncation, since truncation
> * is a subtractive process that specifically requires lastleft and
> * firstright inputs.
> */

This is pretty good, but I think some copy-editing can make this even
better. If you look at the old comment, it had this structure:

1. Explain what the function does.
2. Explain the arguments
3. Explain the return value.

The additions to this comment broke that structure. The explanations of
the arguments and the return value are now in the middle, in the 3rd and
4th paragraphs, and the 3rd paragraph, which explains the arguments,
also goes into detail on what the function does with them. I'd suggest
moving things around to restore the old structure, which was clearer.

The explanation of how the high key for the left page is formed (5th
paragraph) seems out of place here, because the high key is not formed
here.

Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
the function effectively uses a different fillfactor in some other
scenarios too, not only when it's the rightmost page.

In the function itself:

> * maxsplits should never exceed maxoff because there will be at most as
> * many candidate split points as there are points _between_ tuples, once
> * you imagine that the new item is already on the original page (the
> * final number of splits may be slightly lower because not all splits
> * will be legal). Even still, add space for an extra two splits out of
> * sheer paranoia.
> */
> state.maxsplits = maxoff + 2;
> state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
> state.nsplits = 0;

I wouldn't be that paranoid. The code that populates the array is pretty
straightforward.
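
I.e., going by the comment's own argument, something like this should be
enough (sketch):

	state.maxsplits = maxoff;	/* at most one candidate per point
								 * between tuples */
	state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
	state.nsplits = 0;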

> /*
> * Scan through the data items and calculate space usage for a split at
> * each possible position. We start at the first data offset rather than
> * the second data offset to handle the "newitemoff == first data offset"
> * case (otherwise, a split whose firstoldonright is the first data offset
> * can't be legal, and won't actually end up being recorded by
> * _bt_recordsplit).
> *
> * Still, it's typical for almost all calls to _bt_recordsplit to
> * determine that the split is legal, and therefore enter it into the
> * candidate split point array for later consideration.
> */

Suggestion: remove the word "Still". The observation that typically all
split points are legal is valid, but it seems unrelated to the first
paragraph. (Do we need to mention it at all?)

> /*
> * If the new item goes as the last item, record the split point that
> * leaves all the old items on the left page, and the new item on the
> * right page. This is required because a split that leaves the new item
> * as the firstoldonright won't have been reached within the loop. We
> * always record every possible split point.
> */

Suggestion: remove the last sentence. An earlier comment already says
that we calculate space usage for a split at each possible position;
that seems sufficient, like it was before this patch.

> /*
> * Find lowest possible penalty among split points currently regarded as
> * acceptable -- the "perfect" penalty. The perfect penalty often saves
> * _bt_bestsplitloc() additional work around calculating penalties. This
> * is also a convenient point to determine if default mode worked out, or
> * if we should instead reassess which split points should be considered
> * acceptable (split interval, and possibly fillfactormult).
> */
> perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
> newitem, &secondmode);

ISTM that figuring out which "mode" we want to operate in is actually
the *primary* purpose of _bt_perfect_penalty(). We only really use the
penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
suggest renaming the function to something like _bt_choose_mode(), and
making secondmode the primary return value, with perfectpenalty being a
secondary result returned through a pointer.

It doesn't really choose the mode either, though, at least not after the
next "Add split after new tuple optimization" patch; the caller has a
big part in choosing what to do. So maybe split _bt_perfect_penalty()
into two functions: _bt_perfect_penalty(), which just computes the
perfect penalty, and _bt_analyze_split_interval(), which determines how
many duplicates there are among the top-N split points.

BTW, I like the word "strategy", as you called it in the comment on the
SplitPoint struct, better than "mode".
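
To sketch the interfaces I have in mind (names and signatures
hypothetical, of course):

/* just compute the cheapest penalty any candidate split could have */
static int _bt_perfect_penalty(Relation rel, Page page,
							   FindSplitData *state);

/* count duplicates among the top-N split points, so that the caller
 * can decide whether to switch to another strategy */
static int _bt_analyze_split_interval(FindSplitData *state,
									  int perfectpenalty);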

> + if (usemult)
> + delta = fillfactormult * split->leftfree -
> + (1.0 - fillfactormult) * split->rightfree;
> + else
> + delta = split->leftfree - split->rightfree;
>

How about removing the "usemult" variable, and just checking whether
fillfactormult == 0.5?
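
That is:

	if (fillfactormult != 0.5)
		delta = fillfactormult * split->leftfree -
			(1.0 - fillfactormult) * split->rightfree;
	else
		delta = split->leftfree - split->rightfree;

When fillfactormult is exactly 0.5, the general formula just scales the
plain difference by 0.5, which doesn't change how the deltas compare
against each other, so the test loses nothing compared to the separate
flag.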

> /*
> * There are a much smaller number of candidate split points when
> * splitting an internal page, so we can afford to be exhaustive. Only
> * give up when pivot that will be inserted into parent is as small as
> * possible.
> */
> if (!state->is_leaf)
> return MAXALIGN(sizeof(IndexTupleData) + 1);

Why are there fewer candidate split points on an internal page?

- Heikki
