Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-12-28 18:04:37
Message-ID: 8c54b249-1483-76bc-dad7-e6d65c4d0ca8@iki.fi
Lists: pgsql-hackers

On 04/12/2018 05:10, Peter Geoghegan wrote:
> Attached is v9, ...

I spent some time reviewing this. I skipped the first patch, to add a
column to pg_depend, and I got through patches 2, 3 and 4. Impressive
results, and the code looks sane.

I wrote a laundry list of little comments on minor things, suggested
rewordings of comments etc. I hope they're useful, but feel free to
ignore/override my opinions of any of those, as you see best.

But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate
commit, first? It seems like a good thing for readability on its own,
and would slim the big main patch. (And make sure to credit Andrey for
that idea in the commit message.)

2. In the "Treat heap TID as part of the nbtree key space" patch:

> * Build an insertion scan key that contains comparison data from itup
> * as well as comparator routines appropriate to the key datatypes.
> *
> + * When itup is a non-pivot tuple, the returned insertion scan key is
> + * suitable for finding a place for it to go on the leaf level. When
> + * itup is a pivot tuple, the returned insertion scankey is suitable
> + * for locating the leaf page with the pivot as its high key (there
> + * must have been one like it at some point if the pivot tuple
> + * actually came from the tree).
> + *
> + * Note that we may occasionally have to share lock the metapage, in
> + * order to determine whether or not the keys in the index are expected
> + * to be unique (i.e. whether or not heap TID is treated as a tie-breaker
> + * attribute). Callers that cannot tolerate this can request that we
> + * assume that this is a heapkeyspace index.
> + *
> * The result is intended for use with _bt_compare().
> */
> -ScanKey
> -_bt_mkscankey(Relation rel, IndexTuple itup)
> +BTScanInsert
> +_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)

This 'assumeheapkeyspace' flag feels awkward. What if the caller knows
that it is a v3 index? There's no way to tell _bt_mkscankey() that.
(There's no need for that, currently, but seems a bit weird.)

_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which
calls _bt_mkscankey(). It's holding a lock on the page being split. Do
we risk deadlock by locking the metapage at the same time?

I don't have any great ideas on what to do about this, but it's awkward
as it is. Can we get away without the new argument? Could we somehow
arrange things so that rd_amcache would be guaranteed to already be set?

3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard
to understand. I've spent a while looking at it now, and I think I have
a vague understanding of what things it takes into consideration, but I
don't understand why it performs those multiple stages, what each stage
does, and how that leads to an overall strategy. I think a rewrite would
be in order, to make that more understandable. I'm not sure what exactly
it should look like, though.

If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or
SINGLE_VALUE modes, it has to redo a lot of the work that was done in
the DEFAULT mode already. That's probably not a big deal in practice,
performance-wise, but I feel that it's another hint that some
refactoring would be in order.

One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at
that location. Like the current code does. For each offset, calculate a
"penalty" based on two factors:

* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to
store the heap TID

Define the penalty function so that having to add a heap TID to the
pivot tuple is considered very expensive, more expensive than anything
else, and truncating away other attributes gives a reward of some size.

However, naively computing the penalty upfront for every offset would be
a bit wasteful. Instead, start from the middle of the page, and walk
"outwards" towards both ends, until you find a "good enough" penalty.

Or something like that...
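To make the idea concrete, here's a rough sketch in C of what I have in mind. All the names, weights, and the "good enough" cutoff are hypothetical, invented for illustration; they're not from the patch:

```c
#include <assert.h>

/* Hypothetical per-split-point inputs, precomputed by the caller. */
typedef struct SplitCandidate
{
    int free_space_delta;   /* imbalance of free space after the split */
    int pivot_natts;        /* key attributes the pivot tuple would keep */
    int needs_heap_tid;     /* 1 if the pivot must carry a heap TID */
} SplitCandidate;

static int
split_penalty(const SplitCandidate *c)
{
    /* Heap TID in the pivot is worst; then pivot width; then imbalance. */
    return c->needs_heap_tid * 1000000 +
           c->pivot_natts * 1000 +
           c->free_space_delta;
}

/* Walk outwards from the middle until a good-enough penalty is found. */
static int
find_split_offset(const SplitCandidate *cand, int ncand, int good_enough)
{
    int middle = ncand / 2;
    int best = middle;
    int best_penalty = split_penalty(&cand[middle]);

    for (int dist = 1; best_penalty > good_enough && dist <= middle; dist++)
    {
        for (int side = -1; side <= 1; side += 2)
        {
            int probe = middle + side * dist;
            int p;

            if (probe < 0 || probe >= ncand)
                continue;
            p = split_penalty(&cand[probe]);
            if (p < best_penalty)
            {
                best_penalty = p;
                best = probe;
            }
        }
    }
    return best;
}

/* A small fixed scenario: the middle split is perfectly balanced but
 * would force a heap TID into the pivot, so a nearby split wins. */
static int
demo_choose(void)
{
    SplitCandidate cand[5] = {
        {400, 3, 0},
        { 40, 2, 0},
        {  0, 3, 1},    /* middle: balanced, but needs heap TID */
        {200, 1, 0},    /* lopsided, but a one-attribute pivot */
        {400, 3, 0},
    };

    return find_split_offset(cand, 5, 5000);
}
```

The single pass with an early exit would replace the current multi-stage DEFAULT / MANY_DUPLICATES / SINGLE_VALUE structure, at least conceptually.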

Now, the laundry list of smaller items:

----- laundry list begins -----

First commit's commit message:

> Make nbtree treat all index tuples as having a heap TID trailing key
> attribute. Heap TID becomes a first class part of the key space on all
> levels of the tree. Index searches can distinguish duplicates by heap
> TID, at least in principle.

What do you mean by "at least in principle"?

> Secondary index insertions will descend
> straight to the leaf page that they'll insert on to (unless there is a
> concurrent page split).

What is a "Secondary" index insertion?

> Naively adding a new attribute to every pivot tuple has unacceptable
> overhead (it bloats internal pages), so suffix truncation of pivot
> tuples is added. This will generally truncate away the "extra" heap TID
> attribute from pivot tuples during a leaf page split, and may also
> truncate away additional user attributes. This can increase fan-out,
> especially when there are several attributes in an index.

Suggestion: "when there are several attributes in an index" -> "in a
multi-column index"

> +/*
> + * Convenience macro to get number of key attributes in tuple in low-context
> + * fashion
> + */
> +#define BTreeTupleGetNKeyAtts(itup, rel) \
> + Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
> +

What is "low-context fashion"?

> + * scankeys is an array of scan key entries for attributes that are compared
> + * before scantid (user-visible attributes). Every attribute should have an
> + * entry during insertion, though not necessarily when a regular index scan
> + * uses an insertion scankey to find an initial leaf page.

Suggestion: Reword to something like "During insertion, there must be a
scan key for every attribute, but when starting a regular index scan,
some can be omitted."

> +typedef struct BTScanInsertData
> +{
> + /*
> + * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
> + * search on the leaf level when only scantid has changed. Only used for
> + * insertions where _bt_check_unique() is called.
> + */
> + bool savebinsrch;
> + bool restorebinsrch;
> + OffsetNumber low;
> + OffsetNumber high;
> +
> + /* State used to locate a position at the leaf level */
> + bool heapkeyspace;
> + bool nextkey;
> + ItemPointer scantid; /* tiebreaker for scankeys */
> + int keysz; /* Size of scankeys */
> + ScanKeyData scankeys[INDEX_MAX_KEYS]; /* Must appear last */
> +} BTScanInsertData;

It would feel more natural to me, to have the mutable state *after* the
other fields. Also, it'd feel less error-prone to have 'scantid' be
ItemPointerData, rather than a pointer to somewhere else. The
'heapkeyspace' name isn't very descriptive. I understand that it means
that the heap TID is part of the keyspace. Not sure what to suggest
instead, though.
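Something like the following layout, perhaps. This is only a sketch with stub type definitions so it compiles standalone; I've kept scankeys last, on the assumption that the "Must appear last" comment is there so the struct can be allocated truncated to keysz entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for PostgreSQL types, just so the sketch compiles. */
typedef unsigned short OffsetNumber;
typedef struct ItemPointerData { unsigned short bi_hi, bi_lo, posid; } ItemPointerData;
typedef struct ScanKeyData { int sk_flags; } ScanKeyData;
#define INDEX_MAX_KEYS 32

typedef struct BTScanInsertData
{
    /* State used to locate a position at the leaf level */
    bool heapkeyspace;
    bool nextkey;
    ItemPointerData scantid;    /* tiebreaker, stored inline rather than
                                 * as a pointer to somewhere else */
    int  keysz;                 /* Size of scankeys */

    /*
     * Mutable binary-search cache for _bt_binsrch(), used only for
     * insertions where _bt_check_unique() is called.  Kept after the
     * immutable search state.
     */
    bool savebinsrch;
    bool restorebinsrch;
    OffsetNumber low;
    OffsetNumber high;

    /* Must still appear last, so a truncated allocation works */
    ScanKeyData scankeys[INDEX_MAX_KEYS];
} BTScanInsertData;
```

With scantid inline, "no scantid" would need a separate flag or an invalid-TID convention instead of a NULL pointer; that's a trade-off to consider.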

> +The requirement that all btree keys be unique is satisfied by treating heap
> +TID as a tiebreaker attribute. Logical duplicates are sorted in heap item
> +pointer order.

Suggestion: "item pointer" -> TID, to use consistent terms.

> We don't use btree keys to disambiguate downlinks from the
> +internal pages during a page split, though: only one entry in the parent
> +level will be pointing at the page we just split, so the link fields can be
> +used to re-find downlinks in the parent via a linear search. (This is
> +actually a legacy of when heap TID was not treated as part of the keyspace,
> +but it does no harm to keep things that way.)

I don't understand this paragraph.

> +Lehman and Yao talk about pairs of "separator" keys and downlinks in
> +internal pages rather than tuples or records. We use the term "pivot"
> +tuple to distinguish tuples which don't point to heap tuples, that are
> +used only for tree navigation. Pivot tuples include all tuples on
> +non-leaf pages and high keys on leaf pages.

Suggestion: reword to "All tuples on non-leaf pages, and high keys on
leaf pages, are pivot tuples"

> Note that pivot tuples are
> +only used to represent which part of the key space belongs on each page,
> +and can have attribute values copied from non-pivot tuples that were
> +deleted and killed by VACUUM some time ago. A pivot tuple may contain a
> +"separator" key and downlink, just a separator key (in practice the
> +downlink will be garbage), or just a downlink.

Rather than store garbage, set it to zeros?

> +Lehman and Yao require that the key range for a subtree S is described by
> +Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
> +A search where the scan key is equal to a pivot tuple in an upper tree
> +level must descend to the left of that pivot to ensure it finds any equal
> +keys. Pivot tuples are always a _strict_ lower bound on items on their
> +downlink page; the equal item(s) being searched for must therefore be to
> +the left of that downlink page on the next level down. (It's possible to
> +arrange for internal page tuples to be strict lower bounds in all cases
> +because their values come from leaf tuples, which are guaranteed unique by
> +the use of heap TID as a tiebreaker. We also make use of hard-coded
> +negative infinity values in internal pages. Rightmost pages don't have a
> +high key, though they conceptually have a positive infinity high key). A
> +handy property of this design is that there is never any need to
> +distinguish between equality in the case where all attributes/keys are used
> +in a scan from equality where only some prefix is used.

"distinguish between ... from ..." doesn't sound like correct grammar.
Suggestion: "distinguish between ... and ...", or just "distinguish ...
from ...". Or rephrase the sentence some other way.

> +We truncate away suffix key attributes that are not needed for a page high
> +key during a leaf page split when the remaining attributes distinguish the
> +last index tuple on the post-split left page as belonging on the left page,
> +and the first index tuple on the post-split right page as belonging on the
> +right page.

That's a very long sentence.

> * Since the truncated tuple is probably smaller than the
> * original, it cannot just be copied in place (besides, we want
> * to actually save space on the leaf page). We delete the
> * original high key, and add our own truncated high key at the
> * same offset. It's okay if the truncated tuple is slightly
> * larger due to containing a heap TID value, since pivot tuples
> * are treated as a special case by _bt_check_third_page().

By "treated as a special case", I assume that _bt_check_third_page()
always reserves some space for that? Maybe clarify that somehow.

_bt_truncate():
> This is possible when there are
> * attributes that follow an attribute in firstright that is not equal to the
> * corresponding attribute in lastleft (equal according to insertion scan key
> * semantics).

I can't comprehend that sentence. Please use simpler English, and maybe
add an example.

> /*
> * _bt_leave_natts - how many key attributes to leave when truncating.
> *
> * Caller provides two tuples that enclose a split point. CREATE INDEX
> * callers must pass build = true so that we may avoid metapage access. (This
> * is okay because CREATE INDEX always creates an index on the latest btree
> * version.)
> *
> * This can return a number of attributes that is one greater than the
> * number of key attributes for the index relation. This indicates that the
> * caller must use a heap TID as a unique-ifier in new pivot tuple.
> */
> static int
> _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
> bool build)

IMHO "keep" would sound better here than "leave".

> + if (needheaptidspace)
> + ereport(ERROR,
> + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> + errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
> + itemsz, BTREE_VERSION, BTMaxItemSize(page),
> + RelationGetRelationName(rel)),
> + errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
> + ItemPointerGetBlockNumber(&newtup->t_tid),
> + ItemPointerGetOffsetNumber(&newtup->t_tid),
> + RelationGetRelationName(heap)),
> + errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
> + "Consider a function index of an MD5 hash of the value, "
> + "or use full text indexing."),
> + errtableconstraint(heap,
> + RelationGetRelationName(rel))));
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> + errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
> + itemsz, BTMaxItemSizeNoHeapTid(page),
> + RelationGetRelationName(rel)),
> + errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
> + ItemPointerGetBlockNumber(&newtup->t_tid),
> + ItemPointerGetOffsetNumber(&newtup->t_tid),
> + RelationGetRelationName(heap)),
> + errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
> + "Consider a function index of an MD5 hash of the value, "
> + "or use full text indexing."),
> + errtableconstraint(heap,
> + RelationGetRelationName(rel))));

Could restructure this to avoid having two almost identical strings to
translate.
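For example, since the two branches differ only in the version number and the size limit, those could be computed up front and fed into a single message. A standalone sketch of the idea, using snprintf as a stand-in for errmsg and made-up size limits in place of BTMaxItemSize()/BTMaxItemSizeNoHeapTid():

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BTREE_VERSION 4

/* Stand-in for BTMaxItemSize()/BTMaxItemSizeNoHeapTid(); the sizes
 * here are invented for the sketch. */
static size_t
max_item_size(int needheaptidspace)
{
    return needheaptidspace ? 2704 : 2712;
}

/* One translatable format string serves both branches; only the
 * version number and the size limit vary. */
static int
format_oversize_error(char *buf, size_t buflen, size_t itemsz,
                      int needheaptidspace, const char *indexname)
{
    unsigned int version = needheaptidspace ? BTREE_VERSION : 3;

    return snprintf(buf, buflen,
                    "index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
                    itemsz, version, max_item_size(needheaptidspace), indexname);
}

/* Check that the v3 branch produces the v3 wording. */
static int
demo_v3_message(void)
{
    char buf[200];

    format_oversize_error(buf, sizeof(buf), 3000, 0, "idx");
    return strstr(buf, "btree version 3 maximum 2712") != NULL;
}
```

The errdetail/errhint/errtableconstraint parts are already identical, so only the errmsg call needs this treatment.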

> #define BTREE_METAPAGE 0 /* first page is meta */
> #define BTREE_MAGIC 0x053162 /* magic number of btree pages */
> -#define BTREE_VERSION 3 /* current version number */
> +#define BTREE_VERSION 4 /* current version number */
> #define BTREE_MIN_VERSION 2 /* minimal supported version number */
> +#define BTREE_META_VERSION 3 /* minimal version with all meta fields */

BTREE_META_VERSION is a strange name for version 3. I think this
deserves a more verbose comment, above these #defines, to list all the
versions and their differences.
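Something along these lines, maybe. The per-version summaries here are my own reading of the patch series and of existing behavior, so double-check them before adopting this:

```c
#define BTREE_METAPAGE 0          /* first page is meta */
#define BTREE_MAGIC 0x053162      /* magic number of btree pages */

/*
 * Btree version history:
 *
 * 2    oldest supported version; metapage lacks the fields added in
 *      version 3, and is upgraded in place when first written to
 * 3    all metapage fields present (hence BTREE_META_VERSION), but
 *      heap TID is not part of the key space
 * 4    current version; heap TID participates in comparisons as a
 *      tiebreaker key attribute, and pivot tuples are suffix-truncated
 */
#define BTREE_VERSION 4           /* current version number */
#define BTREE_MIN_VERSION 2       /* minimal supported version number */
#define BTREE_META_VERSION 3      /* minimal version with all meta fields */
```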

v9-0003-Pick-nbtree-split-points-discerningly.patch commit message:
> Add infrastructure to determine where the earliest difference appears
> among a pair of tuples enclosing a candidate split point.

I don't understand this sentence.

> _bt_findsplitloc() is also taught to care about the case where there are
> many duplicates, making it hard to find a distinguishing split point.
> _bt_findsplitloc() may even conclude that it isn't possible to avoid
> filling a page entirely with duplicates, in which case it packs pages
> full of duplicates very tightly.

Hmm. Is the assumption here that if a page is full of duplicates, there
will be no more insertions into that page? Why?

> The number of cycles added is not very noticeable, which is important,
> since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
> lock is held. We avoid using authoritative insertion scankey
> comparisons, unlike suffix truncation proper.

What do you do instead, then? memcmp? (Reading the patch: yes.)
Suggestion: "We use a faster binary comparison, instead of a proper
datatype-aware comparison, for speed".

Aside from performance, it would feel inappropriate to call user-defined
code while holding a buffer lock, anyway.
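For the archives, the binary-comparison idea amounts to something like this. It's a simplified sketch, not the patch's actual code: attributes are assumed to be already extracted into pointer/length pairs, which glosses over the real tuple-deforming step:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flattened attribute: raw bytes plus length. */
typedef struct AttrDatum
{
    const void *data;
    size_t      len;
} AttrDatum;

/*
 * Return the 1-based number of the first attribute where the two tuples
 * differ, or natts + 1 if all key attributes are binary-equal (meaning a
 * heap TID would be needed to tell them apart).  Uses raw memcmp rather
 * than invoking datatype comparison functions under the buffer lock.
 */
static int
first_diff_attribute(const AttrDatum *left, const AttrDatum *right, int natts)
{
    for (int i = 0; i < natts; i++)
    {
        if (left[i].len != right[i].len ||
            memcmp(left[i].data, right[i].data, left[i].len) != 0)
            return i + 1;
    }
    return natts + 1;
}

/* ("foo","bar") vs ("foo","baz"): first difference is attribute 2. */
static int
demo_first_diff(void)
{
    AttrDatum left[2]  = {{"foo", 3}, {"bar", 3}};
    AttrDatum right[2] = {{"foo", 3}, {"baz", 3}};

    return first_diff_attribute(left, right, 2);
}
```

Note that binary equality is conservative in the right direction: datums that are binary-equal must be equal to the comparator, while binary-unequal datums might still compare as equal, which only costs a missed truncation opportunity.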

> +There is sophisticated criteria for choosing a leaf page split point. The
> +general idea is to make suffix truncation effective without unduly
> +influencing the balance of space for each half of the page split. The
> +choice of leaf split point can be thought of as a choice among points
> +*between* items on the page to be split, at least if you pretend that the
> +incoming tuple was placed on the page already, without provoking a split.

I'd leave out the ", without provoking a split" part. Or maybe reword to
"if you pretend that the incoming tuple fit and was placed on the page
already".

> +Choosing the split point between two index tuples with differences that
> +appear as early as possible results in truncating away as many suffix
> +attributes as possible.

It took me a while to understand what the "appear as early as possible"
means here. It's talking about a multi-column index, and about finding a
difference in one of the leading key columns. Not, for example, about
finding a split point early in the page.

> An array of acceptable candidate split points
> +(points that balance free space on either side of the split sufficiently
> +well) is assembled in a pass over the page to be split, sorted by delta.
> +An optimal split point is chosen during a pass over the assembled array.
> +There are often several split points that allow the maximum number of
> +attributes to be truncated away -- we choose whichever one has the lowest
> +free space delta.

Perhaps we should leave out these details in the README, and explain
this in the comments of the picksplit-function itself? In the README, I
think a more high-level description of what things are taken into
account when picking the split point, would be enough.

> +Suffix truncation is primarily valuable because it makes pivot tuples
> +smaller, which delays splits of internal pages, but that isn't the only
> +reason why it's effective.

Suggestion: reword to "... , but that isn't the only benefit" ?

> There are cases where suffix truncation can
> +leave a B-Tree significantly smaller in size than it would have otherwise
> +been without actually making any pivot tuple smaller due to restrictions
> +relating to alignment.

Suggestion: reword to "... smaller in size than it would otherwise be,
without ..."

and "without making any pivot tuple *physically* smaller, due to alignment".

This sentence is a bit of a cliffhanger: what are those cases, and how
is that possible?

> The criteria for choosing a leaf page split point
> +for suffix truncation is also predictive of future space utilization.

How so? What does this mean?

> +Furthermore, even truncation that doesn't make pivot tuples smaller still
> +prevents pivot tuples from being more restrictive than truly necessary in
> +how they describe which values belong on which pages.

Ok, I guess these sentences resolve the cliffhanger I complained about.
But this still feels like magic. When you split a page, all of the
keyspace must belong on the left or the right page. Why does it make a
difference to space utilization, where exactly you split the key space?

> +While it's not possible to correctly perform suffix truncation during
> +internal page splits, it's still useful to be discriminating when splitting
> +an internal page. The split point that implies a downlink be inserted in
> +the parent that's the smallest one available within an acceptable range of
> +the fillfactor-wise optimal split point is chosen. This idea also comes
> +from the Prefix B-Tree paper. This process has much in common with to what
> +happens at the leaf level to make suffix truncation effective. The overall
> +effect is that suffix truncation tends to produce smaller and less
> +discriminating pivot tuples, especially early in the lifetime of the index,
> +while biasing internal page splits makes the earlier, less discriminating
> +pivot tuples end up in the root page, delaying root page splits.

Ok, so this explains it further, I guess. I find this paragraph
difficult to understand, though. The important thing here is the idea
that some split points are more "discriminating" than others, but I
think it needs some further explanation. What makes a split point more
discriminating? Maybe add an example.

> +Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
> +tuple that it's based on (the first item on the right page), since a heap
> +TID must be appended when nothing else distinguishes each side of a leaf
> +split. Truncation cannot simply reuse the leaf level representation: we
> +must append an additional attribute, rather than incorrectly leaving a heap
> +TID in the generic IndexTuple item pointer field. (The field is already
> +used by pivot tuples to store their downlink, plus some additional
> +metadata.)

That's not really the fault of suffix truncation as such, but the
process of turning a leaf tuple into a pivot tuple. It would happen even
if you didn't truncate anything.

I think this point, that we have to store the heap TID differently in
pivot tuples, would deserve a comment somewhere else, too. While reading
the patch, I didn't realize that that's what we're doing, until I read
this part of the README, even though I saw the new code to deal with
heap TIDs elsewhere in the code. Not sure where, maybe in
BTreeTupleGetHeapTID().

> +Adding a heap TID attribute during a leaf page split should only occur when
> +the page to be split is entirely full of duplicates (the new item must also
> +be a duplicate). The logic for selecting a split point goes to great
> +lengths to avoid heap TIDs in pivots --- "many duplicates" mode almost
> +always manages to pick a split point between two user-key-distinct tuples,
> +accepting a completely lopsided split if it must.

This is the first mention of "many duplicates" mode. Maybe just say
"_bt_findsplitloc() almost always ..." or "The logic for selecting a
split point goes to great lengths to avoid heap TIDs in pivots, and
almost always manages to pick a split point between two
user-key-distinct tuples, accepting a completely lopsided split if it must."

> Once appending a heap
> +TID to a split's pivot becomes completely unavoidable, there is a fallback
> +strategy --- "single value" mode is used, which makes page splits pack the
> +new left half full by using a high fillfactor. Single value mode leads to
> +better overall space utilization when a large number of duplicates are the
> +norm, and thereby also limits the total number of pivot tuples with an
> +untruncated heap TID attribute.

This assumes that tuples are inserted in increasing TID order, right?
Seems like a valid assumption, no complaints there, but it's an
assumption nevertheless.

I'm not sure if this level of detail is worthwhile in the README. This
logic on deciding the split point is all within the _bt_findsplitloc()
function, so maybe put this explanation there. In the README, a more
high-level explanation of what things _bt_findsplitloc() considers,
should be enough.

_bt_findsplitloc(), and all its helper structs and subroutines, are
about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it
would be a good idea to move it to a whole new nbtsplitloc.c file? It's
a very isolated piece of code.

In the comment on _bt_leave_natts_fast():

> + * Testing has shown that an approach involving treating the tuple as a
> + * decomposed binary string would work almost as well as the approach taken
> + * here. It would also be faster. It might actually be necessary to go that
> + * way in the future, if suffix truncation is made sophisticated enough to
> + * truncate at a finer granularity (i.e. truncate within an attribute, rather
> + * than just truncating away whole attributes). The current approach isn't
> + * markedly slower, since it works particularly well with the "perfect
> + * penalty" optimization (there are fewer, more expensive calls here). It
> + * also works with INCLUDE indexes (indexes with non-key attributes) without
> + * any special effort.

That's an interesting tidbit, but I'd suggest just removing this comment
altogether. It's not really helping to understand the current
implementation.

v9-0005-Add-high-key-continuescan-optimization.patch commit message:

> Note that even pre-pg_upgrade'd v3 indexes make use of this
> optimization.

.. but we're missing the other optimizations that make it more
effective, so it probably won't do much for v3 indexes. Does it make
them slower? It's probably acceptable, even if there's a tiny
regression, but I'm curious.

----- laundry list ends -----

- Heikki
