Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-06-14 18:44:46
Lists: pgsql-hackers

I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y (Lehman &
Yao). This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.

I attach an unfinished prototype of suffix truncation, which also
sometimes *adds* a new attribute to pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

Note that the approach to suffix truncation that I've taken isn't even
my preferred approach [3] -- it's a medium-term solution that enables
making a heap TID attribute part of the key space, which enables
everything else. Cheap incremental/retail tuple deletion is the real
prize here; don't lose sight of that when looking through my patch. If
we're going to teach nbtree to truncate this new implicit heap TID
attribute, which seems essential, then we might as well teach nbtree
to do suffix truncation of other (user-visible) attributes while we're
at it. This patch isn't a particularly effective implementation of
suffix truncation, because that's not what I'm truly interested in
improving here (plus I haven't even bothered to optimize the logic for
picking a split point in light of suffix truncation).
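To make the split-point idea concrete, here is a toy sketch (the
representation and names are mine, not the patch's): the new pivot
keeps only the prefix of key attributes needed to distinguish the last
tuple on the left half from the first tuple on the right half, and a
heap TID is appended only when every user-visible attribute ties.

```python
def build_pivot(lastleft, firstright):
    """Illustrative sketch, not the patch's actual code.  Tuples are
    modeled as (attrs, heap_tid) pairs; attrs is a tuple of key
    attribute values."""
    left_attrs, left_tid = lastleft
    right_attrs, _ = firstright
    for n, (l, r) in enumerate(zip(left_attrs, right_attrs), start=1):
        if l != r:
            # First distinguishing attribute found: truncate the
            # remaining, non-distinguishing attributes away.
            return (right_attrs[:n], None)
    # All user-visible attributes are equal: add a heap TID so the
    # pivot remains distinct (the "*adds* a new attribute" case).
    return (right_attrs, left_tid)
```

For example, splitting between ("usa", "ca", "sf") and ("usa", "ny",
"nyc") keeps only the two-attribute prefix ("usa", "ny") with no heap
TID, whereas splitting between exact duplicates keeps everything plus
a TID.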


This patch adds amcheck coverage, which seems like essential
infrastructure for developing a feature such as this. Extensive
amcheck coverage gave me confidence in my general approach. The basic
idea, invariant-wise, is to treat truncated attributes (often
including a truncated heap TID attribute in internal pages) as "minus
infinity" attributes, which participate in comparisons if and only if
we reach such attributes before the end of the scan key (a smaller
keysz for the index scan could prevent this). I've generalized the
minus infinity concept that _bt_compare() has always considered as a
special case, extending it to individual attributes. It's actually
possible to remove that old hard-coded _bt_compare() logic with this
patch applied without breaking anything, since we can rely on the
comparison of an explicitly 0-attribute tuple working the same way
(pg_upgrade'd databases will break if we do this, however, so I didn't
go that far).
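The generalized "minus infinity" rule can be sketched like this (again
a toy model, not _bt_compare() itself): a truncated attribute sorts
before any real scan key value, so the comparison is decided the
moment the scan key reaches a truncated attribute, while a scan key
that runs out first (smaller keysz) yields equality.

```python
def compare_key_to_pivot(scankey, pivot_attrs):
    """Sketch of the "minus infinity" attribute rule.  Returns a
    negative, zero, or positive result, like _bt_compare()."""
    for i, key_val in enumerate(scankey):
        if i >= len(pivot_attrs):
            # Reached a truncated attribute before exhausting the scan
            # key: "minus infinity" sorts before any real value, so
            # the scan key compares as greater.
            return 1
        if key_val != pivot_attrs[i]:
            return -1 if key_val < pivot_attrs[i] else 1
    # Scan key exhausted first, or an exact tie: equal.  A smaller
    # keysz prevents truncated attributes from participating at all.
    return 0
```

Note that a 0-attribute pivot compares as equal to a 0-attribute scan
key and less than everything else, which is why the old hard-coded
minus-infinity special case becomes redundant in principle.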

Note that I didn't change the logic that has _bt_binsrch() treat
internal pages in a special way when tuples compare as equal. We still
need that logic for cases where keysz is less than the number of
indexed columns. It's only possible to avoid this _bt_binsrch() behavior
for internal pages when all attributes, including heap TID, were
specified and compared (an insertion scan key has to have an entry for
every indexed column, including even heap TID). Doing better there
doesn't seem worth the trouble of teaching _bt_compare() to tell the
_bt_binsrch() caller about this as a special case. That means that we
still move left on equality in some cases where it isn't strictly
necessary, contrary to L&Y. However, amcheck verifies that the classic
"Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1")
when verifying parent/child relationships, which demonstrates that I
have restored the classic invariant (I just don't find it worthwhile
to take advantage of it within _bt_binsrch() just yet).
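The move-left-on-equality behavior amounts to a binary search that
treats "equal" like "less" on internal pages. A toy sketch (mine, not
the actual _bt_binsrch() code, which works on page offsets):

```python
def internal_page_binsrch(sep_keys, scankey, cmp):
    """Sketch of _bt_binsrch()'s internal-page behavior: on equality
    with a separator key we still move left, since with keysz smaller
    than the number of indexed columns "equal" may only be a tie on a
    prefix of the key space."""
    lo, hi = 0, len(sep_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if cmp(scankey, sep_keys[mid]) > 0:
            lo = mid + 1        # strictly greater: descend right
        else:
            hi = mid            # equal or less: move left
    return lo  # conceptually, which downlink to follow
```

With duplicate separator keys such as [10, 20, 20, 30], a search for
20 lands left of the first 20, even though the restored "Ki < v <=
Ki+1" invariant would sometimes permit descending further right.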

Most of this work was done while I was an employee of VMware, though I
joined Crunchy Data on Monday and cleaned it up a bit more since then.
I'm excited about joining Crunchy, but I should also acknowledge
VMware's strong support of my work.

Peter Geoghegan

Attachment Content-Type Size
0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch application/octet-stream 98.7 KB

