Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-09-19 18:23:11
Message-ID: CAH2-WzkfK=JVHjkd17TLDvsFb6psduTt5WYiT8dg+-UFc+rSSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is v5, which significantly simplifies the _bt_findsplitloc()
logic. It's now a great deal easier to follow. It would be helpful if
someone could do code-level review of the overhauled
_bt_findsplitloc(). That's the most important part of the patch. It
involves relatively subjective trade-offs around total effort spent
during a page split, space utilization, and avoiding "false sharing"
(I call the situation where a range of duplicate values straddle two
leaf pages unnecessarily "false sharing", since it necessitates that
subsequent index scans visit two index scans rather than just one,
even when that's avoidable.)

This version has slightly improved performance, especially for cases
where an index gets bloated without any garbage being generated. With
the UK land registry data [1], an index on (county, city, locality) is
shrunk by just over 18% by the new logic (I recall that it was shrunk
by ~15% in an earlier version). In concrete terms, it goes from being
1.288 GiB on master to being 1.054 GiB with v5 of the patch. This is
mostly because the patch intelligently packs together duplicate-filled
pages tightly (in particular, it avoids "getting tired"), but also
because it makes pivots less restrictive about where leaf tuples can
go. I still manage to shrink the largest TPC-C and TPC-H indexes by at
least 5% following an initial load performed by successive INSERTs.
Those are unique indexes, so the benefits are certainly not limited to
cases involving many duplicates.

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of
operation: Regular/default mode, many duplicates mode, and single
value mode. The higher level split code always asks for a default mode
call to _bt_findsplitloc(), so that's always where we start. For leaf
page splits, _bt_findsplitloc() will occasionally call itself
recursively in either many duplicates mode or single value mode. This
happens when the default strategy doesn't work out.

* Default mode almost does what we do already, but remembers the top n
candidate split points, sorted by the delta between left and right
post-split free space, rather than just looking for the overall lowest
delta split point.

Then, we go through a second pass over the temp array of "acceptable"
split points, that considers the needs of suffix truncation.

* Many duplicates mode is used when we fail to find a "distinguishing"
split point in regular mode, but have determined that it's possible to
get one if a new, exhaustive search is performed.

We go to great lengths to avoid having to append a heap TID to the new
left page high key -- that's what I mean by "distinguishing". We're
particularly concerned with false sharing by subsequent point lookup
index scans here.

* Single value mode is used when we see that even many duplicates mode
would be futile, as the leaf page is already *entirely* full of
logical duplicates.

Single value mode isn't exhaustive, since there is clearly nothing to
exhaustively search for. Instead, it packs together as many tuples as
possible on the right side of the split. Since heap TIDs sort in
descending order, this is very much like a "leftmost" split that tries
to free most of the space on the left side, and pack most of the page
contents on the right side. Except that it's leftmost, and in
particular is leftmost among pages full of logical duplicates (as
opposed to being leftmost/rightmost among pages on an entire level of
the tree, as with the traditional rightmost 90:10 split thing).

Other changes
-------------

* I now explicitly use fillfactor in the manner of a rightmost split
to get the single value mode behavior.

I call these types of splits (rightmost and single value mode splits)
"weighted" splits in the patch. This is much more consistent with our
existing conventions than my previous approach.

* Improved approached to inexpensively determining how effective
suffix truncation will be for a given candidate split point.

I no longer naively probe the contents of index tuples to do char
comparisons. Instead, I use a tuple descriptor to get offsets to each
attribute in each tuple in turn, then calling to datumIsEqual() to
determine if they're equal. This is almost as good as a full scan key
comparison. This actually seems to be a bit faster, and also takes
care of INCLUDE indexes without special care (no need to worry about
probing non-key attributes, and reaching a faulty conclusion about
which split point helps with suffix truncation).

I still haven't managed to add pg_upgrade support, but that's my next
step. I am more or less happy with the substance of the patch in v5,
and feel that I can now work backwards towards figuring out the best
way to deal with on-disk compatibility. It shouldn't be too hard --
most of the effort will involve coming up with a good test suite.

[1] https://wiki.postgresql.org/wiki/Sample_Databases
--
Peter Geoghegan

Attachment Content-Type Size
v5-0003-Allow-nbtree-to-use-ASC-heap-TID-attribute-order.patch application/octet-stream 6.8 KB
v5-0002-Add-temporary-page-split-debug-LOG-instrumentatio.patch application/octet-stream 4.1 KB
v5-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch application/octet-stream 134.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fred Pratt 2018-09-19 18:34:51 Re: Code of Conduct
Previous Message Stephen Frost 2018-09-19 18:07:37 Re: Code of Conduct