Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-10-03 23:39:24
Message-ID: CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Sep 30, 2018 at 2:33 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Now benchmarking is not clear. Possible performance degradation from TID
> > ordering interferes with positive effects from the optimizations in a
> > non-trivial way.
>
> Is there any evidence of a regression in the last 2 versions?

I did find a pretty clear regression, though only with writes to
unique indexes. Attached is v6, which fixes the issue. More on that
below.

v6 also:

* Adds a new-to-v6 "split at new item's insertion point"
optimization, which is broken out into its own commit.

This *greatly* improves the index bloat situation with the TPC-C
benchmark in particular, even before the benchmark starts (just with
the initial bulk load). See the relevant commit message for full
details, or a couple of my previous mails on this thread. I will
provide my own TPC-C test data + test case to any reviewer who wants
to see this for themselves. It shouldn't be hard to verify the
improvement in raw index size with any TPC-C implementation, though.
Please make an off-list request if you're interested. The raw dump is
1.8GB.

The exact details of when this new optimization kicks in and how it
works are tentative. They should really be debated. Reviewers should
try to think of edge cases in which my "heap TID adjacency" approach
could make the optimization kick in when it shouldn't -- cases where
it causes bloat rather than preventing it. I couldn't find any such
regressions, but this code was written very recently.
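
To make that concrete, here is a toy sketch of the kind of adjacency
test I have in mind (the function, the fields, and the slop threshold
are all illustrative only -- the real heuristic is in the v6-0003-*
patch):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustrative only: do the heap TIDs around the new item's insertion
 * point look like part of a sequential insertion pattern?  If so,
 * splitting at the new item's exact insertion point should leave both
 * halves of the split densely packed, rather than leaving behind a
 * half-empty left page.
 */
static bool
heap_tids_look_adjacent(ItemPointer left, ItemPointer newitem,
                        ItemPointer right)
{
    BlockNumber leftblk = ItemPointerGetBlockNumber(left);
    BlockNumber newblk = ItemPointerGetBlockNumber(newitem);
    BlockNumber rightblk = ItemPointerGetBlockNumber(right);

    /* heap block numbers must increase monotonically ... */
    if (leftblk > newblk || newblk > rightblk)
        return false;

    /* ... and stay within some small, arbitrary amount of slop */
    return rightblk - leftblk <= 3;
}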

I should also look into using HammerDB to do a real TPC-C benchmark,
and really put the patch to the test. Does anybody have experience
with it?

* Generally groups everything into a relatively manageable series of
cumulative improvements, starting with the infrastructure required to
physically truncate tuples correctly, without any of the smarts around
selecting a split point.

The base patch is useless on its own, since the split point selection
smarts are needed to see a consistent benefit.
Reviewers shouldn't waste their time doing any real benchmarking with
just the first patch applied.

* Adds a lot of new information to the nbtree README, about the
high-level thought process behind the design, including citing the
classic paper that this patch was primarily inspired by.

* Adds a new, dedicated insertion scan key struct --
BTScanInsert[Data]. This is passed around to a number of different
routines (_bt_search(), _bt_binsrch(), _bt_compare(), etc.). This was
suggested by Andrey, and also requested by Peter Eisentraut.
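
For anyone reading along without applying the patch, the struct is
shaped roughly like this (a simplified sketch; field names and
comments are approximate -- see v6-0001-* for the real definition):

#include "postgres.h"
#include "access/skey.h"
#include "storage/itemptr.h"

/*
 * Rough sketch of the dedicated insertion scan key struct.  It bundles
 * up everything that a descent of the tree needs, including the heap
 * TID tiebreaker attribute that the patch adds to the key space.
 */
typedef struct BTScanInsertData
{
    bool        nextkey;        /* descend to first item > scankeys? */
    ItemPointer scantid;        /* heap TID tiebreaker, or NULL */
    int         keysz;          /* number of filled-in scankeys */
    ScanKeyData scankeys[INDEX_MAX_KEYS];   /* must appear last */
} BTScanInsertData;

typedef BTScanInsertData *BTScanInsert;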

While this BTScanInsert work started out as straightforward
refactoring, it actually led to my discovering and fixing the
regression I mentioned. Previously, I passed a lower bound on a binary
search to _bt_binsrch() within _bt_findinsertloc(). This wasn't nearly
as effective as what the master branch does for unique indexes at the
same point -- it usually manages to reuse a result from an earlier
_bt_binsrch() as the offset for the new tuple, since it has no need to
worry about the new tuple's position *among duplicates* on the page.
In earlier versions of my patch, most of the work of a second binary
search took place, despite being redundant and unnecessary. This
happened for every new insertion into a non-unique index -- I could
easily measure the problem with a simple serial test case. I can see
no regression there against master now, though.

My fix for the regression involves including some mutable state in the
new BTScanInsert struct (within v6-0001-*patch), to explicitly
remember and restore some internal details across two binary searches
against the same leaf page. We now remember a useful lower *and* upper
bound within _bt_binsrch(), which is what is truly required to fix the
regression. While there is still a second call to _bt_binsrch() within
_bt_findinsertloc() for unique indexes, it will do no comparisons in
the common case where there are no existing dead duplicate tuples in
the unique index. This means that the number of _bt_compare() calls we
get in this _bt_findinsertloc() unique index path is the same as the
master branch in almost all cases (I instrumented the regression tests
to make sure of this). I also think that having BTScanInsert will ease
things around pg_upgrade support, something that remains an open item.
Changes in this area seem to make everything clearer -- the signature
of _bt_findinsertloc() seemed a bit jumbled to me.
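
Stripped of the nbtree specifics, the trick amounts to nothing more
than a binary search that remembers where it converged. A toy,
self-contained version of the idea (not patch code):

#include <stdbool.h>
#include <stddef.h>

/* Mutable search bounds, analogous to the state cached in BTScanInsert */
typedef struct SearchBounds
{
    bool        valid;
    size_t      low;
    size_t      high;
} SearchBounds;

/*
 * Return the first offset whose element is >= target.  The final
 * bounds are saved on the way out, so a repeat search of the same
 * (unchanged) array restarts with low == high and performs zero
 * comparisons -- just as the second _bt_binsrch() call against the
 * same leaf page now does in the common case.
 */
static size_t
binsrch_cached(const int *arr, size_t nelems, int target, SearchBounds *b)
{
    size_t      low = b->valid ? b->low : 0;
    size_t      high = b->valid ? b->high : nelems;

    while (low < high)
    {
        size_t      mid = low + (high - low) / 2;

        if (arr[mid] < target)
            low = mid + 1;
        else
            high = mid;
    }

    b->valid = true;
    b->low = low;
    b->high = high;

    return low;
}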

Aside: I think that this BTScanInsert mutable state idea could be
pushed even further in the future. "Dynamic prefix truncation" could
be implemented by taking a similar approach when descending composite
indexes for an index scan (doesn't have to be a unique index). We can
observe that earlier attributes must all be equal to our own scankey's
values once we descend the tree and pass between a pair of pivot
tuples where a common prefix (some number of leading attributes) is
fully equal. It's safe to just not bother comparing these prefix
attributes on lower levels, because we can reason about their values
transitively; _bt_compare() can be told to always skip the first
attribute or two during later/lower-in-the-tree binary searches. This
idea will not be implemented for Postgres v12 by me, though.
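
As a toy illustration of the aside (plain ints standing in for index
attributes -- again, not patch code): once the descent has proven the
first skipattrs attributes equal, comparisons on lower levels can
simply begin at attribute skipattrs:

/*
 * Toy dynamic prefix truncation: compare scankey against tuple while
 * skipping a prefix of attributes already proven equal transitively
 * during the descent.  Returns <0, 0, or >0, like _bt_compare().
 */
static int
compare_skipping_prefix(const int *scankey, const int *tuple,
                        int natts, int skipattrs)
{
    for (int att = skipattrs; att < natts; att++)
    {
        if (scankey[att] < tuple[att])
            return -1;
        if (scankey[att] > tuple[att])
            return 1;
    }
    return 0;
}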

--
Peter Geoghegan

Attachment Content-Type Size
v6-0005-DEBUG-Add-pageinspect-instrumentation.patch application/x-patch 7.8 KB
v6-0006-DEBUG-Allow-nbtree-to-use-ASC-heap-TID-order.patch application/x-patch 7.0 KB
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch application/x-patch 12.0 KB
v6-0004-DEBUG-Add-page-split-instrumentation.patch application/x-patch 3.6 KB
v6-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch application/x-patch 40.3 KB
v6-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch application/x-patch 137.4 KB
