Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-02-11 20:54:30
Message-ID: CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 5, 2019 at 4:50 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Heikki and I had the opportunity to talk about this recently. We found
> an easy way forward. I believe that the nbtsplitloc.c algorithm itself
> is fine -- the code will need to be refactored, though.

Attached v12 does not include this change, though I have every
intention of doing the refactoring described for v13. The
nbtsplitloc.c/split algorithm refactoring would necessitate
revalidating the patch's performance, though, which didn't seem worth
blocking on. Besides, there was bit rot that needed to be fixed.

Notable improvements in v12:

* No more papering-over regression test differences caused by
pg_depend issues, thanks to recent work by Tom (today's commit
1d92a0c9).

* I simplified the code added to _bt_binsrch() to deal with saving and
restoring binary search bounds for _bt_check_unique()-caller
insertions (this is from first/"Refactor nbtree insertion scankeys"
patch). I also improved matters within _bt_check_unique() itself: the
early "break" there (based on reaching the known strict upper bound
from cache binary search) works in terms of the existing
_bt_check_unique() loop invariant.

This even allowed me to add a new assertion that makes sure that
breaking out of the loop early is correct -- we call _bt_isequal() for
next item on assert-enabled builds when we break having reached strict
upper bound established by initial binary search. In other words,
_bt_check_unique() ends up doing the same number of _bt_isequal()
calls as it did on the master branch, provided assertions are enabled.

* I've restored regression test coverage that the patch previously
inadvertently took away. Suffix truncation made deliberately-tall
B-Tree indexes from the regression tests much shorter, making the
tests fail to test the code paths the tests originally targeted. I
needed to find ways to "defeat" suffix truncation so I still ended up
with a fairly tall tree that hit various code paths.

I think that we went from having 8 levels in btree_tall_idx (i.e.
ridiculously many) to having only a single root page when I first
caught the problem! Now btree_tall_idx only has 3 levels, which is all
we really need. Even multi-level page deletion didn't have any
coverage in previous versions. I used gcov to specifically verify that
we have good multi-level page deletion coverage. I also used gcov to
make sure that we have coverage of the v11 "cache rightmost block"
optimization, since I noticed that that was missing (though present on
the master branch) -- that's actually all that the btree_tall_idx
tests in the patch, since multi-level page deletion is covered by a
covering-indexes-era test. Finally, I made sure that we have coverage
of fast root splits. In general, I preserved the original intent
behind the existing tests, all of which I was fairly familiar with
from previous projects.

* I've added a new "relocate" bt_index_parent_check()/amcheck option,
broken out in a separate commit. This new option makes verification
relocate each and every leaf page tuple, starting from the root each
time. This means that there will be at least one piece of code that
specifically relies on "every tuple should have a unique key" from the
start, which seems like a good idea.

This enhancement to amcheck allows me to detect various forms of
corruption that no other existing verification option would catch. In
particular, I can catch various very subtle "cross-cousin
inconsistencies" that require that we verify a page using its
grandparent rather than its parent [1] (existing checks catch some but
not all "cousin problem" corruption). Simply put, this amcheck
enhancement allows me to detect corruption of the least significant
byte in a key value in the root page -- that kind of corruption will
cause index scans to miss only a small number of tuples at the leaf
level. Maybe this scenario isn't realistic, but I'd rather not take
any chances.

* I rethought the "single value mode" fillfactor, which I've been
suspicious of for a while now. It's now 96, down from 99.

Micro-benchmarks involving concurrent sessions inserting into a low
cardinality index led me to the conclusion that 99 was aggressively
high. It was not that hard to get excessive page splits with these
microbenchmarks, since insertions with monotonically increasing heap
TIDs arrived a bit out of order with a lot of concurrency. 99 worked a
bit better than 96 with only one session, but significantly worse with
concurrent sessions. I still think that it's a good idea to be more
aggressive than default leaf fillfactor, but reducing "single value
mode" fillfactor to 90 (or whatever the user set general leaf
fillfactor to) wouldn't be so bad.

[1] http://subs.emis.de/LNI/Proceedings/Proceedings144/32.pdf
--
Peter Geoghegan

Attachment Content-Type Size
v12-0007-DEBUG-Add-pageinspect-instrumentation.patch application/x-patch 7.8 KB
v12-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch application/x-patch 16.9 KB
v12-0004-Add-split-after-new-tuple-optimization.patch application/x-patch 9.7 KB
v12-0005-Add-high-key-continuescan-optimization.patch application/x-patch 8.6 KB
v12-0003-Pick-nbtree-split-points-discerningly.patch application/x-patch 53.7 KB
v12-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch application/x-patch 153.1 KB
v12-0001-Refactor-nbtree-insertion-scankeys.patch application/x-patch 55.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ramanarayana 2019-02-11 20:57:31 Re: BUG #15548: Unaccent does not remove combining diacritical characters
Previous Message Tom Lane 2019-02-11 20:38:47 Re: [Patch] Log10 and hyperbolic functions for SQL:2016 compliance