Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-01-24 01:44:41
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Tue, Jan 8, 2019 at 4:47 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.

> * Addresses Heikki's concerns about locking the metapage more
> frequently in a general way. Comments are added to nbtpage.c, and
> updated in a number of places that already talk about the same risk.

Attached is v11 of the patch, which removes the comments mentioned
here, and instead finds a way to not do new things with buffer locks.


* We simply avoid holding buffer locks while accessing the metapage.
(Of course, the old root page split stuff still does this -- it has
worked that way forever.)

* We also avoid calling index_getprocinfo() with any buffer lock held.
We'll still call support function 1 with a buffer lock held to
truncate, but that's not new -- *any* insertion will do that.

For some reason I got stuck on the idea that we need to use a
scankey's own values within _bt_truncate()/_bt_keep_natts() by
generating a new insertion scankey every time. We now simply ignore
those values, and call the comparator with pairs of tuples that each
come from the page directly. Usually, we'll just reuse the insertion
scankey that we were using for the insertion anyway (we no longer
build our own scankey for truncation). Other times, we'll build an
empty insertion scankey (one that has the function pointer and so on,
but no values). The only downside is that I cannot have an assertion
that calls _bt_compare() to make sure we truncated correctly there and
then, since a dedicated insertion scankey is no longer conveniently

I feel rather silly for not having gone this way from the beginning,
because the new approach is quite obviously simpler and safer.
nbtsort.c now gets a reusable, valueless insertion scankey that it
uses for both truncation and for setting up a merge of the two spools
for unique index builds. This approach allows me to remove
_bt_mkscankey_nodata() altogether -- callers build a "nodata"
insertion scankey with empty values by passing _bt_mkscankey() a NULL
tuple. That's equivalent to having an insertion scankey built from an
all-attributes-truncated tuple. IOW, the patch now makes the "nodata"
stuff a degenerate case of building a scankey from a
truncated-attributes tuple. tuplesort.c also uses the new BTScanInsert
struct. There is no longer any case where there in an insertion
scankey that isn't accessed using the BTScanInsert struct.

* No more pg_depend tie-breaker column commit. That was an ugly hack,
that I'm glad to be rid of -- many thanks to Tom for working through a
number of test instability issues that affected the patch. I do still
need to paper-over one remaining regression test issue/bug that the
patch happens to unmask, pending Tom fixing it directly [1]. This
papering-over is broken out into its own commit
("v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch"). I
expect that Tom will fix the bug before too long, at which point the
temporary work around can just be reverted from your local tree.


I feel that this version is pretty close to being commitable, since
everything about the design is settled. It completely avoids saying
anything new about buffer locking protocols, LWLock deadlock safety,
etc. VACUUM and crash recovery are also unchanged, so subtle bugs
should at least not be too hard to reproduce when observed once. It's
pretty complementary code: the new logic for picking a split point
builds a list of candidate split points using the old technique, with
a second pass to choose the best one for suffix truncation among the
accumulated list. Hard to see how that could introduce an invalid
split point choice.

I take the risk of introducing new corruption bugs very seriously.
contrib/amcheck now verifies all aspects of the new on-disk
representation. The stricter Lehman & Yao style invariant ("the
subtree S is described by Ki < v <= Ki + 1 ...") allows amcheck to be
stricter in what it will accept (e.g., heap TID needs to be in order
among logical duplicates, we always expect to see a representation of
the number of pivot tuple attributes, and we expect the high key to be
strictly greater than items on internal pages).


It would be very helpful if a reviewer such as Heikki or Alexander
could take a look at the patch once more. I suggest that they look at
the following points in the patch:

* The minusinfkey stuff, which is explained within _bt_compare(), and
within nbtree.h header comments. Page deletion by VACUUM is the only
_bt_search() caller that sets minusinfkey to true (though older
versions of btree and amcheck also set minusinfkey to true).

* Does the value of BTREE_SINGLEVAL_FILLFACTOR make sense? Am I being
a little too aggressive there, possibly hurting workloads where HOT
pruning occurs periodically? Sane duplicate handling is the most
compelling improvement that the patch makes, but I may still have been
a bit too aggressive in packing pages full of duplicates so tightly. I
figured that that was the closest thing to the previous behavior
that's still reasonable.

* Does the _bt_splitatnewitem() criteria for deciding if we should
split at the point the new tuple is positioned at miss some subtlety?
It's important that splitting at the new item when we shouldn't
doesn't happen, or hardly ever happens -- it should be
*self-limiting*. This was tested using BenchmarkSQL/TPC-C [2] -- TPC-C
has a workload where this particular enhancement makes indexes a lot

* There was also testing of index bloat following bulk insertions,
based on my own custom test suite. Data and indexes were taken from
TPC-C tables, TPC-H tables, TPC-E tables, UK land registry data [3],
and the Mouse Genome Database Project (Postgres schema + indexes) [4].
This takes almost an hour to run on my development machine, though the
most important tests finish in less than 5 minutes. I can provide
access to all or some of these tests, if reviewers are interested and
are willing to download several gigabytes of sample data that I'll
provide privately.

Peter Geoghegan

Attachment Content-Type Size
v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch application/x-patch 4.2 KB
v11-0001-Refactor-nbtree-insertion-scankeys.patch application/x-patch 54.2 KB
v11-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch application/x-patch 141.4 KB
v11-0004-Pick-nbtree-split-points-discerningly.patch application/x-patch 53.6 KB
v11-0007-DEBUG-Add-pageinspect-instrumentation.patch application/x-patch 7.8 KB
v11-0006-Add-high-key-continuescan-optimization.patch application/x-patch 8.6 KB
v11-0005-Add-split-at-new-tuple-page-split-optimization.patch application/x-patch 11.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-01-24 01:59:50 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Amit Langote 2019-01-24 01:34:17 Re: inherited primary key misbehavior