Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-11-13 01:47:45
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Sun, Nov 4, 2018 at 10:58 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Just the lack of pg_upgrade support.

Attached is v7 of the patch series. Changes:

* Pre-pg_upgrade indexes (indexes of an earlier BTREE_VERSION) are now
supported. Using pg_upgrade will be seamless to users. "Getting tired"
returns, for the benefit of old indexes that regularly have lots of
duplicates inserted.

Notably, the new/proposed version of btree (BTREE_VERSION 4) cannot be
upgraded on-the-fly -- we're changing more than the contents of the
metapage, so that won't work. Version 2 -> version 3 upgrades can
still take place dynamically/on-the-fly. It you want to upgrade to
version 4, you'll need to REINDEX. The performance of the patch with
pg_upgrade'd indexes has been validated. There doesn't seem to be any

amcheck checks both the old invariants, and the new/stricter/L&Y
invariants. Which set are checked depends on the btree version of the
index undergoing verification.

* ASC heap TID order is now used -- not DESC order, as before. This
fixed all performance regressions that I'm aware of, and seems quite a
lot more elegant overall.

I believe that the patch series is now an unambiguous, across the
board win for performance. I could see about a 1% increase in
transaction throughput with my own TPC-C tests, while the big drop in
the size of indexes was preserved. pgbench testing also showed as much
as a 3.5% increase in transaction throughput in some cases with
non-uniform distributions. Thanks for the suggestion, Heikki!

Unfortunately, and as predicted, this change created a new problem
that I need to fix directly: it makes certain diagnostic messages that
accidentally depend on a certain pg_depend scan order say something
different, and less useful (though still technically correct). I'll
tackle that problem over on the dedicated thread I started [1]. (For
now, I include a separate patch to paper over questionable regression
test changes in a controlled way:

* New optimization that has index scans avoid visiting the next page
by checking the high key -- this is broken out into its own commit

This is related to an optimization that has been around for years --
we're now using the high key, rather than using a normal (non-pivot)
index tuple. High keys are much more likely to indicate that the scan
doesn't need to visit the next page with the earlier patches in the
patch series applied, since the new logic for choosing a split point
favors a high key with earlier differences. It's pretty easy to take
advantage of that. With a composite index, or a secondary index, it's
particularly likely that we can avoid visiting the next leaf page. In
other words, now that we're being smarter about future locality of
access during page splits, we should take full advantage during index

The v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
commit uses a _bt_lowest_scantid() sentinel value to avoid
unnecessarily visiting a page to the left of the page we actually
ought to go to directly during a descent of a B-Tree -- that
optimization was around in all earlier versions of the patch series.
It seems natural to also have this new-to-v7 optimization. It avoids
unnecessarily going right once we reach the leaf level, so it "does
the same thing on the right side" -- the two optimizations mirror each
other. If you don't get what I mean by that, then imagine a secondary
index where each value appears a few hundred times. Literally every
simple lookup query will either benefit from the first optimization on
the way down the tree, or from the second optimization towards the end
of the scan. (The page split logic ought to pack large groups of
duplicates together, ideally confining them to one leaf page.)

Andrey: the BTScanInsert struct still has the restorebinsrch stuff
(mutable binary search optimization state) in v7. It seemed to make
sense to keep it there, because I think that we'll be able to add
similar optimizations in the future, that use similar mutable state.
See my remarks on "dynamic prefix truncation" [2]. I think that that
could be very helpful with skip scans, for example, so we'll probably
end up adding it before too long. I hope you don't feel too strongly
about it.

Peter Geoghegan

Attachment Content-Type Size
v7-0005-Temporarily-paper-over-problematic-regress-output.patch application/x-patch 8.5 KB
v7-0006-DEBUG-Add-pageinspect-instrumentation.patch application/x-patch 7.8 KB
v7-0004-Add-high-key-continuescan-optimization.patch application/x-patch 8.1 KB
v7-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch application/x-patch 41.1 KB
v7-0003-Add-split-at-new-tuple-page-split-optimization.patch application/x-patch 12.2 KB
v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch application/x-patch 186.3 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Julian Hsiao 2018-11-13 02:02:22 Possible buffer overrun in src/backend/libpq/hba.c gethba_options()
Previous Message Amit Langote 2018-11-13 01:34:50 Re: move PartitionBoundInfo creation code