Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-01-09 00:47:55
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > I'm envisioning that you have an array, with one element for each item
> > on the page (including the tuple we're inserting, which isn't really on
> > the page yet). In the first pass, you count up from left to right,
> > filling the array. Next, you compute the complete penalties, starting
> > from the middle, walking outwards.

> Ah, right. I think I see what you mean now.

> Leave it with me. I'll need to think about this some more.

Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.

I did refactor the _bt_findsplitloc() stuff to make the division of
work clearer, though -- I think that you'll find that to be a clear
improvement, even though it's less than what you asked for. I also
moved all of the _bt_findsplitloc() stuff (old and new) into its own
.c file, nbtsplicloc.c, as you suggested.

Other significant changes

* Creates a new commit that changes routines like _bt_search() and
_bt_binsrch() to use a dedicated insertion scankey struct, per request
from Heikki.

* As I mentioned in passing, many other small changes to comments, the
nbtree README, and the commit messages based on your (Heikki's) first
round of review.

* v10 generalizes the previous _bt_lowest_scantid() logic for adding a
tie-breaker on equal pivot tuples during a descent of a B-Tree.

The new code works with any truncated attribute, not just a truncated
heap TID (I removed _bt_lowest_scantid() entirely). This also allowed
me to remove a couple of places that previously opted in to
_bt_lowest_scantid(), since the new approach can work without anybody
explicitly opting in. As a bonus, the new approach makes the patch
faster, since remaining queries where we unnecessarily follow an
equal-though-truncated downlink are fixed (it's usually only the heap
TID that's truncated when we can do this, but not always).

The idea behind this new generalized approach is to recognize that
minus infinity is an artificial/sentinel value that doesn't appear in
real keys (it only appears in pivot tuples). The majority of callers
(all callers aside from VACUUM's leaf page deletion code) can
therefore go to the right of a pivot that has all-equal attributes, if
and only if:

1. The pivot has at least one truncated/minus infinity attribute *and*

2. The number of attributes matches the scankey.

In other words, we tweak the comparison logic to add a new
tie-breaker. There is no change to the on-disk structures compared to
v9 of the patch -- I've only made index scans able to take advantage
of minus infinity values in *all* cases.

If this explanation is confusing to somebody less experienced with
nbtree than Heikki: consider the way we descend *between* the values
on internal pages, rather than expecting exact matches. _bt_binsrch()
behaves slightly differently when doing a binary search on an internal
page already: equality actually means "go left" when descending the
tree (though it doesn't work like that on leaf pages, where insertion
scankeys almost always search for a >= match). We want to "go right"
instead in cases where it's clear that tuples of interest to our scan
can only be in that child page (we're rarely searching for a minus
infinity value, since that doesn't appear in real tuples). (Note that
this optimization has nothing to do with "moving right" to recover
from concurrent page splits -- we would have relied on code like
_bt_findsplicloc() and _bt_readpage() to move right once we reach the
leaf level when we didn't have this optimization, but that code isn't
concerned with recovering from concurrent page splits.)

Minor changes

* Addresses Heikki's concerns about locking the metapage more
frequently in a general way. Comments are added to nbtpage.c, and
updated in a number of places that already talk about the same risk.

The master branch seems to be doing much the same thing in similar
situations already (e.g. during a root page split, when we need to
finish an interrupted page split but don't have a usable
parent/ancestor page stack). Importantly, the patch does not change
the dependency graph.

* Small changes to user docs where existing descriptions of things
seem to be made inaccurate by the patch.


I have also recently been doing a lot of automated benchmarking. Here
are results of a BenchmarkSQL benchmark (plus various instrumentation)
as a bz2 archive:

I completed on my home server last night, against v10 of the patch
series. Note that there were 4 runs for each case (master case +
public/patch case), with each run lasting 2 hours (so the benchmark
took over 8 hours once you include bulk loading time). There were 400
"warehouses" (this is similar to pgbench's scale factor), and 16
terminals/clients. This left the database 110GB+ in size on a server
with 32GB of memory and a fast consumer grade SSD. Autovacuum was
tuned to perform aggressive cleanup of bloat. All the settings used
are available in the bz2 archive (there are "settings" output files,


See the html "report" files for a quick visual indication of how the
tests progresses. BenchmarkSQL uses R to produce useful graphs, which
is quite convenient. (I have automated a lot of this with my own ugly

We see a small but consistent increase in transaction throughput here,
as well as a small but consistent decrease in average latency for each
class of transaction. There is also a large and consistent decrease in
the on-disk size of indexes, especially if you just consider the
number of internal pages (diff the "balance" files to see what I
mean). Note that the performance is expected to degrade across runs,
since the database is populated once, at the start, and has more data
added over time; the important thing is that run n on master be
compared to run n on public/patch. Note also that I use my own fork of
BenchmarkSQL that does its CREATE INDEX before initial bulk loading,
not after [1]. It'll take longer to see problems on Postgres master if
the initial bulk load does CREATE INDEX after BenchmarkSQL workers
populate tables (we only need INSERTs to see significant index bloat).
Avoiding pristine indexes at the start of the benchmark makes the
problems on the master branch apparent sooner.

The benchmark results also include things like pg_statio* +
pg_stat_bgwriter view output (reset between test runs), which gives
some insight into what's going on. Checkpoints tend to write out a few
more dirty buffers with the patch, while there is a much larger drop
in the number of buffers written out by backends. There are probably
workloads where we'd see a much larger increase in transaction
throughput -- TPC-C happens to access index pages with significant
locality, and happens to be very write-heavy, especially compared to
the more modern (though less influential) TPC-E benchmark. Plus, the
TPC-C workload isn't at all helped by the fact that the patch will
never "get tired", even though that's the most notable improvement

Peter Geoghegan

Attachment Content-Type Size
v10-0002-Add-pg_depend-index-scan-tiebreaker-column.patch application/x-patch 15.1 KB
v10-0001-Refactor-nbtree-insertion-scankeys.patch application/x-patch 52.6 KB
v10-0004-Pick-nbtree-split-points-discerningly.patch application/x-patch 54.5 KB
v10-0005-Add-split-at-new-tuple-page-split-optimization.patch application/x-patch 10.3 KB
v10-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch application/x-patch 143.7 KB
v10-0007-DEBUG-Add-pageinspect-instrumentation.patch application/x-patch 7.8 KB
v10-0006-Add-high-key-continuescan-optimization.patch application/x-patch 8.2 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2019-01-09 01:01:12 Re: New vacuum option to do only freezing
Previous Message Andrew Dunstan 2019-01-09 00:41:55 Re: BUG #15446: Crash on ALTER TABLE