Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-18 23:34:43
Message-ID: CAH2-Wzkb_wnQkyg9wW6DtZywU1e7xbzx8dkDGv1r-yrfYrmQFw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use. We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly
the same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are
~40% smaller, in part because lots of people outside the community
find TPC-C interesting, and in part because this patch series is
focused on cases where we currently do unusually badly (cases where
good intuitions about how B-Trees are supposed to perform break down
[1]). These pinpointable problems must affect a lot of users some of
the time, but certainly not all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages
to do that with pgbench, which doesn't do inserts into indexes, except
for those required for non-HOT updates. pgbench requires relatively
few page splits, but is in every other sense a high contention
workload.

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server. Each
reported run lasted 10 minutes, and was repeated twice for each of the
client counts 1, 2, 8, 16, and 64, on both patch and master:

\set aid random_gaussian(1, 100000 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms
1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms
8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms
16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms
64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms
1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13210.292833 (including connections establishing)
tps = 13210.321528 (excluding connections establishing)
latency average = 0.151 ms
latency stddev = 0.029 ms

8 clients master:
tps = 27470.112858 (including connections establishing)
tps = 27470.229891 (excluding connections establishing)
latency average = 0.291 ms
latency stddev = 0.093 ms
8 clients patch:
tps = 28132.981815 (including connections establishing)
tps = 28133.096414 (excluding connections establishing)
latency average = 0.284 ms
latency stddev = 0.081 ms

16 clients master:
tps = 32409.399669 (including connections establishing)
tps = 32409.533400 (excluding connections establishing)
latency average = 0.493 ms
latency stddev = 0.182 ms
16 clients patch:
tps = 33678.304986 (including connections establishing)
tps = 33678.427420 (excluding connections establishing)
latency average = 0.475 ms
latency stddev = 0.168 ms

64 clients master:
tps = 25864.453485 (including connections establishing)
tps = 25864.639098 (excluding connections establishing)
latency average = 2.466 ms
latency stddev = 1.698 ms
64 clients patch:
tps = 26382.926218 (including connections establishing)
tps = 26383.166692 (excluding connections establishing)
latency average = 2.417 ms
latency stddev = 1.678 ms

There was a third run which has been omitted, because it's practically
the same as the first two. The order that results appear in is the
order things actually ran in (I like to interlace master and patch
runs closely).

Analysis
========

There seems to be a ~2% regression with one or two clients, but we
more than make up for that as the client count goes up -- the 8 and 64
client cases improve throughput by ~2.5%, and the 16 client case
improves throughput by ~4%. This seems like a totally reasonable
trade-off to me. As I said already, the patch isn't really about
workloads that we already do acceptably well on, such as this one, so
you're not expected to be impressed with these numbers. My goal is to
show that boring workloads that fit everything in shared_buffers
appear to be fine. I think that that's a reasonable conclusion, based
on these numbers. Lower client count cases are generally considered
less interesting, and also lose less in throughput than we go on to
gain as more clients are added. I'd be surprised if anybody
complained.
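
The quoted percentages can be cross-checked directly from the tps
figures listed above. A quick calculation (using the "including
connections establishing" numbers, averaged over both passes) gives
roughly -2.4% and -2.3% for 1 and 2 clients, and +2.7%, +3.7%, and
+2.6% for 8, 16, and 64 clients:

```python
# Relative throughput change (patch vs. master), computed from the
# "including connections establishing" tps figures reported above,
# averaged over both passes.
tps = {
    # clients: ((master_pass1, master_pass2), (patch_pass1, patch_pass2))
    1:  ((7203.983289, 7172.995796), (7012.575167, 7024.724365)),
    2:  ((13434.043832, 13489.016303), (13105.620223, 13210.292833)),
    8:  ((27126.852038, 27470.112858), (27945.457965, 28132.981815)),
    16: ((32297.612323, 32409.399669), (33434.889405, 33678.304986)),
    64: ((25699.029787, 25864.453485), (26513.816673, 26382.926218)),
}

for clients, (master, patch) in tps.items():
    avg_master = sum(master) / 2
    avg_patch = sum(patch) / 2
    change = 100.0 * (avg_patch - avg_master) / avg_master
    print(f"{clients:>2} clients: {change:+.1f}%")
```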

I think that the explanation for the regression with one or two
clients boils down to this: We're making better decisions about where
to split pages, and even about how pages are accessed by index scans
(more on that in the next paragraph). However, this isn't completely
free (particularly the page split stuff), and it doesn't pay for
itself until the number of clients ramps up. Still, not being more
careful about that stuff is penny wise, pound foolish. I even suspect
that there are priority inversion issues when there is high contention
during unique index enforcement, which might be a big problem on
multi-socket machines with hundreds of clients. I am not in a position
to confirm that right now, but we have heard reports that are
consistent with this explanation at least once before now [2]. Zipfian
was also somewhat better when I last measured it, using the same
fairly modest machine -- I didn't repeat that here because I wanted
something simple and widely studied.

The patch establishes the principle that there is only one good reason
to visit more than one leaf page within index scans like those used by
pgbench: a concurrent page split, where the scan simply must go right
to find matches that were just missed in the first leaf page. That
should be very rare. We should never visit two leaf pages because
we're confused about where there might be matches. There is simply no
good reason for there to be any ambiguity or confusion.
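
To illustrate the idea (this is only a conceptual sketch in Python,
not PostgreSQL's actual C implementation, and scan_leaf_pages is a
made-up name): after scanning a leaf page, an equality scan only
needs to move right when the search key is strictly greater than the
page's high key, which can only happen when a concurrent split has
just moved matches to the right sibling.

```python
# Conceptual model of the "move right only when forced" rule.
# pages: list of (items, high_key) tuples, left to right; high_key is
# None on the rightmost page.

def scan_leaf_pages(pages, key):
    matches = []
    for items, high_key in pages:
        matches.extend(item for item in items if item == key)
        # High-key check: if the key cannot be beyond this page's key
        # space, the scan is done -- no need to visit the sibling.
        if high_key is None or key <= high_key:
            break
    return matches

# The scan stops on the first page whenever key <= high_key:
pages = [([5, 7, 9], 10), ([11, 13], None)]
assert scan_leaf_pages(pages, 9) == [9]
```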

The patch could still make index scans like these visit more than a
single leaf page for a bad reason, at least in theory: when there are
at least ~400 duplicates in a unique index and we therefore can't
possibly store them all on one leaf page, index scans will of course
have to visit more than one leaf page. Again, that should be very
rare. All index scans can now check the high key on the leaf level,
and avoid going right when they happen to be very close to the right
edge of the leaf page's key space. And, we never have to take the
scenic route when descending the tree on an equal internal page key,
since that condition has practically been eliminated by suffix
truncation. No new tuple can be equal to negative infinity, and
negative infinity appears in every pivot tuple. There is a place for
everything, and everything is in its place.
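
A toy sketch of that comparison rule (again in Python rather than
PostgreSQL's C code, with made-up names; nbtree actually implements
this via the number of attributes in the pivot tuple rather than an
explicit sentinel): because a truncated attribute compares lower than
any real value, no real tuple key can ever be equal to a truncated
pivot, so descent is never ambiguous.

```python
# NEG_INF stands in for a truncated ("negative infinity") pivot
# attribute; it compares lower than every real value.
NEG_INF = object()

def cmp_attr(a, b):
    if a is b:
        return 0
    if a is NEG_INF:
        return -1
    if b is NEG_INF:
        return 1
    return (a > b) - (a < b)

def cmp_key(key, pivot):
    # Compare attribute by attribute; first inequality decides.
    for ka, pa in zip(key, pivot):
        c = cmp_attr(ka, pa)
        if c != 0:
            return c
    return 0

# A pivot truncated to ('foo', NEG_INF) can never equal a real key
# such as ('foo', <heap TID>), so there is exactly one child page the
# descent can choose:
assert cmp_key(('foo', 42), ('foo', NEG_INF)) > 0
```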

[1] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
[2] https://postgr.es/m/BF3B6F54-68C3-417A-BFAB-FB4D66F2B410@postgrespro.ru
--
Peter Geoghegan
