Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-02-14 06:47:03
Lists: pgsql-hackers

On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Notable improvements in v12:

I've been benchmarking v12, once again using a slightly modified
BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since
the problems with index bloat don't take so long to manifest
themselves when the indexes are inserted into incrementally from the
very beginning. This benchmarking process took over 20 hours, with a
database that started off at about 90GB (700 TPC-C/BenchmarkSQL
warehouses were used). That easily exceeded available main memory on
my test server, which was 32GB. This is a pretty I/O bound workload,
and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB
NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only
tablespace.

Importantly, I figured out that I should disable both hash joins and
merge joins with BenchmarkSQL, in order to force all joins to be
nested loop joins. Otherwise, the "stock level" transaction eventually
starts to use a hash join, even though that's about 10x slower than a
nestloop join (~4ms vs. ~40ms on this machine) -- the hash join
produces a lot of noise without really testing anything. It usually
takes a couple of hours before we start to get obviously-bad plans,
but it also usually takes about that long until the patch series
starts to noticeably overtake the master branch. I don't think that
TPC-C will ever benefit from using a hash join or a merge join, since
it's supposed to be a pure OLTP benchmark, and is a benchmark that
MySQL is known to do at least respectably well on.
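For reference, forcing nested loop joins the way described above only needs the standard planner GUCs; a minimal sketch (shown as session-level SETs, though they can equally go in postgresql.conf for the whole benchmark run):

```sql
-- Disable the alternative join strategies so the planner
-- falls back to nested loop joins for every join:
SET enable_hashjoin = off;
SET enable_mergejoin = off;
```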

This is the first benchmark I've published that was considerably I/O
bound. There are significant improvements in performance across the
board, on every measure, though it takes several hours for that to
really show. The benchmark was not rate-limited. 16
clients/"terminals" are used throughout. There were 5 runs for master
and 5 for patch, interlaced, each lasting 2 hours. Initialization
occurred once, so it's expected that both databases will gradually get
larger across runs.

Summary (appears in same order as the execution of each run) -- each
run is 2 hours, so 20 hours total excluding initial load time (2 hours
* 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC (NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC (NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC (NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC (NewOrders) = 36508.38 <-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC (NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC (NewOrders) = 34021.98 <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC (NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC (NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC (NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders) = 30885.56 <-- 6.9% increase in tpmTOTAL/throughput
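The percentage increases annotated above follow directly from the tpmTOTAL numbers; as a sanity check, plain arithmetic reproduces the run 2 and run 5 figures:

```python
def pct_increase(master, patch):
    """Percentage increase of patch throughput over master."""
    return (patch - master) / master * 100.0

# Run 2: 77091.63 -> 83905.48 tpmTOTAL
print(round(pct_increase(77091.63, 83905.48), 1))  # 8.8

# Run 5: 66503.38 -> 71072.30 tpmTOTAL
print(round(pct_increase(66503.38, 71072.30), 1))  # 6.9
```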

There were *also* significant reductions in transaction latency for
the patch -- see the full html reports in the provided tar archive for
full details (URL provided below). The html reports have nice SVG
graphs, generated by BenchmarkSQL using R -- one for transaction
throughput, and another for transaction latency. The overall picture
is that the patched version starts out ahead, and has a much more
gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction
in blocks read into shared_buffers for the duration of these runs. For
example, here is what pg_stat_database shows for run 3 (I reset the
stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517 <-- 10.7% reduction in blks_read (blocks read into shared_buffers)
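The 10.7% figure is the relative drop in blks_read; the same counters also give the shared_buffers hit rate for each branch (plain arithmetic on the numbers above):

```python
master_read, master_hit = 78_412_640, 4_022_619_556
patch_read, patch_hit = 70_033_583, 4_505_308_517

# Relative reduction in blocks read into shared_buffers
reduction = (master_read - patch_read) / master_read * 100.0
print(round(reduction, 1))  # 10.7

# Buffer hit rate = blks_hit / (blks_hit + blks_read)
def hit_rate(hit, read):
    return hit / (hit + read) * 100.0

print(round(hit_rate(master_hit, master_read), 1))  # 98.1
print(round(hit_rate(patch_hit, patch_read), 1))    # 98.5
```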

This suggests an indirect benefit, likely related to how buffers are
evicted in each case. pg_stat_bgwriter indicates that more buffers are
written out during checkpoints, while fewer are written out by
backends. I won't speculate further on what all of this means right
now, though.
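The per-run counters come from the standard statistics views, so anyone repeating this can capture them with queries along these lines (the database name here is a placeholder, not necessarily what my script used):

```sql
-- Snapshot block I/O counters for the benchmark database
SELECT datname, blks_read, blks_hit
FROM pg_stat_database
WHERE datname = 'benchmarksql';  -- placeholder name

-- Checkpoint vs. backend buffer writes
SELECT buffers_checkpoint, buffers_backend
FROM pg_stat_bgwriter;

-- Reset counters between runs
SELECT pg_stat_reset();
SELECT pg_stat_reset_shared('bgwriter');
```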

You can find the raw details for blks_read for each and every run in
the full tar archive. It is available for download from:

There are also dumps of the other pg_stat* views at the end of each
run, logs for each run, and so on -- probably more information than
anybody is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy
to assist with that. There is a shell script (zsh) included in the tar
archive, although it will need to be changed a bit to point to the
correct installations and so on. Independent validation of the
performance of the patch series on this and other benchmarks is very
welcome.
Peter Geoghegan
