Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-07-17 19:21:27
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
scale 4800 look like this:


25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)


25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps its related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look like credible as a
stand-alone patch, but I have to admit that there's probably still
significant gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with benchmarking
workloads like this one.

Peter Geoghegan

Attachment Content-Type Size
dell-server.txt text/plain 539 bytes
tpcb.sql application/sql 547 bytes
v3-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch text/x-patch 105.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-07-17 20:12:34 Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Previous Message Peter Eisentraut 2018-07-17 19:12:12 Re: patch to allow disable of WAL recycling