Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-12 02:47:29
Message-ID: CAH2-Wz=L8G68UAWrW_UTvac-cb_SmSP-ReNzV6LHoKYmx1uE-Q@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The regression that I mentioned earlier isn't in pgbench type
> workloads (even when the distribution is something more interesting
> than the uniform distribution default). It is only in workloads with
> lots of page splits and lots of index churn, where we get most of the
> benefit of the patch, but also where the costs are most apparent.
> Hopefully it can be fixed, but if not I'm inclined to think that it's
> a price worth paying. This certainly still needs further analysis and
> discussion, though. This revision of the patch does not attempt to
> address that problem in any way.

I believe that I've figured out what's going on here.

At first, I thought that this regression was due to the cycles that
have been added to page splits, but that doesn't seem to be the case
at all. Nothing that I did to make page splits faster helped (e.g.
temporarily going back to doing them "bottom up" made no difference).
CPU utilization was consistently slightly *higher* with the master
branch (the patch spent slightly more CPU time idle). I now believe
that the problem is LWLock/buffer lock contention on index pages, and
that that's an inherent cost with a minority of write-heavy,
high-contention workloads, a cost that we should just accept.
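
For anybody who wants to reproduce this: sampling pg_stat_activity's
wait events while the benchmark runs makes the contention easy to see.
Something along these lines should be enough (just a sketch, not
necessarily what I ran):

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type IN ('LWLock', 'BufferPin')
GROUP BY 1, 2
ORDER BY count(*) DESC;

Buffer content lock waits show up under wait_event_type = 'LWLock'
here.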

Making the orderline primary key about 40% smaller increases
contention when BenchmarkSQL is run with this particular
configuration. The latency for the NEW_ORDER transaction went from
~4ms average on master to ~5ms average with the patch, while the
latency for other types of transactions was either unchanged or
improved. It's noticeable, but not that noticeable. This needs to be
put in context. The final throughput for this configuration was
250,000 transactions per minute, with a total of only 100 warehouses.
What this boils down to is that the per-warehouse throughput is about
8000% of the maximum valid NOPM (new orders per minute) allowed by
the TPC-C spec [1]. In other
words, the database is too small relative to the machine, by a huge
amount, making the result totally and utterly invalid if you go on
what the TPC-C spec says. This exaggerates the LWLock/buffer lock
contention on index pages.
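
To spell out the arithmetic (the ~12.86 NOPM-per-warehouse ceiling
comes from the spec's keying/think time rules; treat the exact
constant as approximate):

-- ~45% of the 250,000 TPM are NEW_ORDER, spread over only 100 warehouses
SELECT round(250000 * 0.45 / 100 / 12.86) AS multiple_of_spec_ceiling;
-- => roughly 87x, which is the same ballpark as the ~8000% figure above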

TPC-C is supposed to simulate a real use case with a plausible
configuration, but the details here are totally unrealistic. For
example, there are 3 million customers here (there are always 30k
customers per warehouse). 250k TPM means that there were about 112k
new orders per minute. It's hard to imagine a population of 3 million
customers making 112k orders per minute. That's over 20 million orders
in the first 3-hour-long run that I got these numbers from. Each of
these orders has an average of about 10 line items. These people must
be very busy, and must have an awful lot of storage space in their
homes! (There are various other factors here, such as skew, and the
details will never be completely realistic anyway, but you take my
point. TPC-C is *designed* to be a realistic distillation of a real
use case, going so far as to require usable GUI input terminals when
evaluating a formal benchmark submission.)
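
Just to show the arithmetic behind those numbers (all approximate,
derived from the same figures as above):

-- 100 warehouses, 30k customers each; ~45% of 250k TPM are new orders
SELECT 100 * 30000                AS customers,          -- 3 million
       round(250000 * 0.45)       AS new_orders_per_min, -- ~112k
       round(250000 * 0.45 * 180) AS orders_in_3_hours;  -- ~20 million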

The benchmark that I posted in mid-February [2] (which showed better
performance across the board) was much closer to what the TPC-C spec
requires -- that was only ~400% of the maximum valid NOPM (the
BenchmarkSQL html reports will tell you this if you download the
archive I posted), and had 2,000 warehouses. TPC-C is *supposed* to be
I/O bound, and I/O bound workloads are what the patch helps with the
most. The general idea with TPC-C's NOPM is that you're required to
increase the number of warehouses as throughput increases. This stops
you from getting an unrealistically favorable result by churning
through a small amount of data, from the same few warehouses.

The only benchmark that I ran that actually satisfied TPC-C's NOPM
requirements had a total of 7,000 warehouses, and was almost a full
terabyte in size on the master branch. This was run on an i3.4xlarge
high I/O AWS EC2 instance. That was substantially I/O bound, and had
an improvement in throughput that was very similar to the mid-February
results which came from my home server -- we see a ~7.5% increase in
transaction throughput after a few hours. I attach graphs of block
device reads/writes from the second 4-hour run of this same 7,000
warehouse benchmark (one each for master and the patch). They show a substantial
reduction in I/O according to OS-level instrumentation. (Note that the
same FS/logical block device was used for both WAL and database
files.)

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression *because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.
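
If anybody wants to check the index sizes for themselves, the usual
catalog query against the BenchmarkSQL database on each branch is
enough. Something like this (just a sketch):

SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid::regclass)) AS size
FROM pg_class c
WHERE c.relkind = 'i'
ORDER BY pg_relation_size(c.oid::regclass) DESC
LIMIT 10;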

I'm not seeing the problem when pgbench is run with a small scale
factor but with a high client count. pgbench doesn't have the benefit
of much smaller indexes, so it also doesn't bear any cost when
contention is ramped up. The pgbench_accounts primary key (which is by
far the largest index) is *precisely* the same size as it is on
master, though the other indexes do seem to be a lot smaller. They
were already tiny, though. OTOH, the TPC-C NEW_ORDER transaction does
a lot of straight inserts, localized by warehouse, with skewed access.

[1] https://youtu.be/qYeRHK6oq7g?t=1340
[2] https://www.postgresql.org/message-id/CAH2-WzmsK-1qVR8xC86DXv8U0cHwfPcuH6hhA740fCeEu3XsVg@mail.gmail.com

--
Peter Geoghegan

Attachment Content-Type Size
7000_wh_4_hours_patch_2.svg image/svg+xml 97.2 KB
7000_wh_4_hours_master_2.svg image/svg+xml 97.8 KB
