Re: Deleting older versions in unique indexes to avoid page splits

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date: 2020-10-26 21:15:03
Message-ID: CAH2-Wzkaj6OhVSvB46t_ybosd0cH1Qaiao+0dSopGUDmEM37Hw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 24, 2020 at 8:01 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> There probably are some workloads with worse latency and throughput,
> but generally only with high contention/small indexes. I'll try to
> fine tune those, but some amount of it is probably inevitable. On
> average query latency is quite a lot lower with the patch (where it is
> affected at all - the mechanism is only used with non-hot updates).

Attached is v5, which has changes that are focused on two important
high level goals:

1. Keeping costs down generally, especially in high contention cases
where costs are most noticeable.

2. Making indexes on low cardinality columns -- the kind that naturally
benefit from merge deduplication in Postgres 13 -- receive largely the
same benefits that you've already seen with high cardinality indexes
(at least outside of the extreme cases where it isn't sensible to
try).

CPU costs (especially from sorting temp work arrays) seem to be the
dominant cost overall. It turns out that the cost of accessing heap
pages within the new mechanism is not really noticeable. This isn't
all that surprising, though. The heap pages are accessed in a way that
naturally exploits locality across would-be page splits in different
indexes.

To some degree the two goals that I describe conflict with each other.
If merge deduplication increases the number of logical rows that "fit
on each leaf page" (by increasing the initial number of TIDs on each
leaf page by over 3x when the index is in the pristine CREATE INDEX
state), then naturally the average amount of work required to maintain
indexes in their pristine state is increased. We cannot expect to pay
nothing to avoid "unnecessary page splits" -- we can only expect to
come out ahead over time.

The main new thing that allowed me to more or less accomplish the
second goal is granular deletion in posting list tuples. That is, v5
teaches the delete mechanism to do VACUUM-style granular TID deletion
within posting lists. This factor turned out to matter a lot.
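
To see what those posting list tuples look like on disk, a pageinspect
query along these lines works (just an illustration on my part, not
something the patch adds -- the block number is arbitrary, and "fiver"
is one of the low cardinality test indexes described further down):

create extension if not exists pageinspect;
-- Posting list tuples store an array of heap TIDs in "tids"; the v5
-- delete mechanism can remove individual TIDs from that array instead
-- of having to delete the whole index tuple.
select itemoffset, htid, tids
from bt_page_items('fiver', 1)   -- block 1: typically the leftmost leaf
where tids is not null;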

Here is an extreme benchmark that I ran this morning for this patch,
which shows both strengths and weaknesses:

pgbench scale 1000 (with customizations to indexing that I go into
below), 16 + 32 clients, 30 minutes per run (so 2 hours total
excluding initial data loading). Same queries as last time:

\set aid1 random_gaussian(1, 100000 * :scale, 4.0)
\set aid2 random_gaussian(1, 100000 * :scale, 4.5)
\set aid3 random_gaussian(1, 100000 * :scale, 4.2)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
END;
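
Each run just feeds that script to pgbench along these lines (the
script file name here is made up, and the exact flags are approximate;
the 32 client runs use -c 32 -j 32):

pgbench -n -M prepared -f skewed_accounts.sql -c 16 -j 16 -T 1800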

And just like last time, we replace the usual pgbench_accounts index
with an INCLUDE index to artificially make every UPDATE a non-HOT
UPDATE:

create unique index aid_pkey_include_abalance on pgbench_accounts
(aid) include(abalance);
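
You can confirm that the INCLUDE index really does defeat HOT by
watching the update counters in pg_stat_user_tables (a sanity check on
my part, not part of the benchmark output) -- n_tup_hot_upd should stay
flat while n_tup_upd climbs:

select n_tup_upd, n_tup_hot_upd
from pg_stat_user_tables
where relname = 'pgbench_accounts';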

Unlike last time, we have a variety of useless-to-us *low cardinality*
indexes. These are much smaller than aid_pkey_include_abalance due to
merge deduplication, but are otherwise quite similar to it:

create index fiver on pgbench_accounts ((aid - (aid%5)));
create index tenner on pgbench_accounts ((aid - (aid%10)));
create index score on pgbench_accounts ((aid - (aid%20)));

The silly indexes this time around are designed to have the same skew
as the PK, but with low cardinality data. So, for example, "score" has
twenty logical rows for each distinct value in the index, which is
pretty extreme as far as the delete mechanism is concerned. That's why
this is more of a mixed picture compared to the earlier benchmark. I'm
trying to really put the patch through its paces, not make it look
good.
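
To put a number on the cardinality difference (purely for
illustration, not something from the benchmark itself):

-- at scale 1000, 100 million distinct aid values collapse to
-- ~5 million distinct "score" keys
select count(distinct aid) as pk_keys,
       count(distinct (aid - (aid % 20))) as score_keys
from pgbench_accounts;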

First the good news. The patch held up perfectly in one important way
-- the size of the indexes didn't change at all compared to the
original pristine size. That looked like this at the start for both
patch and master:

aid_pkey_include_abalance: 274,194 pages / 2,142 MB
fiver: 142,348 pages / 1,112 MB
tenner: 115,352 pages / 901 MB
score: 94,677 pages / 740 MB

By the end, master looked like this:

aid_pkey_include_abalance: 428,759 pages (~1.56x original size)
fiver: 220,114 pages (~1.54x original size)
tenner: 176,983 pages (~1.53x original size)
score: 178,165 pages (~1.88x original size)

(As I said, no change in the size of indexes with the patch -- not
even one single page split.)
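
The size figures can be reproduced with a catalog query along these
lines (my own approximate formulation; note that relpages is only
current as of the last VACUUM or ANALYZE):

select relname, relpages,
       pg_size_pretty(pg_relation_size(oid)) as size
from pg_class
where relname in ('aid_pkey_include_abalance', 'fiver', 'tenner', 'score')
order by relpages desc;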

Now for the not-so-good news. The TPS numbers looked like this (the
runs alternated between patch and master, and the results appear in
their original chronological order):

patch_1_run_16.out: "tps = 30452.530518 (including connections establishing)"
master_1_run_16.out: "tps = 35101.867559 (including connections establishing)"
patch_1_run_32.out: "tps = 26000.991486 (including connections establishing)"
master_1_run_32.out: "tps = 32582.129545 (including connections establishing)"

The latency numbers aren't great for the patch, either. Take the 16 client case:

number of transactions actually processed: 54814992
latency average = 0.525 ms
latency stddev = 0.326 ms
tps = 30452.530518 (including connections establishing)
tps = 30452.612796 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
0.000 \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
0.000 \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.046 BEGIN;
0.159 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
0.153 SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
0.091 SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
0.075 END;

vs master's 16 client case:

number of transactions actually processed: 63183870
latency average = 0.455 ms
latency stddev = 0.307 ms
tps = 35101.867559 (including connections establishing)
tps = 35101.914573 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid1 random_gaussian(1, 100000 * :scale, 4.0)
0.000 \set aid2 random_gaussian(1, 100000 * :scale, 4.5)
0.000 \set aid3 random_gaussian(1, 100000 * :scale, 4.2)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.049 BEGIN;
0.117 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid1;
0.120 SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
0.091 SELECT abalance FROM pgbench_accounts WHERE aid = :aid3;
0.077 END;

Earlier variations of this same workload, run with a 10k TPS rate
limit, had somewhat lower SELECT latency with the patch, while the
UPDATE statements were slower by roughly the same amount as you see
here. Here we see that at least the aid3 SELECT is just as fast with
the patch. (I don't have a proper example of this rate-limited
phenomenon close at hand now, since I saw it happen back when the
patch was less well optimized.)

This benchmark is designed to be extreme, and to really stress the
patch. For one thing, three of the four indexes never receive any
index scans at all, so controlling the bloat there isn't going to give
you the main expected benefit, which is that index scans don't even
have to visit heap pages for old versions, since those versions aren't
in the index for very long (unless there aren't very many versions of
the logical rows in question, which isn't really a problem for us).
For another, the new mechanism is constantly needed, which just isn't
very realistic. It seems as if the cost is mostly paid by non-HOT
updaters, which seems exactly right to me.

I probably could have cheated by making the aid_pkey_include_abalance
index a non-unique index, denying the master branch the benefit of
LP_DEAD setting within _bt_check_unique(). Or I could have cheated
(perhaps I should just say "gone for something a bit more
sympathetic") by not having skew on all the indexes (say by hashing on
aid in the indexes that use merge deduplication). I also think that
the TPS gap would have been smaller if I'd spent more time on each
run, but I didn't have time for that today. In the past I've seen it
take a couple of hours or more for the advantages of the patch to come
through (it takes that long for reasons that should be obvious).
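
To be concrete about what I mean by "hashing on aid": a skew-free
variant of one of the low cardinality indexes might look something
like this (hypothetical, not something I actually ran -- it keeps the
same five-rows-per-key grouping while scattering the hot keys
uniformly through the index's key space):

-- aid is int4 at this scale; a bigint aid would need hashint8() instead
create index fiver_hashed on pgbench_accounts (hashint4(aid - (aid % 5)));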

Even the overhead we see here is pretty tolerable IMV. I believe that
it will be far more common for the new mechanism to hardly get used at
all, and yet have a pretty outsized effect on index bloat. To give you
a simple example of how that can happen, consider that if this kind of
version churn did happen in a real workload, it would probably be
caused by a surge in demand -- and now we don't have to live with the
bloat consequences of that isolated event forever (or until the next
REINDEX). I can make more sophisticated arguments than that one, but
it doesn't seem useful right now, so I'll refrain.

The patch adds a backstop. It seems to me that that's really what we
need here: predictability over time and under a large variety of
different conditions. Real workloads constantly fluctuate.

Even if people end up not buying my argument that it's worth it for
workloads like this, there are various options. And, I bet I can
further improve the high contention cases without losing the valuable
part -- there are a number of ways in which I can get the CPU costs
down further that haven't been fully explored (yes, it really does
seem to be CPU costs, especially due to TID sorting). Again, this
patch is all about extreme pathological workloads, system stability,
and system efficiency over time -- it is not simply about increasing
system throughput. There are some aspects of this design (that come up
with extreme workloads) that may in the end come down to value
judgments. I'm not going to tell somebody that they're wrong for
prioritizing different things (within reason, at least). In my opinion
almost all of the problems we have with VACUUM are ultimately
stability problems, not performance problems per se. And, I suspect
that we do very well with stupid benchmarks like this compared to
other DB systems precisely because we currently allow non-HOT updaters
to "live beyond their means" (which could in theory be great if you
frame it a certain way that seems pretty absurd to me). This suggests
that we can "afford" to go a bit slower here as far as competitive
pressures are concerned (notice that this is distinct from my favorite
argument, which is that we cannot afford to *not* go a bit slower in
certain extreme cases).

I welcome debate about this.

--
Peter Geoghegan

Attachment Content-Type Size
v5-0001-Add-sort_template.h-for-making-fast-sort-function.patch application/octet-stream 13.7 KB
v5-0002-Add-delete-deduplication-to-nbtree.patch application/octet-stream 86.1 KB
