From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-10-01 02:39:34
Message-ID: CAH2-Wzmww9kM99iK42nQmXONg564DPseuUDy_mSAi3OeaxU07A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 27, 2019 at 7:02 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I think that it makes sense to enable deduplication by default -- even
> with unique indexes. It looks like deduplication can be very helpful
> with non-HOT updates.

Attached is v20, which adds a custom strategy for the checkingunique
(unique index) case to _bt_dedup_one_page(). It also makes
deduplication the default for both unique and non-unique indexes. I
simply altered your new BtreeDefaultDoDedup() macro from v19 to make
nbtree use deduplication wherever it is safe to do so. This default
may not be the best one in the end, though deduplication in unique
indexes now looks very compelling.
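
For concreteness, the shape of that change is roughly as follows. This
is only an illustrative sketch, not the actual macro body from v19/v20;
_bt_dedup_safe_for_index() is a made-up stand-in for whatever check
decides that deduplication is safe for a given index (suitable
opclasses/collations and so on):

/*
 * Illustrative sketch only -- not the real v19/v20 definition.
 * _bt_dedup_safe_for_index() is a hypothetical helper standing in
 * for the patch's actual "is deduplication safe here?" test.
 */
#define BtreeDefaultDoDedup(rel) \
	(_bt_dedup_safe_for_index(rel))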

The new checkingunique heuristics added to _bt_dedup_one_page() were
developed experimentally, based on pgbench tests. The general idea
with the new checkingunique stuff is to make deduplication *extremely*
lazy. We want to avoid making _bt_vacuum_one_page() garbage collection
less effective by being too aggressive with deduplication -- workloads
with lots of non-HOT updates against unique indexes depend heavily on
the LP_DEAD bit setting in _bt_check_unique(). At the same time,
_bt_dedup_one_page() can be just as effective at delaying page splits
as it is with non-unique indexes.
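
To illustrate the general idea -- the exact heuristics are in the patch
itself, and what follows is only a hypothetical sketch with made-up
helper names -- "extremely lazy" means something like: defer to LP_DEAD
cleanup, and only consider deduplication once a page split is otherwise
imminent:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Hypothetical helper, for illustration only -- not the heuristics
 * actually used in v20.  The point is that the checkingunique case
 * defers to LP_DEAD garbage collection, and only deduplicates as a
 * last resort, to delay an imminent page split.
 */
static bool
bt_dedup_worthwhile_unique(Page page, Size newitemsz)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/*
	 * If the page is known to have LP_DEAD items, let
	 * _bt_vacuum_one_page() reclaim that space instead; deduplicating
	 * here would only make that cleanup less effective.
	 */
	if (P_HAS_GARBAGE(opaque))
		return false;

	/* Deduplicate only when the incoming tuple will not otherwise fit */
	return PageGetFreeSpace(page) < newitemsz;
}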

I've found that my "regular pgbench, but with a low cardinality index
on pgbench_accounts(abalance)" benchmark works best with the specific
heuristics used in the patch, especially over many hours. I spent
nearly 24 hours running the test at full speed (no throttling this
time), at scale 500, and with very very aggressive autovacuum settings
(autovacuum_vacuum_cost_delay=0ms,
autovacuum_vacuum_scale_factor=0.02). Each run lasted one hour, with
alternating runs of 4, 8, and 16 clients. Towards the end, the patch
had about 5% greater throughput at lower client counts, and never
seemed to be significantly slower (it was very slightly slower once or
twice, but I think that that was just noise).

More importantly, the indexes looked like this on master:

bloated_abalance: 3017 MB
pgbench_accounts_pkey: 2142 MB
pgbench_branches_pkey: 1352 kB
pgbench_tellers_pkey: 3416 kB

And like this with the patch:

bloated_abalance: 1015 MB
pgbench_accounts_pkey: 1745 MB
pgbench_branches_pkey: 296 kB
pgbench_tellers_pkey: 888 kB

* bloated_abalance is about 3x smaller here, as usual -- no surprises there.

* pgbench_accounts_pkey is the most interesting case.

You might think that it isn't that great that pgbench_accounts_pkey is
1745 MB with the patch, since it starts out at only 1071 MB (and would
go back down to 1071 MB again if we were to do a REINDEX). However,
you have to bear in mind that it takes a long time for it to get that
big -- the growth over time is very important here. Even after the
first run with 16 clients, it only reached 1160 MB -- that's an
increase of ~8%. The master case had already reached 2142 MB ("bloat
saturation point") by then, though. I could easily have stopped the
benchmark there, or used rate-limiting, or excluded the 16 client case
-- that would have allowed me to claim that the growth was under 10%
for a workload where the master case has an index that doubles in
size. On the other hand, if autovacuum wasn't configured to run very
frequently, then the patch wouldn't look nearly this good.
Deduplication helped autovacuum by "soaking up" the "recently dead"
index tuples that cannot be killed right away. In short, the patch
ameliorates weaknesses of the existing garbage collection mechanisms
without changing them. The patch smoothed out the growth of
pgbench_accounts_pkey over many hours. As I said, it was only 1160 MB
after the first 3 hours/first 16 client run. It was 1356 MB after the
second 16 client run (i.e. after running another round of one hour
4/8/16 client runs), finally finishing up at 1745 MB. So the growth in
the size of pgbench_accounts_pkey for the patch was significantly
improved, and the *rate* of growth over time was also improved.

The master branch showed terribly jerky growth in the size of
pgbench_accounts_pkey. It did mostly keep up at first
(i.e. the size of pgbench_accounts_pkey wasn't too different at
first). But once we got to 16 clients for the first time, after a
couple of hours, pgbench_accounts_pkey almost doubled in size over a
period of only 10 or 20 minutes! The index size *exploded* in a very
short period of time, starting only a few hours into the benchmark.
(Maybe we don't see anything like this with the patch because
with the patch backends are more concerned about helping VACUUM, and
less concerned about creating a mess that VACUUM must clean up. Not
sure.)

* We also manage to make the small pgbench indexes
(pgbench_branches_pkey and pgbench_tellers_pkey) over 4x smaller here
(without doing anything to force more non-HOT updates on the
underlying tables).

This result for the two small indexes looks good, but I should point
out that with the patch we still only fit about 15 tuples on each leaf
page by the time everything is over -- far less than the number that
CREATE INDEX stored on the leaf pages immediately (it leaves 366 items
on each leaf page). This is kind of an extreme case, because there is
so much contention, but space utilization with the patch is actually
very bad here. The master branch is very very very bad, though, so
we're at least down to only a single "very" here. Progress.
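
As an aside, the 366 figure can be reproduced with back-of-the-envelope
arithmetic, assuming the defaults (8 kB pages, a leaf fillfactor of 90,
and int4-keyed index tuples, as in pgbench_branches_pkey and
pgbench_tellers_pkey). The constants below are assumed values for that
arithmetic, not anything taken from instrumentation:

#include <stdio.h>

int
main(void)
{
	int		blcksz = 8192;		/* default BLCKSZ */
	int		page_header = 24;	/* fixed part of the page header */
	int		btree_special = 16;	/* nbtree special space */
	int		line_pointer = 4;	/* per-item line pointer */
	int		index_tuple = 16;	/* 8-byte tuple header + int4 key, MAXALIGNed */

	double	usable = (blcksz - page_header - btree_special) * 0.90;

	/* prints 366 with the assumptions above */
	printf("items per leaf page: %d\n",
		   (int) (usable / (line_pointer + index_tuple)));
	return 0;
}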

Any thoughts on the approach taken for unique indexes within
_bt_dedup_one_page() in v20? Obviously that stuff needs to be examined
critically -- it's possible that it wouldn't do as well as it could or
should with other workloads that I haven't thought about. Please take
a look at the details.

--
Peter Geoghegan

Attachment Content-Type Size
v20-0002-DEBUG-Add-pageinspect-instrumentation.patch application/octet-stream 8.6 KB
v20-0001-Add-deduplication-to-nbtree.patch application/octet-stream 157.1 KB
