Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-28 02:02:15
Message-ID: CAH2-WzmkUseObezytp_1RPoZ==TbOj6gvc_DuTsVPbCM3Wb05Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> Attached is v19.

Cool.

> * By default deduplication is on for non-unique indexes and off for
> unique ones.

I think that it makes sense to enable deduplication by default -- even
with unique indexes. It looks like deduplication can be very helpful
with non-HOT updates. I have been benchmarking this using more or less
standard pgbench at scale 200, with one big difference -- I also
create an index on "pgbench_accounts (abalance)". This is a low
cardinality index, which ends up about 3x smaller with the patch, as
expected. It also makes all updates non-HOT updates, greatly
increasing index bloat in the primary key of the accounts table --
this is what I found really interesting about this workload.

The theory behind deduplication within unique indexes seems quite
different to the cases we've focussed on so far -- that's why my
working copy of the patch makes a few small changes to how
_bt_dedup_one_page() works with unique indexes specifically (more on
that later). With unique indexes, deduplication doesn't help by
creating space -- it helps by creating *time* for garbage collection
to run before the real "damage" is done -- it delays page splits. This
is only truly valuable when page splits caused by non-HOT updates are
delayed by so much that they're actually prevented entirely, typically
because the _bt_vacuum_one_page() stuff can now happen before pages
split, not after. In general, these page splits are bad because they
degrade the B-Tree structure, more or less permanently (it's certainly
permanent with this workload). Having a huge number of page splits
*purely* because of non-HOT updates is particularly bad -- it's just
awful. I believe that this is the single biggest problem with the
Postgres approach to versioned storage (we know that other DB systems
have no primary key page splits with this kind of workload).

Anyway, if you run this pgbench workload without rate-limiting, so
that a patched Postgres does as much work as physically possible, the
accounts table primary key (pgbench_accounts_pkey) at least grows at a
slower rate -- the patch clearly beats master at the start of the
benchmark/test (as measured by index size). As the clients are ramped
up by my testing script, and as time goes on, eventually the size of
the pgbench_accounts_pkey index "catches up" with master. The patch
delays page splits, but ultimately the system as a whole cannot
prevent the page splits altogether, since the server is being
absolutely hammered by pgbench. Actually, the index is *exactly* the
same size for both the master case and the patch case when we reach
this "bloat saturation point". We can delay the problem, but we cannot
prevent it. But what about a more realistic workload, with
rate-limiting?

When I add some rate limiting, so that the TPS/throughput is at about
50% of what I got the first time around (i.e. 50% of what is
possible), or 15k TPS, it's very different. _bt_dedup_one_page() can
now effectively cooperate with _bt_vacuum_one_page(). Now
deduplication is able to "soak up all the extra garbage tuples" for
long enough to delay and ultimately *prevent* almost all page splits.
pgbench_accounts_pkey starts off at 428 MB for both master and patch
(CREATE INDEX makes it that size). After about an hour, the index is
447 MB with the patch. The master case ends up with a
pgbench_accounts_pkey size of 854 MB, though (this is very close to
857 MB, the "saturation point" index size from before).

This is a very significant improvement, obviously -- the patch has an
index that is ~52% of the size seen for the same index with the master
branch!

Here is how I changed _bt_dedup_one_page() for unique indexes to get
this result:

* We limit the size of posting lists to 5 heap TIDs in the
checkingunique case (see the sketch after these two points). Right
now, we will actually accept a checkingunique page split before we'll
merge together items that would result in a posting list with more
heap TIDs than that (not sure about these details at all, though).

* Avoid creating a new posting list that caller will have to split
immediately anyway (this is based on details of _bt_dedup_one_page()
caller's newitem tuple).

(Not sure how much this customization contributes to this favorable
test result -- maybe it doesn't make that much difference.)
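
To make the cap concrete, here is a minimal C sketch, under stated
assumptions: DedupStateSketch, dedup_tid_fits() and
MAX_UNIQUE_POSTING_TIDS are names I invented for illustration, not
code from the patch.

#include <stdbool.h>

/* Illustrative cap; my working copy of the patch uses 5 */
#define MAX_UNIQUE_POSTING_TIDS 5

typedef struct DedupStateSketch
{
    bool        checkingunique; /* deduplicating a unique index? */
    int         npendingtids;   /* heap TIDs accumulated so far for
                                 * the pending posting list */
} DedupStateSketch;

/*
 * Can one more duplicate's heap TID be merged into the pending
 * posting list, or must the pending list be written out to the page
 * first?  Unique indexes get small posting lists on purpose, so that
 * LP_DEAD-based killing still has something to bite on.
 */
static bool
dedup_tid_fits(const DedupStateSketch *state)
{
    if (state->checkingunique &&
        state->npendingtids >= MAX_UNIQUE_POSTING_TIDS)
        return false;           /* flush pending posting list first */

    return true;                /* merge TID into pending list */
}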

The goal here is for duplicates that are close together in both time
and space to get "clumped together" into their own distinct, small-ish
posting list tuples with no more than 5 TIDs. This is intended to help
_bt_vacuum_one_page(), which is known to be a very important mechanism
for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set
very frequently within _bt_check_unique()). The general idea is to
balance deduplication against LP_DEAD killing, and to increase
spatial/temporal locality within these smaller posting lists. If we
have one huge posting list for each value, then we can't set the
LP_DEAD bit on anything at all, which is very bad -- the LP_DEAD bit
covers the whole line pointer, so a single live TID keeps the entire
posting list tuple alive. If we have a few
posting lists that are not so big for each distinct value, we can
often kill most of them within _bt_vacuum_one_page(), which is very
good, and has minimal downside (i.e. we still get most of the benefits
of aggressive deduplication).
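
The LP_DEAD interaction can be sketched in C as follows. This is an
assumption-laden illustration, not code from the patch (the helper,
its callback, and TidSketch are all hypothetical stand-ins), but the
rule it encodes is the one described above: the hint bit covers the
whole line pointer, so every TID in a posting list must be dead
before the tuple can be killed.

#include <stdbool.h>

/* Hypothetical stand-in for a heap TID (real code uses
 * ItemPointerData) */
typedef struct TidSketch
{
    unsigned    blkno;          /* heap block number */
    unsigned    offno;          /* heap item offset */
} TidSketch;

/*
 * LP_DEAD covers the entire index line pointer, so a posting list
 * tuple is only killable when *all* of its heap TIDs are dead.  One
 * huge posting list per value therefore tends to stay un-killable;
 * several small ones can often be killed outright.
 */
static bool
posting_list_killable(const TidSketch *htids, int nhtids,
                      bool (*tid_is_dead) (const TidSketch *))
{
    for (int i = 0; i < nhtids; i++)
    {
        if (!tid_is_dead(&htids[i]))
            return false;       /* one live TID keeps the whole tuple */
    }

    return true;                /* safe to set LP_DEAD on this tuple */
}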

Interestingly, these non-HOT page splits all seem to "come in waves".
I noticed this because I carefully monitored the benchmark/test case
over time. The patch doesn't prevent the "waves of page splits"
pattern, but it does make it much, much less noticeable.

> * New function _bt_dedup_is_possible() is intended to be a single place
> to perform all the checks. Now it's just a stub to ensure that it works.
>
> Is there a way to extract this from existing opclass information,
> or we need to add new opclass field? Have you already started this work?
> I recall there was another thread, but didn't manage to find it.

The thread is here:
https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ(at)mail(dot)gmail(dot)com

--
Peter Geoghegan
