Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-13 01:04:25
Message-ID: CAH2-WzkjuaM7_aFrfrbPrmow4jakeQmQ=mrntKw_aA9OVvcsRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I think that the new WAL record has to be created once per posting
> list that is generated, not once per page that is deduplicated --
> that's the only way that I can see that avoids a huge increase in
> total WAL volume. Even if we assume that I am wrong about there being
> value in making deduplication incremental, it is still necessary to
> make the WAL-logging behave incrementally.

Attached is v13 of the patch, which shows what I mean. You could say
that v13 makes _bt_dedup_one_page() do a few extra things that are
kind of similar to the things that nbtsplitloc.c does for _bt_split().

More specifically, the v13-0001-* patch includes code that makes
_bt_dedup_one_page() "goal orientated" -- it calculates how much space
will be freed when _bt_dedup_one_page() goes on to deduplicate those
items on the page that it has already "decided to deduplicate". The
v13-0002-* patch makes _bt_dedup_one_page() actually use this ability
-- it makes _bt_dedup_one_page() give up on deduplication when it is
clear that the items that are already "pending deduplication" will
free enough space for its caller to at least avoid a page split. This
revision of the patch doesn't truly make deduplication incremental. It
is only a proof of concept that shows how _bt_dedup_one_page() can
*decide* that it will free "enough" space, whatever that may mean, so
that it can finish early. The task of making _bt_dedup_one_page()
actually avoid lots of work when it finishes early remains.

As I said yesterday, I'm not asking you to accept that v13-0002-* is
an improvement. At least not yet. In fact, "finishes early" due to the
v13-0002-* logic clearly makes everything a lot slower, since
_bt_dedup_one_page() will "thrash" even more than earlier versions of
the patch. This is especially problematic with WAL-logged relations --
the test case that I shared yesterday goes from about 6GB to 10GB with
v13-0002-* applied. But we need to fundamentally rethink the approach
to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note
that total index space utilization is barely affected by the
v13-0002-* patch, so clearly that much works well.)

Other changes:

* Small tweaks to amcheck (nothing interesting, really).

* Small tweaks to the _bt_killitems() stuff.

* Moved all of the deduplication helper functions to nbtinsert.c. This
is where deduplication gets complicated, so I think that it should all
live there. (i.e. nbtsort.c will call nbtinsert.c code, never the
other way around.)

Note that I haven't merged any of the changes from v12 of the patch
from yesterday. I didn't merge the posting list WAL logging changes
because of the bug I reported, but I would have were it not for that.
The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to
be more efficient than your original approach (i.e. calling
log_newpage_buffer()), so I have stuck with your original approach.

It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.

--
Peter Geoghegan

Attachment Content-Type Size
v13-0002-Stop-deduplicating-when-a-page-split-is-avoided.patch application/octet-stream 1.8 KB
v13-0003-DEBUG-Add-pageinspect-instrumentation.patch application/octet-stream 8.6 KB
v13-0001-Add-deduplication-to-nbtree.patch application/octet-stream 119.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-09-13 03:17:46 Re: psql - improve test coverage from 41% to 88%
Previous Message Tsunakawa, Takayuki 2019-09-13 00:21:15 [bug fix??] Fishy code in tts_cirtual_copyslot()