Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-16 15:48:44
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

13.09.2019 4:04, Peter Geoghegan wrote:
> On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I think that the new WAL record has to be created once per posting
>> list that is generated, not once per page that is deduplicated --
>> that's the only way that I can see that avoids a huge increase in
>> total WAL volume. Even if we assume that I am wrong about there being
>> value in making deduplication incremental, it is still necessary to
>> make the WAL-logging behave incrementally.
> It would be good to hear your thoughts on this _bt_dedup_one_page()
> WAL volume/"write amplification" issue.
Attached is v14 based on v12 (v13 changes are not merged).

In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.

I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they
Here are test results (config is standard, except fsync=off to speedup

'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().


|        ---        |   FPW on  |  FPW off  | FORCE_NO_IMAGE |   master  |


| time              | 09:12 min | 06:56 min | 06:24 min      | 08:10 min |

| nbtree_wal_volume | 8083 MB   | 2128 MB   | 2327 MB        | 2439 MB   |

| index_size        | 169 MB    | 169 MB    | 169 MB         | 1118 MB   |


With random insertions into btree it's highly possible that
deduplication will often be
the first write after checkpoint, and thus will trigger FPW, even if
only a few tuples were compressed.
That's why there is no significant difference with log_newpage_buffer()
And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

Also, since the index is packed way better than before, it probably
benefits less of wal_compression.

One possible "fix" to decrease WAL amplification is to add
REGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().
As you can see from test result, it really eliminates the problem of
inadequate WAL amount.
However, I doubt that it is a crash-safe idea.

Another, and more realistic approach is to make deduplication less
if freed space is less than some threshold, fall back to not changing
page at all and not generating xlog record.

Probably that was the reason, why patch became faster after I added
BT_COMPRESS_THRESHOLD in early versions,
not because deduplication itself is cpu bound or something, but because
WAL load decreased.

So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings.  Any ideas?

I also noticed that the number of checkpoints differ between tests:
select checkpoints_req from pg_stat_bgwriter ;


|       ---       |  FPW on | FPW off | FORCE_NO_IMAGE | master |


| checkpoints_req |      16 |       7 |              8 |     10 |


And I struggle to explain the reason of this.
Do you understand what can cause the difference?

Anastasia Lubennikova
Postgres Professional:
The Russian Postgres Company

Attachment Content-Type Size
v14-0001-Add-deduplication-to-nbtree.patch text/x-patch 126.4 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-09-16 15:52:56 Re: block-level incremental backup
Previous Message Tom Lane 2019-09-16 15:20:08 Re: Define jsonpath functions as stable