Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-09-11 12:37:59
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

09.09.2019 22:54, Peter Geoghegan wrote:
> Attached is v11, which makes the kill_prior_tuple optimization work
> with posting list tuples. The only catch is that it can only work when
> all "logical tuples" within a posting list are known-dead, since of
> course there is only one LP_DEAD bit available for each posting list.
> The hardest part of this kill_prior_tuple work was writing the new
> _bt_killitems() code, which I'm still not 100% happy with. Still, it
> seems to work well -- new pageinspect LP_DEAD status info was added to
> the second patch to verify that we're setting LP_DEAD bits as needed
> for posting list tuples. I also had to add a new nbtree-specific,
> posting-list-aware version of index_compute_xid_horizon_for_tuples()
> -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
> avoid splitting a posting list with the LP_DEAD bit set. I took a
> naive approach to avoiding that problem, adding code to
> _bt_findinsertloc() to prevent it. Posting list splits are generally
> assumed to be rare, so the fact that this is slightly inefficient
> should be fine IMV.
> I also refactored deduplication itself in anticipation of making the
> WAL logging more efficient, and incremental. So, the structure of the
> code within _bt_dedup_one_page() was simplified, without really
> changing it very much (I think). I also fixed a bug in
> _bt_dedup_one_page(). The check for dead items was broken in previous
> versions, because the loop examined the high key tuple in every
> iteration.
> Making _bt_dedup_one_page() more efficient and incremental is still
> the most important open item for the patch.

Hi, thank you for the fixes and improvements.
I reviewed them and everything looks good except the idea of not
splitting dead posting tuples.
According to comments to scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I don't sure, if it leads to any real problem, though, or it will be
by subsequent visibility checks. Anyway, it's worth to add more comments in
_bt_killitems() explaining why it's safe.

Attached is v12, which contains WAL optimizations for posting split and
deduplication. Changes to prior version:

* xl_btree_split record doesn't contain posting tuple anymore, instead
it keeps
'in_posting offset'  and repeats the logic of _bt_insertonpg() as you

* I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains
info about
groups of tuples deduplicated into posting tuples. In principle, it is
to fit it into some existing record, but I preferred to keep things clear.

I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?

* I also noticed that _bt_dedup_one_page() can be optimized to return early
when none tuples were deduplicated. I wonder if we can introduce inner
statistic to tune deduplication? That is returning to the idea of
BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
very few duplicates or pages that are already full of posting lists.

To be honest, I don't believe that incremental deduplication can really
something, because no matter how many items were compressed we still
all items from the original page to the new one, so, why not do our best.
What do we save by this incremental approach?

Anastasia Lubennikova
Postgres Professional:
The Russian Postgres Company

Attachment Content-Type Size
v12-0001-Add-deduplication-to-nbtree.patch text/x-patch 125.5 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2019-09-11 12:38:13 Re: pgbench - allow to create partitioned tables
Previous Message Michael Meskes 2019-09-11 12:32:05 Re: A problem presentaion about ECPG, DECLARE STATEMENT