Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-08-29 12:13:39
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

28.08.2019 6:19, Peter Geoghegan wrote:
> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
> <a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>> Now the algorithm is the following:
>> - In case page split is needed, pass both tuples to _bt_split().
>> _bt_findsplitloc() is now aware of upcoming replacement of origtup with
>> neworigtup, so it uses correct item size where needed.
>> It seems that now all replace operations are crash-safe. The new patch passes
>> all regression tests, so I think it's ready for review again.
> I think that the way this works within nbtsplitloc.c is too
> complicated. In v5, the only thing that nbtsplitloc.c knew about
> deduplication was that it could be sure that suffix truncation would
> at least make a posting list into a single heap TID in the worst case.
> This consideration was mostly about suffix truncation, not
> deduplication, which seemed like a good thing to me. _bt_split() and
> _bt_findsplitloc() should know as little as possible about posting
> lists.
> Obviously it will sometimes be necessary to deal with the case where a
> posting list is about to become too big (i.e. it's about to go over
> BTMaxItemSize()), and so must be split. Less often, a page split will
> be needed because of one of these posting list splits. These are two
> complicated areas (posting list splits and page splits), and it would
> be a good idea to find a way to separate them as much as possible.
> Remember, nbtsplitloc.c works by pretending that the new item that
> cannot fit on the page is already on its own imaginary version of the
> page that *can* fit the new item, along with everything else from the
> original/actual page. That gets *way* too complicated when it has to
> deal with the fact that the new item is being merged with an existing
> item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
> always a plain tuple, without knowing anything about posting lists.
> Almost like how it worked in v5.
> We always want posting lists to be as close to the BTMaxItemSize()
> size as possible, because that helps with space utilization. In v5 of
> the patch, this was what happened, because, in effect, we didn't try
> to do anything complicated with the new item. This worked well, apart
> from the crash safety issue. Maybe we can simulate the v5 approach,
> giving us the best of all worlds (good space utilization, simplicity,
> and crash safety). Something like this:
> * Posting list splits should always result in one posting list that is
> at or just under BTMaxItemSize() in size, plus one plain tuple to its
> immediate right on the page. This is similar to the more common case
> where we cannot add additional tuples to a posting list due to the
> BTMaxItemSize() restriction, and so end up with a single tuple (or a
> smaller posting list with the same value) to the right of a
> BTMaxItemSize()-sized posting list tuple. I don't see a reason to
> split a posting list in the middle -- we should always split to the
> right, leaving the posting list as large as possible.
> * When there is a simple posting list split, with no page split, the
> logic required is fairly straightforward: We rewrite the posting list
> in-place so that our new item goes wherever it belongs in the existing
> posting list on the page (we memmove() the posting list to make space
> for the new TID, basically). The old last/rightmost TID in the
> original posting list becomes a new, plain tuple. We may need a new
> WAL record for this, but it's not that different to a regular leaf
> page insert.
> * When this happens to result in a page split, we then have a "fake"
> new item -- the right half of the posting list that we split, which is
> always a plain item. Obviously we need to be a bit careful with the
> WAL logging, but the space accounting within _bt_split() and
> _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
> work like it did in v5, when the only thing it knew about posting
> lists was that _bt_truncate() always removes them, maybe leaving a
> single TID behind in the new high key. (Note also that it's not okay
> to remove the conservative assumption about at least having space for
> one heap TID within _bt_recsplitloc() -- that needs to be restored to
> its v5 state in the next version of the patch.)
> Because deduplication is lazy, there is little value in doing
> deduplication of the new item (which may or may not be the fake new
> item). The nbtsplitloc.c logic will "trap" duplicates on the same page
> today, so we can just let deduplication of the new item happen at a
> later time. _bt_split() can almost pretend that posting lists don't
> exist, and nbtsplitloc.c needs to know nothing about posting lists
> (apart from the way that _bt_truncate() behaves with posting lists).
> We "lie" to _bt_findsplitloc(), and tell it that the new item is our
> fake new item -- it doesn't do anything that will be broken by that
> lie, because it doesn't care about the actual content of posting
> lists. And, we can fix the "fake new item is not actually real new
> item" issue at one point within _bt_split(), just as we're about to
> WAL log.
> What do you think of that approach?

I think it's a good idea. Thank you for such a detailed description of
cases. I already started to simplify this code, while debugging amcheck
in v8. At first, I rewrote it to split posting tuple into a posting and a
regular tuple instead of two posting tuples.

Your explanation helped me to understand that this approach can be
extended to
the case of insertion into posting list, that doesn't trigger posting
and that nbtsplitloc indeed doesn't need to know about posting tuples
The code is much cleaner now.

The new version is attached. It passes regression tests. I also run land
tpch test. They pass amcheck rootdescend and if I interpreted results
correctly, the new version shows slightly better compression.
 tpch      | anastasia | UTF8     | ru_RU.UTF-8 | ru_RU.UTF-8 | | 31
GB   | pg_default |
 land      | anastasia | UTF8     | ru_RU.UTF-8 | ru_RU.UTF-8 | | 6380
MB | pg_default |

Some individual indexes are larger, some are smaller compared to the
expected output.

This patch is based on v6, so it again contains "compression" instead of
in variable names and comments. I will rename them when code becomes
more stable.

Anastasia Lubennikova
Postgres Professional:
The Russian Postgres Company

Attachment Content-Type Size
v9-0001-Compression-deduplication-in-nbtree.patch text/x-patch 102.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message fn ln 2019-08-29 12:35:43 Re: BUG #15977: Inconsistent behavior in chained transactions
Previous Message Fabien COELHO 2019-08-29 12:10:06 Re: BUG #15977: Inconsistent behavior in chained transactions