Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-07-11 15:02:04
Message-ID: CAPpHfdtAjvfMcNp6h=JcNwv8rpMcuCXXq9NOJ6PYX-Jdxf0OJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter,

Thank you very much for your attention to this patch. Let me comment
some points of your message.

On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
> <a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> > The new version of the patch is attached.
> > This version is even simpler than the previous one,
> > thanks to the recent btree design changes and all the feedback I received.
> > I consider it ready for review and testing.
>
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.
>
> Preserving the *logical contents* of B-Tree indexes that use
> compression is very important -- that should not change in a way that
> outside code can notice. The heap TID itself should count as logical
> contents here, since we want to be able to implement retail index
> tuple deletion in the future. Even without retail index tuple
> deletion, amcheck's "rootdescend" verification assumes that it can
> find one specific tuple (which could now just be one specific "logical
> tuple") using specific key values from the heap, including the heap
> tuple's heap TID. This requirement makes things a bit harder for your
> patch, because you have to deal with one or two edge-cases that you
> currently don't handle: insertion of new duplicates that fall inside
> the min/max range of some existing posting list. That should be rare
> enough in practice, so the performance penalty won't be too bad. This
> probably means that code within _bt_findinsertloc() and/or
> _bt_binsrch_insert() will need to think about a logical tuple as a
> distinct thing from a physical tuple, though that won't be necessary
> in most places.

Could you please elaborate more on preserving the logical contents? I
can understand it as following: "B-Tree should have the same structure
and invariants as if each TID in posting list be a separate tuple".
So, if we imagine each TID to become separate tuple it would be the
same B-tree, which just can magically sometimes store more tuples in
page. Is my understanding correct? But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does). Do you mean we need an API for accessing
logical B-tree tuples or something?

> The need to "preserve the logical contents" also means that the patch
> will need to recognize when indexes are not safe as a target for
> compression/deduplication (maybe we should call this feature
> deduplilcation, so it's clear how it differs from TOAST?). For
> example, if we have a case-insensitive ICU collation, then it is not
> okay to treat an opclass-equal pair of text strings that use the
> collation as having the same value when considering merging the two
> into one. You don't actually do that in the patch, but you also don't
> try to deal with the fact that such a pair of strings are equal, and
> so must have their final positions determined by the heap TID column
> (deduplication within _bt_compress_one_page() must respect that).
> Possibly equal-but-distinct values seems like a problem that's not
> worth truly fixing, but it will be necessary to store metadata about
> whether or not we're willing to do deduplication in the meta page,
> based on operator class and collation details. That seems like a
> restriction that we're just going to have to accept, though I'm not
> too worried about exactly what that will look like right now. We can
> work it out later.

I think in order to deduplicate "equal but distinct" values we need at
least to give up with index only scans. Because we have no
restriction that equal according to B-tree opclass values are same for
other operations and/or user output.

> I think that the need to be careful about the logical contents of
> indexes already causes bugs, even with "safe for compression" indexes.
> For example, I can sometimes see an assertion failure
> within_bt_truncate(), at the point where we check if heap TID values
> are safe:
>
> /*
> * Lehman and Yao require that the downlink to the right page, which is to
> * be inserted into the parent page in the second phase of a page split be
> * a strict lower bound on items on the right page, and a non-strict upper
> * bound for items on the left page. Assert that heap TIDs follow these
> * invariants, since a heap TID value is apparently needed as a
> * tiebreaker.
> */
> #ifndef DEBUG_NO_TRUNCATE
> Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
> BTreeTupleGetMinTID(firstright)) < 0);
> ...
>
> This bug is not that easy to see, but it will happen with a big index,
> even without updates or deletes. I think that this happens because
> compression can allow the "logical tuples" to be in the wrong heap TID
> order when there are multiple posting lists for the same value. As I
> said, I think that it's necessary to see a posting list as being
> comprised of multiple logical tuples in the context of inserting new
> tuples, even when you're not performing compression or splitting the
> page. I also see that amcheck's bt_index_parent_check() function
> fails, though bt_index_check() does not fail when I don't use any of
> its extra verification options. (You haven't updated amcheck, but I
> don't think that you need to update it for these basic checks to
> continue to work.)

Do I understand correctly that current patch may produce posting lists
of the same value with overlapping ranges of TIDs? If so, it's
definitely wrong.

> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with
> NULLs into a unique index as if it wasn't even a unique index -- see
> the "checkingunique" optimization at the beginning of _bt_doinsert().
> Having many NULL values in a unique index is probably fairly common.

I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-07-11 15:08:22 Re: [sqlsmith] Crash in mcv_get_match_bitmap
Previous Message Tomas Vondra 2019-07-11 14:59:32 Re: [sqlsmith] Crash in mcv_get_match_bitmap