Re: WIP: Covering + unique indexes.

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2018-04-01 17:09:38
Message-ID: CAPpHfdscXb8Rya=yBKTV_8qCnw=6CxcBLNH99t7j5THyPwDQpg@mail.gmail.com
Lists: pgsql-hackers

Hi, Peter!

On Sat, Mar 31, 2018 at 8:39 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> >> I'll try it. But I'm afraid that it's not as easy as you expect.
> >
> >
> > So, I have some implementation of storage of number of attributes inside
> > index tuple itself. I made it as additional patch on top of previous
> > patchset.
> > I attach the whole patchset in order to make commitfest.cputube.org happy.
>
> Looks like 0004-* became mangled. Can you send a version that is not
> mangled, please?
>

Oh, sorry for that. I forgot to remove some .orig and .rej files, and they
accidentally appeared in the patch. The correct version is attached.

> > I decided not to use 13th bit of IndexTuple flags. Instead I use only
> > high bit of offset which is also always free on regular tuples. In fact,
> > we already use assumption that there is at most 11 significant bits of
> > index tuple offset in GIN (see ginpostinglist.c).
>
> So? GIN doesn't have the same legacy at all. The GIN posting lists
> *don't* have regular heap TID pointers at all. They started out
> without them, and still don't have them.
>

Yes, GIN never stored heap TID pointers in the t_tid of its index tuples.
But GIN does assume that the offset part of a heap TID has at most 11
significant bits during posting list encoding.
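
To illustrate what I mean (the names below are mine, just for this example,
not necessarily what the patch uses): since a real offset never needs more
than 11 significant bits, the high bit of ip_posid is free to flag that the
remaining bits carry an attribute count instead of an offset. Roughly:

    /* Illustration only: hypothetical names, using itemptr.h macros */
    #define BT_OFFSET_MASK      0x0FFF  /* low bits: offset or natts */
    #define BT_NATTS_FLAG_BIT   0x8000  /* high bit: "this stores natts" */

    #define BTreeTupleSetNAtts(itup, n) \
        ItemPointerSetOffsetNumber(&(itup)->t_tid, \
                                   (OffsetNumber) ((n) | BT_NATTS_FLAG_BIT))

    #define BTreeTupleGetNAtts(itup) \
        (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_OFFSET_MASK)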

> > However, I find following arguments against implementing this feature
> > in covering indexes.
> >
> > * We write number of attributes present into tuple. But how to prove
> > that it's correct. I add appropriate checks to amcheck. But I don't
> > think all the users runs amcheck frequent enough. Thus, in order to be
> > sure that it's correct we should check number of attributes is written
> > correct everywhere in the B-tree code.
>
> Use an assertion. Problem solved.
>

I don't think we should rely on assertions, because they are typically
disabled in production PostgreSQL builds. But we can have an explicit check
in some common code path. In the attached patch I've added such a check to
_bt_compare(). Probably, together with amcheck, that would be sufficient.
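
To give an idea of what such a check could look like (this is only a sketch,
not the code from the patch; BTreeTupleGetNAtts is the hypothetical macro
from the sketch above, and IndexRelationGetNumberOfKeyAttributes stands for
whatever accessor the core patch provides for the number of key columns):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "utils/rel.h"

    /*
     * Sketch of a cheap sanity check callable from a common code path such
     * as _bt_compare().  Pivot tuples (all tuples on internal pages, plus
     * the high key on a non-rightmost leaf page) should carry only the key
     * attributes; ordinary leaf tuples should carry them all.  Special
     * cases such as the leftmost downlink are ignored here.
     */
    static void
    check_index_tuple_natts(Relation rel, Page page,
                            OffsetNumber offnum, IndexTuple itup)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        bool        is_pivot;
        int         expected;

        is_pivot = !P_ISLEAF(opaque) ||
                   (!P_RIGHTMOST(opaque) && offnum == P_HIKEY);

        expected = is_pivot ?
            IndexRelationGetNumberOfKeyAttributes(rel) :
            IndexRelationGetNumberOfAttributes(rel);

        if (BTreeTupleGetNAtts(itup) != expected)
            elog(ERROR, "wrong number of attributes in tuple of index \"%s\"",
                 RelationGetRelationName(rel));
    }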

> > * Offset number is used now for parent refind (see BTEntrySame() macro).
> > In the attached patch, this condition is relaxed. But I don't think I
> > really like that. This should be thought out very carefully...
>
> It's safe, although I admit that that's a bit hard to see.
> Specifically, look at this code in _bt_insert_parent():
>
>     /*
>      * Find the parent buffer and get the parent page.
>      *
>      * Oops - if we were moved right then we need to change stack item! We
>      * want to find parent pointing to where we are, right ? - vadim
>      * 05/27/97
>      */
>     ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
>     pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
> Vadim doesn't seem too sure of why he did it that way. What's clear is
> that the offset on all internal pages is always P_HIKEY (that is, 1),
> because this is the one and only place where new IndexTuples get
> generated for internal pages. That's unambiguous. So how could
> BTEntrySame() truly need to care about offset? How could there ever be
> an internal page offset that wasn't just P_HIKEY? You can look
> yourself, using pg_hexedit or pageinspect.
>
> The comments above BTTidSame()/BTEntrySame() are actually wrong,
> including "New Comments". Vadim wanted to make TIDs part of the
> keyspace [1], beginning in around 1997. The idea was that we'd have
> truly unique keys by including TID, as L&Y intended, but that never
> happened. Instead, we got commit 9e85183bf in 2000, which among many
> other things changed the L&Y invariant to deal with duplicates. I
> think that Tom should have changed BTTidSame() to not care about
> offset number in that same commit from 2000.
>
> I actually think that Vadim was correct to want to make heap TID a
> unique-ifier, and that that's the best long term solution [2].
> Unfortunately, the code that he committed in the late 1990s didn't
> really help -- how could it help without including the *entire* heap
> TID? This BTTidSame() offset thing seems to be related to some weird
> logic for duplicates that Tom killed in 9e85183bf, if it ever made
> sense. Note that _bt_getstackbuf(), the only code that uses
> BTEntrySame(), does not look at the offset directly -- because it's
> always P_HIKEY.
>
> Anyway...

OK, thanks for the explanation. I agree that the offset check is redundant
here.
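
For the record, the relaxed check only needs the downlink block number;
something along these lines (my naming, as a sketch; the existing macros are
BTTidSame()/BTEntrySame() in nbtree.h):

    /*
     * Sketch: refinding the parent item only requires the child block
     * number.  The stored offset is always P_HIKEY on internal pages
     * anyway, and with the patch it may hold an attribute count instead.
     */
    #define BTEntrySameBlock(i1, i2) \
        (BlockIdGetBlockNumber(&(i1)->t_tid.ip_blkid) == \
         BlockIdGetBlockNumber(&(i2)->t_tid.ip_blkid))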

> > * Now, hikeys are copied together with original t_tid's. That makes it
> > possible to find the origin of this hikey. If we override offset in
> > t_tid, that becomes not always possible.
>
> ....that just leaves the original high key at the leaf level, as you
> say here. You're right that there is theoretically a loss of forensic
> information from actually storing something in the offset at the leaf
> level, and storing something interesting in the offset during the
> first phase of a page split (not the second, where the aforementioned
> _bt_insert_parent() function gets called). I don't think it's worth
> worrying about, though.
>
> The fact is that that information can go out of date almost
> immediately, whereas high keys usually last forever. The only reason
> that there is a heap TID in the high key is because we'd have to add
> special code to remove it; not because it has any real value. I find
> it very hard to imagine it being used in a forensic situation. If you
> actually wanted to do this, the key itself is probably enough -- you
> probably wouldn't need the TID.
>

I don't know. When I wrote my own B-tree implementation and debugged it,
I found saving hikeys "as is" to be very valuable. However, the B-tree in
PostgreSQL is quite mature, and probably doesn't need so much debug
information.

> > * When index tuple is truncated, then pageinspect probably shouldn't show
> > offset for it, because it meaningless. Should it rather show number of
> > attributes in separate column? Anyway that should be part of suffix
> > truncation
> > patch. Not part of covering indexes patch, especially added at the last
> > moment.
>
> Nobody asked you to write a suffix truncation patch. That has
> complexity above and beyond what the covering index patch needs. I
> just expect it to be compatible with an eventual suffix truncation
> patch, which you've now shown is quite possible. It is clearly a
> complimentary technique.
>

OK, but a change of the on-disk tuple format also changes what people see
in pageinspect. Right now, they see "1" as the offset for tuples on internal
pages and for hikeys. After the patch, they would see some large values in
the offset (assuming we set some of the high bits). I'm not sure that's OK.
We should probably change how pageinspect displays index tuples.
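
For example, bt_page_items() could decode the overloaded field and report
the attribute count in a separate column; a rough sketch of the idea
(BT_NATTS_FLAG_BIT and BT_OFFSET_MASK are the hypothetical names from the
earlier sketch, needing storage/itemptr.h and access/itup.h):

    /*
     * Sketch: return the real offset, or -1 with *natts set when ip_posid
     * actually stores an attribute count, so pageinspect could print the
     * two things in separate columns.
     */
    static int
    display_tuple_offset(IndexTuple itup, int *natts)
    {
        OffsetNumber raw = ItemPointerGetOffsetNumberNoCheck(&itup->t_tid);

        if (raw & BT_NATTS_FLAG_BIT)
        {
            *natts = raw & BT_OFFSET_MASK;
            return -1;          /* no meaningful offset to show */
        }
        *natts = -1;            /* attribute count not stored */
        return (int) raw;
    }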

> * I don't really see how does covering indexes without storing number of
> > index tuple attributes in the tuple itself blocks future work on suffix
> > truncation.
>
> It makes it harder. Your new version gives amcheck a way of
> determining the expected number of attributes. That's the main reason
> to have it, more so than the suffix truncation issue.

I'm sorry, I do not understand. The new version of amcheck determines the
expected number of attributes and compares that to the number of attributes
stored in the offset number. But I can get the *expected* number of
attributes even without storing them in the offset number...

> Suffix truncation matters a lot too, though.
>

Sure, that's a great feature.

> So, taking into account the arguments of above, I propose to give up with
> > idea to stick covering indexes and suffix truncation features together.
> > That wouldn't accelerate appearance one feature after another, but rather
> > likely would RIP both of them...
>
> I think that the thing that's more likely to kill this patch is the
> fact that after the first year, it only ever got discussed in the
> final CF. That's not something that happened because of my choices. I
> made several offers of my time. I did not create this urgency.
>

I'm sorry, my comment was only about a particular feature which I'm not the
biggest fan of (that doesn't mean I'm strictly against it).

I'd like to note that I really appreciate your attention to this patch
as well as other patches.

Anyway, let me know what you think about the attached
0004-Covering-natts-v10.patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-Covering-core-v10.patch application/octet-stream 108.1 KB
0002-Covering-btree-v10.patch application/octet-stream 58.8 KB
0003-Covering-amcheck-v10.patch application/octet-stream 4.3 KB
0004-Covering-natts-v10.patch application/octet-stream 15.7 KB
