Re: WIP: Covering + unique indexes.

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2018-04-02 23:27:33
Message-ID: CAPpHfdva7aS=DMGPXKe8fV5fe3w8tXzzREndZvgaW+gWvpsT7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 2, 2018 at 1:18 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> >> So? GIN doesn't have the same legacy at all. The GIN posting lists
> >> *don't* have regular heap TID pointers at all. They started out
> >> without them, and still don't have them.
> >
> >
> > Yes, GIN never stored heap TID pointers in t_tid of index tuple. But GIN
> > assumes that heap TID pointer has at most 11 significant bits during
> > posting list encoding.
>
> I think that we should avoid assuming things, unless the cost of
> representing them is too high, which I don't think applies here. The
> more defensive general purpose code can be, the better.
>

I thought abut that another time and I decided that it would be safer
to use 13th bit in index tuple flags. There are already attempt to
use whole 6 bytes of tid for not heap pointer information [1]. Thus, it
would be safe to use 13th bit for indicating alternative offset meaning
in pivot tuples, because it wouldn't block further work. Revised patchset
in the attachment implements it.

> I don't think we should use assertions, because they are typically
> disabled
> > on
> > production PostgreSQL builds. But we can have some explicit check in
> some
> > common path. In the attached patch I've such check to _bt_compare().
> > Probably,
> > together with amcheck, that would be sufficient.
>
> Good idea -- a "can't happen" check in _bt_compare seems better, which
> I see here:
>
> > diff --git a/src/backend/access/nbtree/nbtsearch.c
> b/src/backend/access/nbtree/nbtsearch.c
> > index 51dca64e13..fcf9832147 100644
> > --- a/src/backend/access/nbtree/nbtsearch.c
> > +++ b/src/backend/access/nbtree/nbtsearch.c
> > @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> > if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> > return 1;
> >
> > + /*
> > + * Check tuple has correct number of attributes.
> > + */
> > + if (!_bt_check_natts(rel, page, offnum))
> > + {
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INTERNAL_ERROR),
> > + errmsg("tuple has wrong number of attributes in index
> \"%s\"",
> > + RelationGetRelationName(rel))));
> > + }
> > +
> > itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
>
> It seems like it might be a good idea to make this accept an
> IndexTuple, though, to possibly save some work.

I don't know. We still need an offset number to check expected number
of attributes. Passing index tuple as separate attribute would be
redundant and open door for extra possible errors.

> lso, perhaps this
> should be an unlikely() condition, if only because it makes the intent
> clearer (might actually matter in a tight loop like this too, though).
>

OK, marked that check as unlikely().

Do you store an attribute number in the "minus infinity" item (the
> leftmost one of internal pages)? I guess that that should be zero,
> because it's totally truncated.
>

Yes, I store zero number of attributes in "minus infinity" item. See this
part of the patch.

@@ -2081,7 +2081,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
left_item_sz = sizeof(IndexTupleData);
left_item = (IndexTuple) palloc(left_item_sz);
left_item->t_info = left_item_sz;
- ItemPointerSet(&(left_item->t_tid), lbkno, P_HIKEY);
+ ItemPointerSetBlockNumber(&(left_item->t_tid), lbkno);
+ BTreeTupSetNAtts(left_item, 0);

However, note that I've to store (number_of_attributes + 1) in the offset
in order to correctly store zero number of attributes. Otherwise, assertion
is faised in ItemPointerIsValid() macro.

/*
* ItemPointerIsValid
* True iff the disk item pointer is not NULL.
*/
#define ItemPointerIsValid(pointer) \
((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))

>> The fact is that that information can go out of date almost
> >> immediately, whereas high keys usually last forever. The only reason
> >> that there is a heap TID in the high key is because we'd have to add
> >> special code to remove it; not because it has any real value. I find
> >> it very hard to imagine it being used in a forensic situation. If you
> >> actually wanted to do this, the key itself is probably enough -- you
> >> probably wouldn't need the TID.
> >
> >
> > I don't know, When I wrote my own implementation of B-tree and debug
> > it, I found saving hikeys "as is" to be very valuable for debugging.
>
> I would like to see your implementation at some point. That sounds
> interesting.
>

It was in-memory B-tree [2]. Will be published one day.

> > However, B-trees in PostgreSQL are quite mature, and probably
> > don't need so much debug information.
>
> Today, the highkey at the leaf level is an exact copy of the right
> sibling's first item immediately after the split. The absence of a
> usable heap TID offset (due to using it for number of attributes in
> high keys) is unlikely to make it harder to locate that right
> sibling's first item (to get a full heap TID), which could have moved
> a lot further right after the split, or even have been removed
> entirely. It could now be ambiguous where it wouldn't have been before
> in the event of duplicates, but it's unlikely. And when it does
> happen, it's unlikely to matter.
>
> We can still include the heap block number, I suppose. I think of the
> highkey as only having one simple job -- separating the keyspace
> between siblings. We actually have a very neat choke point to check
> that it does that one job -- when a high key is generated for a page
> split at the leaf level. If we were doing generic suffix truncation,
> we'd add a test that made sure that the high key was strictly greater
> than the last item on the left, and strictly less than the first item
> on the right. As I said yesterday, I don't like how we allow a highkey
> to be equal to both sides of the split, which goes against L&Y, and I
> think that we would at least be strict about < and > for suffix
> truncation.
>
> The highkey's actual value can be invented, provided it does this one
> simple job, which needs to be assessed only once at our "neat choke
> point". Everything else falls into place afterwards, since that's
> where teh downlink actually comes from. You can check it during a leaf
> page split while debugging (that's the neat choke point). That's why
> the high key doesn't seem very interesting from a debuggability
> perspective.
>

OK. So, I mentioned that storing hikeys "as is" might be useful
for debug in some cases. That doesn't mean it's irreplaceable. For sure,
there are other ways to debug and obtain debugging information which
could be even better in certain situation.

>> Nobody asked you to write a suffix truncation patch. That has
> >> complexity above and beyond what the covering index patch needs. I
> >> just expect it to be compatible with an eventual suffix truncation
> >> patch, which you've now shown is quite possible. It is clearly a
> >> complimentary technique.
> >
> >
> > OK, but change of on-disk tuple format also changes what people
> > see in pageinspect. Right now, they see "1" as offset for tuples in
> intenal
> > page and hikeys. After patch, they would see some large values
> > (assuming we set some of hi bits) in offset. I'm not sure it's OK.
> > We probably should change display of index tuples in pageinspect.
>
> This reminds me of a discussion I had with Robert Haas about
> pageinspect + t_infomask bits. Robert thought that we should show the
> composite bits as single constants, where we do that (with things like
> HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just
> show "the bits that are on the page", while also documenting that this
> situation exists in pageinspect directly.
>
> I think something similar here. I think it's okay to just show offset,
> provided it is documented. We have a number of odd things within
> nbtree that I actually saw to it were documented, such as the "minus
> infinity" item on internal pages, which looks odd and out of places. I
> remember Tatsuo Ishii asked about it before this happened. It seems
> helpful to show what's really there, and offer guidance on how to
> interpret it. I actually thought carefully about many things like this
> for pg_hexedit, which tries to be very consistent and logical, uses
> color to suggest meaning, and so on.
>
> Anyway, that's what I think about it, though I wouldn't really care if
> I lost that particular argument and we did something special with
> internal page offset in pageinspect. It seems like a matter of
> opinion, or aesthetics.
>

I just thought that users might be confused when "1" offset in internal
pages became "32769" or something. However, with attached patchset
it would be as least some more obvious value which is easier to interpret
in decimal form.

> I'd like to note that I really appreciate your attention to this patch
> > as well as other patches.
>
> Thanks. I would like to thank Anastasia and you for your patience and
> perseverance, despite what I see as mistakes in how this project was
> manged. I really want for it to be possible for there to be more

patches in the nbtree code, because they're really needed. That was a
> big part of my motivation for writing amcheck, in fact. It's tedious
> to link this patch to a bigger picture about what we need to do with
> nbtree in the next 5 years, but I think that that's what it will take
> to get this patch in. That's my opinion.
>

Yes. But that depends on how difficulty to adopt patch to big picture
correlate with difficulty, which non-adopted patch makes to that big
picture. My point was that second difficulty isn't high. But we can be
satisfied with implementation in the attached patchset (probably some
small enhancements are still required), then the first difficulty isn't
high too.

1.
https://www.postgresql.org/message-id/20161230223530.gvu4azrsfqzr5eus%40alvherre.pgsql
2.
https://www.slideshare.net/AlexanderKorotkov/inmemory-oltp-storage-with-persistence-and-transaction-support

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-Covering-core-v11.patch application/octet-stream 108.1 KB
0002-Covering-btree-v11.patch application/octet-stream 58.8 KB
0003-Covering-amcheck-v11.patch application/octet-stream 4.3 KB
0004-Covering-natts-v11.patch application/octet-stream 16.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2018-04-02 23:27:35 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Previous Message Andres Freund 2018-04-02 23:23:24 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS