Re: WIP: Covering + unique indexes.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2018-04-05 02:40:17
Message-ID: CAH2-Wznq7whVwQ=xU1eejvuxkY-Gb18pjy40kmPdkaAba1qfMA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> Thank you for review! Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

    /*
     * On leaf level, the high key of the left page is equal to the first key
     * on the right page.
     */
    if (isleaf)
    {
        ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

> - /* Log left page */
> - if (!isleaf)
> - {
> - /*
> - * We must also log the left page's high key, because the right
> - * page's leftmost key is suppressed on non-leaf levels. Show it
> - * as belonging to the left page buffer, so that it is not stored
> - * if XLogInsert decides it needs a full-page image of the left
> - * page.
> - */
> - itemid = PageGetItemId(origpage, P_HIKEY);
> - item = (IndexTuple) PageGetItem(origpage, itemid);
> - XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> - }
> + /*
> + * We must also log the left page's high key. There are two reasons
> + * for that: the right page's leftmost key is suppressed on non-leaf
> + * levels, and in covering indexes, included columns are truncated
> + * from high keys. For simplicity, we don't distinguish these cases,
> + * but log the high key every time. Show it as belonging to the left
> + * page buffer, so that it is not stored if XLogInsert decides it
> + * needs a full-page image of the left page.
> + */
> + itemid = PageGetItemId(origpage, P_HIKEY);
> + item = (IndexTuple) PageGetItem(origpage, itemid);
> + XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code? Note also that this
existing comment has been made obsolete:

/* don't release the buffer yet; we touch right page's first item below */

/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?
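
To be concrete, here's a minimal sketch of the redo path I have in mind
(variable names as they appear in btree_xlog_split(); this assumes the
high key is now always registered as block 0 data, per your change to
_bt_split()):

    /* Left page's high key is now always present in the block 0 data */
    datapos = XLogRecGetBlockData(record, 0, &datalen);
    left_hikey = (IndexTuple) datapos;
    left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
    datapos += left_hikeysz;
    datalen -= left_hikeysz;

    /* The isleaf block that read the high key from rpage goes away, so
     * the right sibling buffer could be released at this point */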

* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.

I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worth having. It's a big reason for
having BTreeTupGetNAtts().

This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.

* _bt_isequal() should assert that the IndexTuple was not truncated.
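
For both of those assertions, something like this is all I have in mind
(a sketch using the patch's BTreeTupGetNAtts() interface; "rel" and
"itup" stand for each routine's relation and tuple arguments):

    /* Tuple must not be truncated: it should have all index attributes */
    Assert(BTreeTupGetNAtts(itup, rel) ==
           IndexRelationGetNumberOfAttributes(rel));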

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> + /*
> + * Check tuple has correct number of attributes.
> + */
> + if (unlikely(!_bt_check_natts(rel, page, offnum)))
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("tuple has wrong number of attributes in index \"%s\"",
> + RelationGetRelationName(rel))));
> + }

In principle, we should also apply the _bt_check_natts() check to
"minus infinity" items, just like you did within verify_nbtree.c. Also,
there is no need for the braces here.
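
That is, something like this (a sketch; it assumes _bt_check_natts()
knows to expect 0 attributes for "minus infinity" items, as it does
within verify_nbtree.c):

    /*
     * Check tuple has correct number of attributes.  Do this before the
     * minus infinity fastpath, so that minus infinity items are covered
     * as well.
     */
    if (unlikely(!_bt_check_natts(rel, page, offnum)))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("tuple has wrong number of attributes in index \"%s\"",
                        RelationGetRelationName(rel))));

    /*
     * Force result ">" if target item is first data item on an internal
     * page --- see NOTE above.
     */
    if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
        return 1;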

* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.

I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level of the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.

Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.
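
Here is the strong version I'm thinking of (a sketch; I'm calling the
source argument "olditup", and assuming that truncation always sets
INDEX_ALT_TID_MASK on its result, so an already-truncated source tuple
is detectable):

    /* At the top of _bt_truncate_tuple(): the source tuple must never
     * have been truncated before */
    Assert(!(olditup->t_info & INDEX_ALT_TID_MASK));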

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".

We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.
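
In other words (sketch):

    /* itup.h */
    #define INDEX_AM_RESERVED_BIT   0x2000  /* reserved for index-AM
                                             * specific usage */

    /* nbtree.h */
    #define INDEX_ALT_TID_MASK      INDEX_AM_RESERVED_BIT

    /* hash.h */
    #define INDEX_MOVED_BY_SPLIT_MASK   INDEX_AM_RESERVED_BIT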

* We should "use" one of the 4 new status bits that are available in
the offset number (for INDEX_ALT_TID_MASK index tuples), reserving it
for future use by leaf index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.

I guess you could say that I want us to reserve one of our 4 reserve bits.
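
Sketch (the name and the exact bit are only placeholders):

    /* nbtree.h: status bits within the offset number of an
     * INDEX_ALT_TID_MASK index tuple */
    #define BT_N_KEYS_OFFSET_MASK   0x0FFF  /* number of attributes */
    #define BT_ALT_TID_NONPIVOT     0x1000  /* reserved for future use
                                             * by leaf (non-pivot)
                                             * tuples */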

* I think that you could add to this:

> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
> scanned to decide whether to return the entry and whether the scan can
> stop (see _bt_checkkeys()).
>
> +We use the term "pivot" index tuples to distinguish tuples which don't
> +point to heap tuples, but are rather used for tree navigation. Pivot
> +tuples include all tuples on non-leaf pages and high keys on leaf pages.

I like what you came up with, and where you put it, but I would add
another few sentences: "Note that pivot index tuples are only used to
represent which part of the key space belongs on each page, and can
have attribute values copied from non-pivot tuples that were deleted
and killed by VACUUM some time ago. In principle, we could truncate
away attributes that are not needed for a page high key during a leaf
page split, provided that the remaining attributes distinguish the
last index tuple on the post-split left page as belonging on the left
page, and the first index tuple on the post-split right page as
belonging on the right page. This optimization is sometimes called
suffix truncation, and may appear in a future release. Since the high
key is subsequently reused as the downlink in the parent page for the
new right page, suffix truncation can increase index fan-out
considerably by keeping pivot tuples short. INCLUDE indexes similarly
truncate away non-key attributes at the time of a leaf page split,
increasing fan-out."

> Good point. Tests with "heapallindexed" were added. I also find that
> it's useful to check both an index built by sorting and an index built
> by insertions, because there are different ways of forming tuples.

Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.

* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.
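
Roughly like this, when the time comes (a sketch; bt_normalize_tuple()
is a hypothetical helper that would convert a tuple to the most
flexible representation before it is fingerprinted):

    /* Within bt_tuple_present_callback(): hypothetical normalization
     * pass, so that equivalent tuples with and without
     * BT_ALT_TID_NONPIVOT hash identically */
    norm = bt_normalize_tuple(state, itup);
    bloom_add_element(state->filter, (unsigned char *) norm,
                      IndexTupleSize(norm));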

> Ok. I've tried to remove both the assertions and the "+1" hack. That
> works for me. However, I had to touch a lot of places; not sure if
> that's a problem.

Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.

* You missed this comment, which is no longer accurate:

> + * It's possible that an index tuple has zero attributes (the leftmost
> + * item of an internal page). And we have an assertion that the offset
> + * number is greater than or equal to 1. This is why we store
> + * (number_of_attributes + 1) in the offset number.
> + */

I can see that it is actually 0 for a minus infinity item, which is good.
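
So the convention is now simply this (sketch of the macro as I
understand the latest version, with BT_N_KEYS_OFFSET_MASK per the
earlier suggestion):

    #define BTreeTupGetNAtts(itup, rel) \
        ( \
            ((itup)->t_info & INDEX_ALT_TID_MASK) ? \
            (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & \
             BT_N_KEYS_OFFSET_MASK) : \
            IndexRelationGetNumberOfAttributes(rel) \
        )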

> I wrote some comments there. Please check them.

The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.

> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages. I think that is how the code works now, unless
> I'm missing something.

It does, and should. Thanks.

> Thanks for pointing that out. Since there are now three cases, including
> the handling of "minus infinity" items, the comment is now split into three.

That looks good. Thanks.

Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.

> I don't see a relation between this patchset and SSI. We just
> change the representation of some index tuples in pages. However,
> we didn't change the order of page modification, the order
> of page lookup and so on. Yes, we change the size of some tuples,
> but B-tree already worked with tuples of variable sizes. So, the fact
> that tuples now have different sizes shouldn't affect SSI. Right now,
> I'm not sure if CheckForSerializableConflictIn() just above
> _bt_doinsert() is a good idea. But even if so, I think that should be
> the subject of a separate patch.

My point was that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".

That's it for today.
--
Peter Geoghegan
