Re: WIP: Covering + unique indexes.

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2018-04-05 14:59:17
Message-ID: CAPpHfdt5oj0+h7GaNJtRf+kePN5V9uRr0fhFnOgghYQqrxMHfg@mail.gmail.com
Lists: pgsql-hackers

Hi!

Thank you for the review. A revised patchset is attached.

On Thu, Apr 5, 2018 at 5:40 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > Thank you for review! Revised patchset is attached.
>
> Cool.
>
> * btree_xlog_split() still has this code:
>
>     /*
>      * On leaf level, the high key of the left page is equal to the first
>      * key on the right page.
>      */
>     if (isleaf)
>     {
>         ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
>
>         left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
>         left_hikeysz = ItemIdGetLength(hiItemId);
>     }
>
> However, we never fail to store the high key now, even at the leaf
> level, because of this change to the corresponding point in
> _bt_split():
>
> So should we remove the first block of code?

Right, I think there is absolutely no need for this code. It's removed in
the attached patchset.

> Note also that this
> existing comment has been made obsolete:
>
> /* don't release the buffer yet; we touch right page's first item below */
>
> /* Now reconstruct left (original) sibling page */
> if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)
>
> Maybe we *should* release the right sibling buffer at the point of the
> comment now?
>

Agreed. We don't need to hold the right buffer to get the high key from it.
The only remaining concern is concurrency on the standby. But the right page
is unreferenced at this point, and nobody should try to read it before we
finish the split.

> * _bt_mkscankey() should assert that the IndexTuple has the correct
> number of attributes.
>
> I don't expect you to change routines like _bt_mkscankey() so they
> actually respect the number of attributes from BTreeTupGetNAtts(),
> rather than just relying on IndexRelationGetNumberOfKeyAttributes().
> However, an assertion seems well worthwhile. It's a big reason for
> having BTreeTupGetNAtts().
>

OK, I've added an assertion that the number of tuple attributes is
either natts or nkeyatts, because we call _bt_mkscankey() for
pivot index tuples too.

> This also lets you get rid of at least one assertion from
> _bt_doinsert(), I think.
>

If you're talking about these assertions

Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
Assert(indnkeyatts != 0);

then I would rather keep both of them. Knowing that the index tuple
length is either natts or nkeyatts doesn't guarantee that both natts
and nkeyatts are non-zero.

> * _bt_isequal() should assert that the IndexTuple was not truncated.
>

Agreed. An assertion is added. I had to change the signature of
_bt_isequal() to do that. However, that shouldn't cause any problems:
_bt_isequal() is static anyway.

> * The order could be switched here:

>
> > @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> >     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> >         return 1;
> >
> > +   /*
> > +    * Check tuple has correct number of attributes.
> > +    */
> > +   if (unlikely(!_bt_check_natts(rel, page, offnum)))
> > +   {
> > +       ereport(ERROR,
> > +               (errcode(ERRCODE_INTERNAL_ERROR),
> > +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> > +                       RelationGetRelationName(rel))));
> > +   }
>
> In principle, we should also check _bt_check_natts() for "minus
> infinity" items, just like you did within verify_nbtree.c. Also, there
> is no need for parenthesis here.
>

Right. The order is switched: we now check the number of attributes
before checking for the "minus infinity" item. The extra parentheses are
removed.

> * Maybe _bt_truncate_tuple() should assert that the caller has not
> tried to truncate a tuple that has already been truncated.
>
> I'm not sure if our assertion should be quite that strong, but I think
> that that might be good because in general we only need to truncate on
> the leaf level -- truncating at any other level on the tree (e.g.
> doing traditional suffix truncation) is always subtly wrong. What we
> definitely should do, at a minimum, is make sure that attempting to
> truncate a tuple to 2 attributes when it already has 0 attributes
> fails with an assertion failure.
>
> Can you try adding the strong assertion (truncate only once) to
> _bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
> try.
>

I've done so. Tests are passing for me.

> * I suggest we invent a new flag for 0x2000 within itup.h, to replace
> "/* bit 0x2000 is reserved for index-AM specific usage */".
>
> We can call it INDEX_AM_RESERVED_BIT. Then, we can change
> INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
> same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
> neater.
>

Good point, done.

> * We should "use" one of the 4 new status bits that are available from
> an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
> index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.
>

Hmm, we have four bits reserved, but I'm not sure whether we would use
*all* of them for non-pivot tuples. Probably we would use some of them for
pivot tuples; I don't know that in advance. Thus, I propose not to rename
this, but I've added a comment that non-pivot tuples might also use those
bits.

> I guess you could say that I want us to reserve one of our 4 reserve bits.
>

Sorry, I didn't understand which particular future use of the reserved
bits you mean. Did you mean key normalization?

> * I think that you could add to this:
>
> > +++ b/src/backend/access/nbtree/README
> > @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
> > scanned to decide whether to return the entry and whether the scan can
> > stop (see _bt_checkkeys()).
> >
> > +We use term "pivot" index tuples to distinguish tuples which don't point
> > +to heap tuples, but rather used for tree navigation. Pivot tuples includes
> > +all tuples on non-leaf pages and high keys on leaf pages.
>
> I like what you came up with, and where you put it, but I would add
> another few sentences: "Note that pivot index tuples are only used to
> represent which part of the key space belongs on each page, and can
> have attribute values copied from non-pivot tuples that were deleted
> and killed by VACUUM some time ago. In principle, we could truncate
> away attributes that are not needed for a page high key during a leaf
> page split, provided that the remaining attributes distinguish the
> last index tuple on the post-split left page as belonging on the left
> page, and the first index tuple on the post-split right page as
> belonging on the right page. This optimization is sometimes called
> suffix truncation, and may appear in a future release. Since the high
> key is subsequently reused as the downlink in the parent page for the
> new right page, suffix truncation can increase index fan-out
> considerably by keeping pivot tuples short. INCLUDE indexes similarly
> truncate away non-key attributes at the time of a leaf page split,
> increasing fan-out."

Thank you for writing that explanation. Looks good.

> > Good point. Tests with "heapallindexed" were added. I also find that it's
> > useful to check both index built by sorting, and index built by insertions,
> > because there are different ways of forming tuples.
>
> Right. It's a good cross-check for things like that. We'll have to
> teach bt_tuple_present_callback() to normalize the representation in
> some way for the BT_ALT_TID_NONPIVOT case in the future. But it
> already talks about normalizing for reasons like this, so that's okay.
>

Ok.

> * I think you should add a note about BT_ALT_TID_NONPIVOT to
> bt_tuple_present_callback(), though. If it cannot be sure that every
> non-pivot tuple will have the same representation, amcheck will have
> to normalize to the most flexible representation before hashing.
>

Ok. I've added a relevant comment.

> > Ok. I've tried to remove both assertions and "+1" hack. That works
> > for me. However, I've to touch a lot of places, not sure if that's a
> > problem.
>
> Looks good to me. If it makes an assertion fail, that's probably a
> good thing, because it would have been broken before anyway.
>

Ok.

> * You missed this comment, which is now not accurate:
>
> > + * It's possible that index tuple has zero attributes (leftmost item of
> > + * iternal page). And we have assertion that offset number is greater or
> > + * equal to 1. This is why we store (number_of_attributes + 1) in offset
> > + * number.
> > + */
>

Right. This comment is no longer needed, so it's removed.

> I can see that it is actually 0 for a minus infinity item, which is good.
>

Ok.

> > I wrote some comment there. Please, check it.
>
> The nbtsort.c comments could maybe do with some tweaks from a native
> speaker, but look correct.
>
> > Regarding !P_ISLEAF(), I think we should check every item on both
> > leaf and non-leaf pages. I think that is how the code now works unless
> > I'm missing something.
>
> It does, and should. Thanks.
>
> > Thanks for pointing. Since there are now three cases including handling
> > of "minus infinity" items, comment is now split to three.
>
> That looks good. Thanks.
>

Ok.

> Right now, it looks like every B-Tree index could use
> INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
> index. I think that that's fine, but let's say so in the paragraph
> that introduces INDEX_ALT_TID_MASK. This patch establishes that any
> nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
> something that can be expected. It's also something that might not be
> set when pg_upgrade was used, but that's fine too.
>

I've added a comment about that.

> > I don't see a relation between this patchset and SSI. We just
> > change representation of some index tuples in pages. However,
> > we didn't change the order of page modification, the order
> > of page lookup and so on. Yes, we change size of some tuples,
> > but B-tree already worked with tuples of variable sizes. So, the fact
> > that tuples now have different size shouldn't affect SSI. Right now,
> > I'm not sure if CheckForSerializableConflictIn() just above the
> > _bt_doinsert() is good idea. But even if so, I think that should be
> > a subject of separate patch.
>
> My point was that that nothing changes, because we already use what
> _bt_doinsert() calls the "first valid" page. Maybe just add: "(This
> reasoning also applies to INCLUDE indexes, whose extra attributes are
> not considered part of the key space.)".
>

Ok. I've added this comment.

This patchset also incorporates docs enhancements by Erik Rijkers and a
sentence stating that indexes with included columns are also called
"covering indexes".

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-Covering-core-v13.patch application/octet-stream 109.0 KB
0002-Covering-btree-v13.patch application/octet-stream 58.8 KB
0003-Covering-amcheck-v13.patch application/octet-stream 5.3 KB
0004-Covering-natts-v13.patch application/octet-stream 30.9 KB
