Re: [PROPOSAL] Covering + unique indexes.

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: Rod Taylor <rod(dot)taylor(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Covering + unique indexes.
Date: 2016-03-16 01:51:56
Message-ID: CAM3SWZQmxk1HnV8KgcODoZyLVgfaVEpB84Ci1YyMBPNYux0Wmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 14, 2016 at 8:43 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Does this work with amcheck? Maybe it works with bt_index_check(), but
> not bt_index_parent_check()? I think that you need to make sure that
> _bt_compare() knows about this, too. That's because it isn't good
> enough to let a truncated internal IndexTuple compare equal to a
> scankey when non-truncated attributes are equal. I think you need to
> have an imaginary "minus infinity" attribute past the first
> non-truncated attribute (i.e. "minus infinity value" for the first
> *truncated* attribute). That way, the downlinks will always be lower
> bounds when the non-"included"/truncated attributes are involved. This
> seems necessary. No?

Actually, I now think I got this slightly wrong.

What's at issue is this (from nbtree README):

"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page.

"""

Today, nbtree needs to check the page to the left of an equal internal
page downlink child anyway. That isn't hard-coded into _bt_compare(),
though. If it was, it would be a "positive infinity" attribute, not
"negative infinity" as I incorrectly said. This is because the equal
IndexTuples might easily not begin exactly at the beginning of the
downlink's child page (often, we should begin in the left page
instead, by following the previous downlink in the parent instead --
just in case).

Any kind of "infinity" attribute probably isn't necessary for your
patch today, since, as referenced in the README extract above, our
slightly non-standard L&Y does this in _bt_binsrch():

/*
* At this point we have high == low, but be careful: they could point
* past the last slot on the page.
*
* On a leaf page, we always return the first key >= scan key (resp. >
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
return low;

However, I think it's still a good idea to have a special integer in
the IndexTuple explicitly indicating the attribute at which the
"suffix" is truncated, even if the "suffix truncation" happens at a
consistent point based on an index in your patch. That will change in
the future, and we should be prepared.

Even though I was partially mistaken, clearly it still wasn't okay to
even try to compare non-existent attributes in internal pages, since
that segfaulted. So a (mostly imaginary) "positive infinity" attribute
can still exist, initially just to make _bt_compare() not crash. This
attribute number (stored in "itup->t_tid.ip_posid") effectively tells
the binary search code to look at the child to the left of the
compared downlink (not the downlink child itself), even though that's
already going to happen per the code above. So, thinking about it once
more (uh, sorry), _bt_compare() has to "indicate equality"/return 0,
*despite* being *logically* a "positive infinity" comparison from a
higher level, in order to let the code above to handle it instead, so
it isn't handled more than once. Also, not sure if less common
"nextkey == true" case needs some further consideration (lets forget
that detail for time being, though). Phew!

So, as I said, _bt_binsrch() and/or _bt_compare() can be fixed to make
sure that the scan arrives on the correct leaf page (the first leaf
page that an matching IndexTuple could be on). What then, though? What
about leaf pages, that *do* have the extra attributes ("INCLUDING"
attributes) represented in their tuples, and *don't* "return
OffsetNumberPrev(low)" at the end of _bt_binsrch() (they do the
P_LEAF() thing quoted above)? Are they safe? Remember:

* For nextkey=false (cmpval=1), the loop invariant is: all slots before
* 'low' are < scan key, all slots at or after 'high' are >= scan key.

I think this means that you need to be very careful about leaf pages, too.

Speculative insertion (used by UPSERT) may not even have the extra
attributes, during the precheck that starts from within
check_exclusion_or_unique_constraint() -- it needs to start from the
very start at the leaf level, without regard for the particular
details of the non-constrained extra columns. I see that you take the
number of attributes a new way, so that ultimately _bt_compare()'s
"keysz" argument only includes "full" attributes for cases like the
UPSERT one. That seems okay. However, what about the case where a
tuple is inserted? We need to do unique enforcement on the one hand,
but need to start at the right place for insertion on the other hand.
Is it okay that _bt_findinsertloc() is passed "indnkeyatts" rather
than "natts" now? Now, it looks okay that _bt_check_unique() has also
been directly fixed to do the same, but I don't think that
_bt_findinsertloc() should have received this treatment. Otherwise,
the ordering on leaf pages is subtly broken, which I don't think is
okay.

I don't see tests for NULL values, which are a little special. Does
_bt_isequal() really not require additional checks? It looks correct
at a quick glance, but you should definitely have tests for this. Lots
of them.

Testing
-------

Maybe you should add this to your tools used for testing, just
customized a bit to use your new feature:
https://github.com/petergeoghegan/jjanes_upsert . This stress-testing
tool could be very interesting -- it was extremely valuable during the
development of UPSERT (code is a bit rough, though -- I don't really
know Perl). It should really stress the implementation, and show bugs.

I suggest hacking amcheck to only check the leaf level, and make sure
it alone is sane/in full order, correcting the patch as needed. This
is a small step, and so possibly a good next step. Once we know the
keyspace is at least sane at the leaf level (which I think it isn't
right now, due to the _bt_findinsertloc() issue), we can build on it.
After all, the keyspace at the leaf level should be the same with or
without this patch -- right? We can fix that *in isolation*, gaining
confidence with amcheck (hacked to just care about leaf pages), which
is relatively straightforward. Afterwards, we move on to the more
complicated matter of internal pages.

Also, I attach a file with a selection of SQL queries that use
pageinspect. I wrote these to get a quick sense of the keyspace of a
B-Tree index, and a few other interesting things -- seeing how the
keyspace looks by looking at internal page highkeys (the last query)
is probably particularly valuable for testing a patch like this, or a
more general suffix truncation patch. You might have written queries
like this yourself already, and if not you can start with these. I was
actually thinking of putting them on Github or something myself, but
they're a bit rough at the moment. Getting a "quick sense of the
keyspace" with the last query lets you see when it has suddenly become
unbalanced during development -- it's a good thing to keep an eye on.

That's all I have for today. I still haven't actually built the patch
myself -- this feedback is all based on reading the code. So hope I
missed nothing obvious due to that.

--
Peter Geoghegan

Attachment Content-Type Size
btree-pageinspect-queries.sql application/sql 3.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2016-03-16 02:00:01 Re: pam auth - add rhost item
Previous Message David Rowley 2016-03-16 01:23:53 Re: Parallel Aggregate