Re: WIP: Covering + unique indexes.

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: David Steele <david(at)pgmasters(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2016-04-04 22:48:45
Message-ID: CAM3SWZTfBosvae8Gh6AESRVKT3Fad7XcDkhZSbEqazZPE81rwQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> Thanks to David,
> I've missed these letters at first.
> I'll answer here.

Sorry about using the wrong thread.

> I agree that _bt_insertonpg() is not right for truncation.

Cool.

> It's a bit more complicated to add it into index creation algorithm.
> There's a trick with a "high key".
> /*
> * We copy the last item on the page into the new page, and then
> * rearrange the old page so that the 'last item' becomes its high
> key
> * rather than a true data item. There had better be at least two
> * items on the page already, else the page would be empty of useful
> * data.
> */
> /*
> * Move 'last' into the high key position on opage
> */
>
> To be consistent with other steps of algorithm ( all high keys must be
> truncated tuples), I had to update this high key on place:
> delete the old one, and insert truncated high key.

Hmm. But the high key comparing equal to the Scankey gives insertion
the choice of where to put its IndexTuple (it can go on the page with
the high key, or its right-sibling, according only to considerations
about fillfactor, etc). Is this changed? Does it not matter? Why not?
Is it just worth it?

The right-most page on every level has no high-key. But you say those
pages have an "imaginary" *positive* infinity high key, just as
internal pages have (non-imaginary) minus infinity downlinks as their
first item/downlink. So tuples in a (say) leaf page are always bound
by the downlink lower bound in parent, while their own high key is an
upper bound. Either (and, rarely, both) could be (positive or
negative) infinity.

Maybe you now see why I talked about special _bt_compare() logic for
this. I proposed special logic that is similar to the existing minus
infinity thing _bt_compare() does (although _bt_binsrch(), an
important caller of _bt_compare() also does special things for
internal .vs leaf case, so I'm not sure any new special logic must go
in _bt_compare()).

> It is a very important issue. But I don't think it's a bug there.
> I've read amcheck sources thoroughly and found that the problem appears at
> "invariant_key_less_than_equal_nontarget_offset()

> It uses scankey, made with _bt_mkscankey() which uses only key attributes,
> but calls _bt_compare with wrong keysz.
> If we wiil use nkeyatts = state->rel->rd_index->relnatts; instead of natts,
> all the checks would be passed successfully.

I probably shouldn't have brought amcheck into that particular
discussion. I thought amcheck might be a useful way to frame the
discussion, because amcheck always cares about specific invariants,
and notes a few special cases.

> In my view, it's the correct way to fix this problem, because the caller is
> responsible for passing proper arguments to the function.
> Of course I will add a check into bt_compare, but I'd rather make it an
> assertion (see the patch attached).

I see what you mean, but I think we need to decide what to do about
the key space when leaf high keys are truncated. I do think that
truncating the high key was the right idea, though, and it nicely
illustrates that nothing special should happen in upper levels. Suffix
truncation should only happen when leaf pages are split, generally
speaking.

As I said, the high key is very similar to the downlinks, in that both
bound the things that go on each page. Together with downlinks they
represent *discrete* ranges *unambiguously*, so INCLUDING truncation
needs to make it clear which page new items go on. As I said,
_bt_binsrch() already takes special actions for internal pages, making
sure to return the item that is < the scankey, not <= the scankey
which is only allowed for leaf pages. (See README, from "Lehman and
Yao assume that the key range for a subtree S is described by Ki < v
<= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page...").

To give a specific example, I worry about the case where two sibling
downlinks in a parent page are distinct, but per specific-to-Postgres
"Ki <= v <= Ki+1" thing (which differs from the classic L&Y
invariant), some tuples with all right downlink's attributes matching
end up in left child page, not right child page. I worry that since
_bt_findsplitloc() doesn't consider this (for example), the split
point doesn't *reliably* and unambiguously divide the key space
between the new halves of a page being split. I think the "Ki <= v <=
Ki+1"/_bt_binsrch() thing might save you in common cases where all
downlink attributes are distinct, so maybe that simpler case is okay.
But to be even more specific, what about the more complicated case
where the downlinks *are* fully _bt_compare()-wise equal? This could
happen even though they're constrained to be unique in leaf pages, due
to bloat. Unique indexes aren't special here; they just make it far
less likely that this would happen in practice, because it takes a lot
of bloat. Less importantly, when that bloat happens, you don't want to
have to do a linear scan through many leaf pages (that should only
happen when there are many fully matching IndexTuples at the leaf
level -- not just matching on constrained attributes).

The more I think about it, the more I doubt that it's okay to not
ensure downlinks are always distinct with their siblings, by sometimes
including non-constrained (truncatable) attributes within internal
pages, as needed to *distinguish* downlinks (also, we must
occasionally have *all* attributes including truncatable attributes in
internal pages -- we must truncate nothing to keep the key space sane
in the parent). Unfortunately, these requirements are very close to
the actual full requirements for a full, complete suffix truncation
patch, including storing how many attributes are stored in each and
every internal IndexTuple (no general thing for the index), page split
code to determine where to truncate to make adjacent downlinks
distinct, etc.

You may think: But that fully-matching-downlink case is okay, because
it only makes us do more linear scanning due to the lack of
non-truncatable attributes, which is still correct, if a little more
slow when there is bloat -- at the leaf level, we'll start at the
correct place (the first place the item could be on), per the "Ki <= v
<= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We
need to be able to reliably detect a concurrent page-split. Otherwise,
we'll move right within _bt_search() before even considering if
anything of interest for our index scan *might* be on the initial page
found from downlink (before even calling _bt_binsrch()). Even this bug
wouldn't happen in the common case where nextkey = true, but what
about when nextkey = false (e.g. for backwards scans)? We'd skip stuff
we are not supposed to by spuriously moving right, I think. I have a
bad feeling that even then we'd "accidentally fail to fail", because
of how backwards scans work at a higher level, but it's just too hard
to prove that that is correct. It's just too complicated to rely on so
much from a great distance.

This might not be the simplest example of where we could run into
trouble, but it's one example that I could see. The assumption that
downlinks and highkeys discretely separate ranges in the key space is
probably made many times. There could be more problematic spots, and
it's really hard to know where they might be. :-(

In general, it's common for any modification to the B-Tree code to
only break in a very subtle way, like this. I would be more
comfortable if I knew the patch received extensive stress-testing,
probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But
generally, I believe we should not allow the key space to fail to be
separated fully by downlinks and high keys, even if our original "Ki
<= v <= Ki+1" changes to the L&Y algorithm to make duplicates work
happens to mask the problems in simple testing. It's too different to
what we have today.

> I'll add a flag to distinguish regular and truncated tuples, but it will not
> be used in this patch. Please, comment, if I've missed something.
> As you've already mentioned, neither high keys, nor tuples on internal pages
> are using "itup->t_tid.ip_posid", so I'll take one bit of it.
>
> It will definitely require changes in the future works on suffix truncation
> or something like that, but IMHO for now it's enough.

I think that we need to discuss whether or not it's okay that we can
have that fully-matching-downlink case before we can be sure either
way.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-04-04 23:59:54 oversight in parallel aggregate
Previous Message Tom Lane 2016-04-04 22:31:20 Re: More stable query plans via more predictable column statistics