Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2016-04-05 14:56:54
Message-ID: 5703D236.1090907@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

05.04.2016 01:48, Peter Geoghegan :
> On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova
> <a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>> It's a bit more complicated to add it into index creation algorithm.
>> There's a trick with a "high key".
>> /*
>> * We copy the last item on the page into the new page, and then
>> * rearrange the old page so that the 'last item' becomes its high
>> key
>> * rather than a true data item. There had better be at least two
>> * items on the page already, else the page would be empty of useful
>> * data.
>> */
>> /*
>> * Move 'last' into the high key position on opage
>> */
>>
>> To be consistent with other steps of algorithm ( all high keys must be
>> truncated tuples), I had to update this high key on place:
>> delete the old one, and insert truncated high key.
> Hmm. But the high key comparing equal to the Scankey gives insertion
> the choice of where to put its IndexTuple (it can go on the page with
> the high key, or its right-sibling, according only to considerations
> about fillfactor, etc). Is this changed? Does it not matter? Why not?
> Is it just worth it?

I would say, this is changed, but it doesn't matter.
Performing any search in btree (including choosing suitable page for
insertion), we use only key attributes.
We assume that included columns are stored in index unordered.
Simple example.
create table tbl(id int, data int);
create index idx on tbl (id) including (data);

Select query does not consider included columns in scan key.
It selects all tuples satisfying the condition on key column. And only
after that it applies filter to remove wrong rows from the result.
If key attribute doesn't satisfy query condition, there are no more
tuples to return and we can interrupt scan.

You can find more explanations in the attached sql script,
that contains queries to recieve detailed information about index
structure using pageinspect.

> The right-most page on every level has no high-key. But you say those
> pages have an "imaginary" *positive* infinity high key, just as
> internal pages have (non-imaginary) minus infinity downlinks as their
> first item/downlink. So tuples in a (say) leaf page are always bound
> by the downlink lower bound in parent, while their own high key is an
> upper bound. Either (and, rarely, both) could be (positive or
> negative) infinity.
>
> Maybe you now see why I talked about special _bt_compare() logic for
> this. I proposed special logic that is similar to the existing minus
> infinity thing _bt_compare() does (although _bt_binsrch(), an
> important caller of _bt_compare() also does special things for
> internal .vs leaf case, so I'm not sure any new special logic must go
> in _bt_compare()).
>
>> In my view, it's the correct way to fix this problem, because the caller is
>> responsible for passing proper arguments to the function.
>> Of course I will add a check into bt_compare, but I'd rather make it an
>> assertion (see the patch attached).
> I see what you mean, but I think we need to decide what to do about
> the key space when leaf high keys are truncated. I do think that
> truncating the high key was the right idea, though, and it nicely
> illustrates that nothing special should happen in upper levels. Suffix
> truncation should only happen when leaf pages are split, generally
> speaking.
> As I said, the high key is very similar to the downlinks, in that both
> bound the things that go on each page. Together with downlinks they
> represent *discrete* ranges *unambiguously*, so INCLUDING truncation
> needs to make it clear which page new items go on. As I said,
> _bt_binsrch() already takes special actions for internal pages, making
> sure to return the item that is < the scankey, not <= the scankey
> which is only allowed for leaf pages. (See README, from "Lehman and
> Yao assume that the key range for a subtree S is described by Ki < v
> <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
> page...").
>
> To give a specific example, I worry about the case where two sibling
> downlinks in a parent page are distinct, but per specific-to-Postgres
> "Ki <= v <= Ki+1" thing (which differs from the classic L&Y
> invariant), some tuples with all right downlink's attributes matching
> end up in left child page, not right child page. I worry that since
> _bt_findsplitloc() doesn't consider this (for example), the split
> point doesn't *reliably* and unambiguously divide the key space
> between the new halves of a page being split. I think the "Ki <= v <=
> Ki+1"/_bt_binsrch() thing might save you in common cases where all
> downlink attributes are distinct, so maybe that simpler case is okay.
> But to be even more specific, what about the more complicated case
> where the downlinks *are* fully _bt_compare()-wise equal? This could
> happen even though they're constrained to be unique in leaf pages, due
> to bloat. Unique indexes aren't special here; they just make it far
> less likely that this would happen in practice, because it takes a lot
> of bloat. Less importantly, when that bloat happens, you don't want to
> have to do a linear scan through many leaf pages (that should only
> happen when there are many fully matching IndexTuples at the leaf
> level -- not just matching on constrained attributes).

"just matching on constrained attributes" is the core idea of the whole
patch. Included columns just provide us possibility to use index-only
scan. Nothing more. We assume use case where index-only-scan is faster
than index-scan + heap fetch. For example, in queries like "select data
from tbl where id = 1;" we have no scan condition on data. Maybe you
afraid of long linear scan when we have enormous index bloat even on
unique index. It will happen anyway, whether we have index-only scan on
covering index or index-scan on unique index + heap fetch. The only
difference is that the covering index is faster.

At the very beginning of the proposal discussion, I suggested to
implement third kind of columns, which are not constrained, but used in
scankey.
They must have op class to do it, and they are not truncated. But it was
decided to abandon this feature.

> The more I think about it, the more I doubt that it's okay to not
> ensure downlinks are always distinct with their siblings, by sometimes
> including non-constrained (truncatable) attributes within internal
> pages, as needed to *distinguish* downlinks (also, we must
> occasionally have *all* attributes including truncatable attributes in
> internal pages -- we must truncate nothing to keep the key space sane
> in the parent). Unfortunately, these requirements are very close to
> the actual full requirements for a full, complete suffix truncation
> patch, including storing how many attributes are stored in each and
> every internal IndexTuple (no general thing for the index), page split
> code to determine where to truncate to make adjacent downlinks
> distinct, etc.
>
> You may think: But that fully-matching-downlink case is okay, because
> it only makes us do more linear scanning due to the lack of
> non-truncatable attributes, which is still correct, if a little more
> slow when there is bloat -- at the leaf level, we'll start at the
> correct place (the first place the item could be on), per the "Ki <= v
> <= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We
> need to be able to reliably detect a concurrent page-split. Otherwise,
> we'll move right within _bt_search() before even considering if
> anything of interest for our index scan *might* be on the initial page
> found from downlink (before even calling _bt_binsrch()). Even this bug
> wouldn't happen in the common case where nextkey = true, but what
> about when nextkey = false (e.g. for backwards scans)? We'd skip stuff
> we are not supposed to by spuriously moving right, I think. I have a
> bad feeling that even then we'd "accidentally fail to fail", because
> of how backwards scans work at a higher level, but it's just too hard
> to prove that that is correct. It's just too complicated to rely on so
> much from a great distance.
>
> This might not be the simplest example of where we could run into
> trouble, but it's one example that I could see. The assumption that
> downlinks and highkeys discretely separate ranges in the key space is
> probably made many times. There could be more problematic spots, and
> it's really hard to know where they might be. :-(
>
> In general, it's common for any modification to the B-Tree code to
> only break in a very subtle way, like this. I would be more
> comfortable if I knew the patch received extensive stress-testing,
> probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But
> generally, I believe we should not allow the key space to fail to be
> separated fully by downlinks and high keys, even if our original "Ki
> <= v <= Ki+1" changes to the L&Y algorithm to make duplicates work
> happens to mask the problems in simple testing. It's too different to
> what we have today.

Frankly, I still do not understand what you're worried about.
If high key is greater than the scan key, we definitely cannot find any
more tuples, because key attributes are ordered.
If high key is equal to the scan key, we will continue searching and
read next page.
The code is not changed here, it is the same as processing of duplicates
spreading over several pages. If you do not trust postgresql btree
changes to the L&Y to make duplicates work, I don't know what to say,
but it's definitely not related to my patch.

Of course I do not mind if someone will do more testing.
I did some tests and didn't find anything special. Besides, don't we
have special alpha and beta release stages to find tricky bugs?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
test_including.sql application/sql 1.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2016-04-05 15:22:52 Re: Support for N synchronous standby servers - take 2
Previous Message Robert Haas 2016-04-05 14:47:45 Re: Support for N synchronous standby servers - take 2