Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-08 18:48:25
Message-ID: CAH2-Wzm_Kxm26E_DwK7AR+ZB_-B50OMpGoO=n08tD+qH=MD-zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Now, what do we have as the high key of page 1? Answer: "2 -inf". The
> "-inf" is not stored in the key itself, the second key column is just
> omitted, and the search code knows to treat it implicitly as a value
> that's lower than any real value. Hence, "minus infinity".

Right.

> However, during page deletion, we need to perform a search to find the
> downlink pointing to a leaf page. We do that by using the leaf page's
> high key as the search key. But the search needs to treat it slightly
> differently in that case. Normally, searching with a single key value,
> "2", we would land on page 2, because any real value beginning with "2"
> would be on that page, but in the page deletion case, we want to find
> page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
> to do that.

Right.

> Question: Wouldn't it be more straightforward to use "1 +inf" as page
> 1's high key? I.e treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

* We have always taken the new high key from the firstright item, and
we also continue to do that on internal pages -- same as before. It
would certainly complicate the nbtsplitloc.c code to have to deal with
this new special case now (leaf and internal pages would have to have
far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal
page, which is explicitly truncated to zero attributes as of Postgres
v11. It seems ugly to me to make truncated attributes mean negative
infinity in that context, but positive infinity in every other
context.

* Another reason that I prefer "-inf" to "+inf" is that you can
imagine an implementation that makes pivot tuples into normalized
binary keys, that are truncated using generic/opclass-agnostic logic,
and compared using strcmp(). If the scankey binary string is longer
than the pivot tuple, then it's greater according to strcmp() -- that
just works. And, you can truncate the original binary strings built
using opclass infrastructure without having to understand where
attributes begin and end (though this relies on encoding things like
NULL-ness a certain way). If we define truncation to be "+inf" now,
then none of this works.

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're
not so much searching for "-inf" values as we are searching for the
first page that could have tuples for the untruncated attributes. But
isn't that how this has always worked, given that we've had to deal
with duplicate pivot tuples on the same level before now? As I said,
we're not doing an extra thing when minusinfykey is true (during page
deletion) -- it's the other way around. Saying that we're searching
for minus infinity values for the truncated attributes is kind of a
lie, although the search does behave that way.

>That way, the search for page deletion wouldn't need to be treated
> differently. That's also how this used to work: all tuples on a page
> used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The
alternative that you've described is possible, I think, but the key
space works just the same with either of our approaches. You've merely
thought of an alternative way of generating new high keys that satisfy
the same invariants as my own scheme. Provided the new separator for
high key is >= last item on the left and < first item on the right,
everything works.

As you point out, the original Lehman and Yao rule for leaf pages
(which Postgres kinda followed before) is that the high key is <=
items on the leaf level. But this patch makes nbtree follow that rule
fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only
checks <= on leaf pages. Perhaps it led you to believe that I did
things differently. Actually, this is classic Lehman and Yao. The keys
in internal pages are all "separators" as far as Lehman and Yao are
concerned, so the high key is less of a special case on internal
pages. We check < on internal pages because all separators are
supposed to be unique on a level. But, as I said, we do check <= on
the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if
this is unclear. That shows that internal pages have unique keys -- we
can therefore expect the high key to be < items in internal pages. It
also shows that leaf pages copy the high key from the last item on the
left page -- we can expect the high key to be <= items there. Just
like with the patch, in effect. The comment from _bt_split() that you
quoted explains why what we do is like what Lehman and Yao do when
suffix truncation cannot truncate anything -- the new high key on the
left page comes from the last item on the left page.

> And it would also make
> the rightmost page less of a special case: you could pretend the
> rightmost page has a pivot tuple with all columns truncated away, which
> means positive infinity.

But we do already pretend that. How is that not the case already?

> But it doesn't explain why it's done like that.

It's done this way because that's equivalent to what Lehman and Yao
do, while also avoiding adding the special cases that I mentioned (in
nbtsplitloc.c, and so on).

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-03-08 19:05:50 Re: Performance issue in foreign-key-aware join estimation
Previous Message Pavel Stehule 2019-03-08 18:48:21 Re: PostgreSQL vs SQL/XML Standards