Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-08 10:12:18
Message-ID: 7061281c-74a9-558b-2b4c-8aeb26b9b894@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/03/2019 12:22, Peter Geoghegan wrote:
> I would like to work through these other items with you
> (_bt_binsrch_insert() and so on), but I think that it would be helpful
> if you made an effort to understand the minusinfkey stuff first. I
> spent a lot of time improving the explanation of that within
> _bt_compare(). It's important.

Ok, after thinking about it for a while, I think I understand the minus
infinity stuff now. Let me try to explain it in my own words:

Imagine that you have an index with two key columns, A and B. The index
has two leaf pages, with the following items:

+--------+ +--------+
| Page 1 | | Page 2 |
| | | |
| 1 1 | | 2 1 |
| 1 2 | | 2 2 |
| 1 3 | | 2 3 |
| 1 4 | | 2 4 |
| 1 5 | | 2 5 |
+--------+ +--------+

The key space is neatly split on the first key column - probably thanks
to the new magic in the page split code.

Now, what do we have as the high key of page 1? Answer: "2 -inf". The
"-inf" is not stored in the key itself, the second key column is just
omitted, and the search code knows to treat it implicitly as a value
that's lower than any real value. Hence, "minus infinity".

However, during page deletion, we need to perform a search to find the
downlink pointing to a leaf page. We do that by using the leaf page's
high key as the search key. But the search needs to treat it slightly
differently in that case. Normally, searching with a single key value,
"2", we would land on page 2, because any real value beginning with "2"
would be on that page, but in the page deletion case, we want to find
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
to do that.

Question: Wouldn't it be more straightforward to use "1 +inf" as page
1's high key? I.e treat any missing columns as positive infinity. That
way, the search for page deletion wouldn't need to be treated
differently. That's also how this used to work: all tuples on a page
used to be <= high key, not strictly < high key. And it would also make
the rightmost page less of a special case: you could pretend the
rightmost page has a pivot tuple with all columns truncated away, which
means positive infinity.

You have this comment _bt_split which touches the subject:

> /*
> * The "high key" for the new left page will be the first key that's going
> * to go into the new right page, or possibly a truncated version if this
> * is a leaf page split. This might be either the existing data item at
> * position firstright, or the incoming tuple.
> *
> * The high key for the left page is formed using the first item on the
> * right page, which may seem to be contrary to Lehman & Yao's approach of
> * using the left page's last item as its new high key when splitting on
> * the leaf level. It isn't, though: suffix truncation will leave the
> * left page's high key fully equal to the last item on the left page when
> * two tuples with equal key values (excluding heap TID) enclose the split
> * point. It isn't actually necessary for a new leaf high key to be equal
> * to the last item on the left for the L&Y "subtree" invariant to hold.
> * It's sufficient to make sure that the new leaf high key is strictly
> * less than the first item on the right leaf page, and greater than or
> * equal to (not necessarily equal to) the last item on the left leaf
> * page.
> *
> * In other words, when suffix truncation isn't possible, L&Y's exact
> * approach to leaf splits is taken. (Actually, even that is slightly
> * inaccurate. A tuple with all the keys from firstright but the heap TID
> * from lastleft will be used as the new high key, since the last left
> * tuple could be physically larger despite being opclass-equal in respect
> * of all attributes prior to the heap TID attribute.)
> */

But it doesn't explain why it's done like that.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shaoqi Bai 2019-03-08 10:14:41 Add tablespace tap test to pg_rewind
Previous Message Peter Eisentraut 2019-03-08 10:09:44 Re: insensitive collations