Re: Index Skip Scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2020-01-23 00:30:10
Message-ID: CAH2-Wz=4UfUU6hTFmb8HqotvCszdFzNCFMkx4=38JUJyLs-mqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
> <florisvannee(at)optiver(dot)com> wrote:
> > As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently:
> > a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a
"low key". Let us once again start with a very simple one leaf-page
btree containing four elements: leaf page 1 = [2,4,6,8] -- just like
in Floris' original page split scenario.

Let us also say that page 1 has a left sibling page -- page 0. Page 0
happens to have a high key with the integer value 0. So you could
*kind of* claim that the "low key" of page 1 is the integer value 0
(page 1 values must be > 0) -- *not* the integer value 2 (the
so-called "low key" here is neither > 2, nor >= 2). More formally, an
invariant exists that says that all values on page 1 must be
*strictly* greater than the integer value 0. However, this formal
invariant thing is hard or impossible to rely on when we actually
reach page 1 and want to know about its lower bound -- since there is
no "low key" pivot tuple on page 1 (we can only speak of a "low key"
as an abstract concept, or something that works transitively from the
parent -- there is only a physical high key pivot tuple on page 1
itself).

Suppose further that Page 0 is now empty, apart from its "all values
on page are <= 0" high key (page 0 must have had a few negative
integer values in its tuples at some point, but not anymore). VACUUM
will delete the page, *changing the effective low key* of Page 0 in
the process. The lower bound from the shared parent page will move
lower/left as a consequence of the deletion of page 0. nbtree page
deletion makes the "keyspace move right, not left". So the "conceptual
low key" of page 1 just went down from 0 to -5 (say), without there
being any practical way of a skip scan reading page 1 noticing the
change (the left sibling of page 0, page -1, has a high key of <= -5,
say).

Not only is it possible for somebody to insert the value 1 in page 1
-- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed
to page 0 is still there -- all that page deletion changed about this
tuple is its downlink, which now points to page 1 instead or page 0.
Confusingly, page deletion removes the pivot tuple of the right
sibling page from the parent -- *not* the pivot tuple of the empty
page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in
truncated pivot tuples, and the heap TID tiebreaker column -- in
reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has
a huge comment that reasons about a race involving page deletion. In
general, page deletion is by far the biggest source of complexity when
reasoning about the key space.
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2020-01-23 02:41:37 Re: Improve search for missing parent downlinks in amcheck
Previous Message Andrew Gierth 2020-01-22 23:35:13 Re: A rather hackish POC for alternative implementation of WITH TIES