Re: Index Skip Scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Floris Van Nee <florisvannee(at)optiver(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2020-01-22 18:55:21
Message-ID: CAH2-WzmMEOtW_yXE6SJhbUWwLgBGr699NrkpGpMp8toW_ZgGrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
<florisvannee(at)optiver(dot)com> wrote:
> Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read all matching elements in our local buffer and we'll return those. But the skip scan currently:
> a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> b) finds out that this is actually true
> c) does a search on the page and returns value=4 (while it should have returned value=6)
>
> Peter, is my understanding about the btree internals correct so far?

This is a good summary. This is the kind of scenario I had in mind
when I expressed a general concern about "stopping between pages".
Processing a whole page at a time is a crucial part of how
_bt_readpage() currently deals with concurrent page splits.

Holding a buffer pin on a leaf page is only effective as an interlock
against VACUUM completely removing a tuple, which could matter with
non-MVCC scans.

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.

If skip scan is restricted to heapkeyspace indexes (i.e. those created
on Postgres 12+), then it might be reasonable to save an index tuple,
and relocate it within the same page using a fresh binary search that
uses a scankey derived from the same index tuple -- without unsetting
scantid/the heap TID scankey attribute. I suppose that you'll need to
"find your place again" after releasing the buffer lock on a leaf page
for a time. Also, I think that this will only be safe with MVCC scans,
because otherwise the page could be concurrently deleted by VACUUM.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2020-01-22 19:09:35 Re: Index Skip Scan
Previous Message Robert Haas 2020-01-22 18:53:08 Re: making the backend's json parser work in frontend code