Re: Index Skip Scan

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Floris Van Nee <florisvannee(at)Optiver(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2020-01-22 16:04:41
Message-ID: 20200122160441.ograupmzyytde2mi@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
> Anyone please correct me if I'm wrong, but I think one case where the current patch relies on some data from the page it has locked before it in checking this hi/lo key. I think it's possible for the following sequence to happen. Suppose we have a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in and inserts a tuple (value 5) into this page, but suppose the page happens to be full. So a page split occurs. As far as I know, a page split could happen at any random element in the page. One of the situations we could be left with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1.

In case if we just returned a tuple, the next action would be either
check the next page for another key or search down to the tree. Maybe
I'm missing something in your scenario, but the latter will land us on a
required page (we do not point to any leaf here), and before the former
there is a check for high/low key. Is there anything else missing?

> Now that I look at the patch again, I fear there currently may also be such a dependency in the "Advance forward but read backward"-case. It saves the offset number of a tuple in a variable, then does a _bt_search (releasing the lock and pin on the page). At this point, anything can happen to the tuples on this page - the page may be compacted by vacuum such that the offset number you have in your variable does not match the actual offset number of the tuple on the page anymore. Then, at the check for (nextOffset == startOffset) later, there's a possibility the offsets are different even though they relate to the same tuple.

Interesting point. The original idea here was to check that we're not
returned to the same position after jumping, so maybe instead of offsets
we can check a tuple we found.

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-01-22 16:06:59 Re: [HACKERS] kqueue
Previous Message Sergei Kornilov 2020-01-22 15:58:46 Re: pgsql: walreceiver uses a temporary replication slot by default