Re: Index Skip Scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Cc: Floris Van Nee <florisvannee(at)optiver(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: Index Skip Scan
Date: 2020-01-20 21:19:30
Message-ID: CAH2-Wzk6hvXvZ+1q4JBRunZkxmJwdXxnNsWuKJBp+F+mJuZYZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 20, 2020 at 11:01 AM Jesper Pedersen
<jesper(dot)pedersen(at)redhat(dot)com> wrote:
> > - nbtsearch.c _bt_skip line 1440
> > if (BTScanPosIsValid(so->currPos) &&
> > _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> >
> > Is it allowed to look at the high key / low key of the page without have a read lock on it?
> >
>
> In case of a split the page will still contain a high key and a low key,
> so this should be ok.

This is definitely not okay.

> > - nbtsearch.c in general
> > Most of the code seems to rely quite heavily on the fact that xs_want_itup forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you considered that compacting of a page may still happen even if you hold the pin? [1] I've been trying to come up with cases in which this may break the patch, but I haven't able to produce such a scenario - so it may be fine.

Try making _bt_findinsertloc() call _bt_vacuum_one_page() whenever the
page is P_HAS_GARBAGE(), regardless of whether or not the page is
about to split. That will still be correct, while having a much better
chance of breaking the patch during stress-testing.

Relying on a buffer pin to prevent the B-Tree structure itself from
changing in any important way seems likely to be broken already. Even
if it isn't, it sounds fragile.

A leaf page doesn't really have anything called a low key. It usually
has a current first "data item"/non-pivot tuple, which is an
inherently unstable thing. Also, it has a very loose relationship with
the high key of the left sibling page, which the the closest thing to
a low key that exists (often they'll have almost the same key values,
but that is not guaranteed at all). While I haven't studied the patch,
the logic within _bt_scankey_within_page() seems fishy to me for that
reason.

> There is a BT_READ lock in place when finding the correct leaf page, or
> searching within the leaf page itself. _bt_vacuum_one_page deletes only
> LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do
> you have some feedback for this ?

It sounds like the design of the patch relies on doing something other
than stopping a scan "between" pages, in the sense that is outlined in
the commit message of commit 09cb5c0e. If so, then that's a serious
flaw in its design.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2020-01-20 21:39:23 Re: Increase psql's password buffer size
Previous Message Christoph Berg 2020-01-20 21:16:37 Re: libxml2 is dropping xml2-config