|From:||Dmitry Dolgov <9erthalion6(at)gmail(dot)com>|
|To:||Floris Van Nee <florisvannee(at)Optiver(dot)com>|
|Cc:||Peter Geoghegan <pg(at)bowt(dot)ie>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, James Coleman <jtc331(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bhushan Uparkar <bhushan(dot)uparkar(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>|
|Subject:||Re: Index Skip Scan|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
> On Wed, Jan 22, 2020 at 10:36:03PM +0100, Dmitry Dolgov wrote:
> > Let me try to clarify. After we return the first tuple, so->currPos.buf is pointing to page=1 in my example (it's the only page after all). We've returned item=8. Then the split happens and the items get rearranged as in my example. We're still pointing with so->currPos.buf to page=1, but the page now contains [2,4]. The split happened to the right, so there's a page=2 with [5,6,8], however the ongoing index scan is unaware of that.
> > Now _bt_skip gets called to fetch the next tuple. It starts by checking _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the result of which will be 'true': we're comparing the skip key to the low key of the page. So it thinks the next key can be found on the current page. It locks the page and does a _binsrch, finding item=4 to be returned.
> > The problem here is that _bt_scankey_within_page mistakenly returns true, thereby limiting the search to just the page that it's pointing to already.
> > It may be fine to just fix this function to return the proper value (I guess it'd also need to look at the high key in this example). It could also be fixed by not looking at the lo/hi key of the page, but to use the local tuple buffer instead. We already did a _read_page once, so if we have any matching tuples on that specific page, we have them locally in the buffer already. That way we never need to lock the same page twice.
> Oh, that's what you mean. Yes, I was somehow tricked by the name of this
> function and didn't notice that it checks only one boundary, so in case
> of backward scan it returns wrong result. I think in the situation
> you've describe it would actually not find any item on the current page
> and restart from the root, but nevertheless we need to check for both
> keys in _bt_scankey_within_page.
Thanks again everyone for commentaries and clarification. Here is the
version, where hopefully I've addressed all the mentioned issues.
As mentioned in the _bt_skip commentaries, before we were moving left to
check the next page to avoid significant issues in case if ndistinct was
underestimated and we need to skip too often. To make it work safe in
presense of splits we need to remember an original page and move right
again until we find a page with the right link pointing to it. It's not
clear whether it's worth to increase complexity for such sort of "edge
case" with ndistinct estimation while moving left, so at least for now
we ignore this in the implementation and just start from the root
Offset based code in moving forward/reading backward was replaced with
remembering a start index tuple and an attempt to find it on the new
page. Also a missing page lock before _bt_scankey_within_page was added
and _bt_scankey_within_page checks for both page boundaries.
|Next Message||Kyotaro Horiguchi||2020-01-27 10:54:01||Re: [PATCH] /src/backend/access/transam/xlog.c, tiny improvements|
|Previous Message||Kyotaro Horiguchi||2020-01-27 10:28:31||Re: [HACKERS] WAL logging problem in 9.4.3?|