Re: Index Skip Scan (new UniqueKeys)

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Floris Van Nee <florisvannee(at)optiver(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: Index Skip Scan (new UniqueKeys)
Date: 2020-07-27 10:24:31
Message-ID: 20200727102431.mjspwec4yhzeyny2@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
>
> > Well, it's obviously wrong, thanks for noticing. What is necessary is to
> > compare two index tuples, the start and the next one, to test if they're
> > the same (in which case if I'm not mistaken probably we can compare item
> > pointers). I've got this question when I was about to post a new version
> > with changes to address feedback from Andy, now I'll combine them and
> > send a cumulative patch.
>
> This sounds like approximately the same problem as the one that
> _bt_killitems() has to deal with as of Postgres 13. This is handled in
> a way that is admittedly pretty tricky, even though the code does not
> need to be 100% certain that it's "the same" tuple. Deduplication kind
> of makes that a fuzzy concept. In principle there could be one big
> index tuple instead of 5 tuples, even though the logical contents of
> the page have not been changed between the time we recording heap TIDs
> in local and the time _bt_killitems() tried to match on those heap
> TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
> deduplication pass could do that. Your code needs to be prepared for
> stuff like that.
>
> Ultimately posting list tuples are just a matter of understanding the
> on-disk representation -- a "Small Matter of Programming". Even
> without deduplication there are potential hazards from the physical
> deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
> not code that runs in VACUUM, despite the name). Make sure that you
> hold a buffer pin on the leaf page throughout, because you need to do
> that to make sure that VACUUM cannot concurrently recycle heap TIDs.
> If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
> subtly broken. _bt_killitems() is safe because it either holds on to a
> pin or gives up when the LSN changes at all. (ISTM that your only
> choice is to hold on to a leaf page pin, since you cannot just decide
> to give up in the way that _bt_killitems() sometimes can.)

I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-07-27 10:39:02 Re: shared tempfile was not removed on statement_timeout
Previous Message Hamid Akhtar 2020-07-27 08:53:28 Re: Should we remove a fallback promotion? take 2