Re: Optimize single tuple fetch from nbtree index

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Floris Van Nee <florisvannee(at)Optiver(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimize single tuple fetch from nbtree index
Date: 2019-08-23 17:14:26
Message-ID: 4b660433-d74e-2075-28d9-907e09df9b9a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

05.08.2019 14:34, Floris Van Nee wrote:
> I have one further question about these index offsets. There are several comments in master that indicate that it's impossible that an item moves 'left' on a page, if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:
>
> * Note that if we hold a pin on the target page continuously from initially
> * reading the items until applying this function, VACUUM cannot have deleted
> * any items from the page, and so there is no need to search left from the
> * recorded offset. (This observation also guarantees that the item is still
> * the right one to delete, which might otherwise be questionable since heap
> * TIDs can get recycled.) This holds true even if the page has been modified
> * by inserts and page splits, so there is no need to consult the LSN.
>
> Still, exactly this case happens in practice. In my tests I was able to get behavior like:
> 1) pin + lock a page in _bt_first
> 2) read a tuple, record indexOffset (for example offset=100) and heap tid
> 3) unlock page, but*keep* the pin (end of _bt_first of my patch)
> 4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
> 5) look inside the current page for the heap Tid that we registered earlier
> 6) we find that we can now find this tuple at indexOffset=98, eg. it moved left. This should not be possible.
> This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset.
>
> However, this is in contradiction with the comments (and code) of _bt_killitems.
> Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin?

Hello,
welcome to hackers with your first patch)

As far as I understood from the thread above, the design of this
optimization is under discussion, so I didn't review the proposed patch
itself.
Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.

I didn't dig into the code, but this line looks suspicious (see my
findings about BTScanPosIsPinned below):

        /* bump pin on current buffer for assignment to mark buffer */
        if (BTScanPosIsPinned(so->currPos))
            IncrBufferRefCount(so->currPos.buf);

While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().

I propose that we update BTScanPosIsPinned macro. Or, at least write a
comment, why its current behavior is fine.
There are a few existing callers, that are definitely expecting that
this macro checks a pin, which it doesn't do.
I don't quite understand if that already causes any subtle bug, or the
current algorithm is fine.

Peter, Tom, what do you think?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2019-08-23 17:21:44 Re: pg_checksums --help synopsis is quite long
Previous Message Merlin Moncure 2019-08-23 17:03:50 Re: Hstore OID bigger than an integer