Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mircea Cadariu <cadariu(dot)mircea(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Returning nbtree posting list TIDs in DESC order during backwards scans
Date: 2025-07-27 13:42:53
Message-ID: 87h5yx92ci.fsf@163.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

> On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> The loop has an early check for this (for non-itemDead entries) here:
>>
>> /* Quickly skip over items never ItemDead-set by btgettuple */
>> if (!kitem->itemDead)
>> continue;
>>
>> I really doubt that this matters

> I'll need to think about issues around adding the new kitem->itemDead
> bitfield. I'm not really worried about _bt_killitems; more so about
> the routines called by _bt_readpage, which must make sure that the bit
> is unset every time a TID is saved in so->currPos.items[].

Yes, otherwise the live tuple maybe marked as dead, which is terrible. I
think you have unset the bit on all the needed places, including
_bt_saveitem, _bt_savepostingitem and _bt_setuppostingitems for the
first item in postlist item. I can't find out anything is missed.

(I thought another place for this work might be _bt_returnitem, this
might be a more centralized place to set. Later I think it is totally
wrong because for the queries like SELECT * FROM t WHERE idx_col = 3
LIMIT 3; Some items in so->currPos.items[] were filled in
_bt_readpage but maybe never returned to heap at all, and later
_bt_killitems also run against it on the whole, so unsetting the bit on
_bt_returnitem is too late...)

I think this patch does two things. one is refactoring the data struct
for _bt_killitems, the other one is scaning the postlist in the backward
way for the backward scan. then does the below changes belongs to any of
them? Is it an intentional change?

_bt_killitems:

if (offnum < minoff)
- continue; /* pure paranoia */
+ {
+ /*
+ * Page must have originally been the rightmost page, but has
+ * since split
+ */
+ Assert(!so->dropPin);
+ offnum = minoff;
+ }

At last, I can get the same test result as you. The buffer hits go back to
23 in the test case, thank for working on this!

> I think that this patch isn't too far off being committable.

I think so.

--
Best Regards
Andy Fan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Burd 2025-07-27 14:12:11 Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Previous Message Andrey Borodin 2025-07-27 11:53:45 Re: IPC/MultixactCreation on the Standby server