Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, Mircea Cadariu <cadariu(dot)mircea(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Returning nbtree posting list TIDs in DESC order during backwards scans
Date: 2025-12-03 17:31:24
Message-ID: CAH2-WzmRsiCOZaL9HzEvtr_yNHjX0GsujLgz=9599sSjNVkR8Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 3, 2025 at 7:32 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> The old code only sets so->numKilled to 0 and reuse memory of so->killedItems[], now the new code always bms_free(so->killedItems) and re-alloc memory when next time adding a member to bms.
>
> I am think that, if there were bms_clear(), then we could have just cleared the bitmap and reused the memory next time. But unfortunately, there is no such a bms_clear() now. What do you think to add bms_clear() and use it here? I don’t want to do that, I can try that once you push this patch.

I don't think that it's worth the complexity. We can rely on palloc()
to recycle the memory that was freed by the most recent bms_clear().

I say this in part because the complexity of reusing the same
Bitmapset would be rather high. The idea that the only valid
representation of an empty set is a NULL pointer is baked into the
Bitmapset API.

Note also that we'll use much less memory for killedItems by
representing it as a Bitmapset. We'll use at most one bit per
so->currPos.items[] item, whereas before we used 4 bytes per item.

> I wonder if the comment “must be at lease one” should apply to the assignment of tupleOffset? The “for” loop starts from nitems-2, will it still must be at lease one item?

By definition, a posting list tuple has at least 2 TIDs -- that's a
posting list invariant. If there was only 1 TID, then it wouldn't be a
posting list in the first place. (Unlike within GIN, where single TID
posting lists are possible.)

> /*
> - * Don't bother advancing the outermost loop's int iterator to
> - * avoid processing killed items that relate to the same
> - * offnum/posting list tuple. This micro-optimization hardly
> - * seems worth it. (Further iterations of the outermost loop
> - * will fail to match on this same posting list's first heap
> - * TID instead, so we'll advance to the next offnum/index
> - * tuple pretty quickly.)
> + * Don't advance itemIndex for outermost loop, no matter how
> + * nextIndex was advanced. It's possible that items whose
> + * TIDs weren't matched in posting list can still be killed
> + * (there might be a later tuple whose TID is a match).
> */
> if (j == nposting)
> killtuple = true;
> ```
>
> I really don’t get what "Don't bother advancing the outermost loop's int iterator” means? Here we only set killtuple to true, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop, in that case, outer while loop "while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)” will advance itemIndex.

The old comment should probably have been written more like the new
one (that I propose to replace it with). It's saying "don't try to be
clever by remembering that we already determined that all the TIDs
that we tried to compare to this posting list aren't matches for *any*
TID". But I don't think that that's accurate; we *haven't* determined
that those TIDs aren't matches for *any and all* TIDs on the page
(only for those now in the posting list specifically). We might still
be able to match those TIDs to later tuples, immediately after the
posting list.

Note that this is all a bit academic and theoretical; in practice it
rarely comes up. During so->dropPin scans (which is the common case),
we'll give up/get scared at the start of _bt_killitems if the page has
changed at all since it was initially examined within _bt_readpage
anyway -- no matter how the page was modified. It doesn't matter that
the page probably *won't* have been modified by VACUUM when
_bt_kilitems gets scared of modifying the page like this (VACUUM is
the only thing that truly makes it unsafe for _bt_killitems to run,
but _bt_killitems is conservative/easily scared).

So while it is true that "We might still be able to match those TIDs
to later tuples, immediately after the posting list" (as I said in the
paragraph before the previous paragraph), we can only do so during
!so->dropPin scans. In other words, only during index-only scans, or
scans of an unlogged relation, or when we don't use an MVCC snapshot.
All of which are rare. Which makes all this fairly
academic/theoretical (mostly it's historical, 10+ years ago *all*
scans were !so->dropPin scans).

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2025-12-03 17:49:26 Re: IPC/MultixactCreation on the Standby server
Previous Message Álvaro Herrera 2025-12-03 17:25:27 Re: pgindent versus struct members and typedefs