From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Returning nbtree posting list TIDs in DESC order during backwards scans |
Date: | 2025-06-16 23:47:01 |
Message-ID: | CAH2-WznLN7P0i2-YEnv3QGmeA5AMjdcjkraO_nz3H2Va1V1WOA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Making this change in _bt_readpage creates a problem in _bt_killitems:
> it currently expects posting list TIDs in ascending order (the loop
> invariant relies on that), which is why the deduplication patch didn't
> just do things like this in _bt_readpage to begin with. The patch
> deals with that new problem by qsort'ing the so->killedItems[] array
> (at the top of _bt_killitems) such that the items always appear
> exactly once, in the expected ASC order.
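(For reference, the superseded v1 approach amounted to roughly the following at the top of _bt_killitems; this is a loose sketch rather than the literal patch code, with cmp_killed_item standing in for a suitable comparator:)

```c
/*
 * v1 sketch: force so->killedItems[] into ASC order, then discard
 * the duplicate entries that scrollable cursors can leave behind.
 * qunique() is the dedup helper from src/include/lib/qunique.h.
 */
qsort(so->killedItems, so->numKilled, sizeof(int), cmp_killed_item);
so->numKilled = (int) qunique(so->killedItems, so->numKilled,
                              sizeof(int), cmp_killed_item);
```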
Actually, we can just completely get rid of so->killedItems[]. We can
replace it with a new 1-bit itemDead field in the BTScanPosItem struct
(the struct used for so->currPos.items[] entries). That way, the patch
avoids the qsort entirely. The patch doesn't need to change the order
of anything (except, of course, for the order that _bt_readpage
initially saves posting list tuple TIDs in when scanning backwards).
The problem with so->killedItems[] is that it stores things in
whatever order successive btgettuple calls find most convenient in the
kill_prior_tuple path. It turns out that it's even more convenient for
btgettuple if its kill_prior_tuple path simply sets the relevant
(prior/just-returned) item's so->currPos.items[].itemDead field. There
is no need to allocate memory for so->killedItems[], and no need to
worry about running out of space in so->killedItems[] in cases
involving scrollable cursors that scroll back and forth, triggering
kill_prior_tuple for the same TID more than once. We'll simply set
that TID/item's so->currPos.items[].itemDead field a second or a third
time, which can't complicate things, since setting the field is
idempotent under my new approach.
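To illustrate, the kill_prior_tuple path in btgettuple reduces to
something like this (a sketch under the v2 scheme, not the literal
patch code):

```c
/*
 * btgettuple, kill_prior_tuple path (v2 sketch): flag the item that
 * the previous call returned as known-dead.  No palloc of
 * so->killedItems[], no "out of space" check, and re-setting the
 * flag for the same item is idempotent.
 */
if (scan->kill_prior_tuple)
{
    so->currPos.items[so->currPos.itemIndex].itemDead = true;
    so->numKilled++;    /* now just a "there's work to do" hint */
}
```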
Attached is v2 of the patch, which works that way. This seems like a
better direction to take within _bt_killitems.
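With killedItems[] gone, _bt_killitems just walks the saved items
directly. A simplified sketch of the reworked loop (ignoring the
pin/LSN revalidation that the real function also performs):

```c
/*
 * _bt_killitems (v2 sketch): visit so->currPos.items[] in array
 * order, which preserves the ASC ordering that the existing loop
 * invariant expects -- no qsort required.
 */
for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
{
    BTScanPosItem *kitem = &so->currPos.items[i];

    if (!kitem->itemDead)
        continue;           /* never flagged via kill_prior_tuple */

    /* match kitem->heapTid against the page starting at
     * kitem->indexOffset, and set LP_DEAD, as before */
}
```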
In general we're quite sensitive to the size of the BTScanPosItem
struct, since (at least for now) we routinely allocate
MaxTIDsPerBTreePage-many of them (i.e. 1,358 of them). 1,358 * 10
bytes is already too much. But there's no need to make the relevant
allocation even larger: I can steal a bit from
BTScanPosItem.tupleOffset for the new
so->currPos.items[].itemDead/BTScanPosItem.itemDead field. That's
safe, since we only really need 15 bits for each
BTScanPosItem.tupleOffset.
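Concretely, the struct might end up looking like this (a sketch; it
assumes the compiler packs uint16-typed bit-fields, a widespread
extension in the same spirit as the bit-stealing we already do in
ItemIdData):

```c
typedef struct BTScanPosItem    /* what we remember about each match */
{
    ItemPointerData heapTid;    /* TID of referenced heap item */
    OffsetNumber indexOffset;   /* index item's location within page */
    uint16      tupleOffset:15, /* IndexTuple's offset in workspace, if any */
                itemDead:1;     /* flagged by kill_prior_tuple? */
} BTScanPosItem;                /* still 10 bytes */
```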
--
Peter Geoghegan
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch | application/octet-stream | 14.4 KB |