From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Returning nbtree posting list TIDs in DESC order during backwards scans |
Date: | 2025-06-16 23:47:01 |
Message-ID: | CAH2-WznLN7P0i2-YEnv3QGmeA5AMjdcjkraO_nz3H2Va1V1WOA@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Making this change in _bt_readpage creates a problem in _bt_killitems:
> it currently expects posting list TIDs in ascending order (the loop
> invariant relies on that), which is why the deduplication patch didn't
> just do things like this in _bt_readpage to begin with. The patch
> deals with that new problem by qsort'ing the so->killedItems[] array
> (at the top of _bt_killitems) such that the items always appear
> exactly once, in the expected ASC order.
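(For reference, the superseded v1 approach amounted to roughly the following at the top of _bt_killitems; this is a loose sketch rather than the literal patch code, with cmp_killed_item standing in for a suitable comparator:)

```c
/*
 * v1 sketch: force so->killedItems[] into ASC order, then discard
 * the duplicate entries that scrollable cursors can leave behind.
 * qunique() is the dedup helper from src/include/lib/qunique.h.
 */
qsort(so->killedItems, so->numKilled, sizeof(int), cmp_killed_item);
so->numKilled = (int) qunique(so->killedItems, so->numKilled,
                              sizeof(int), cmp_killed_item);
```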
Actually, we can just completely get rid of so->killedItems[]. We can
replace it with a new 1-bit itemDead field in the BTScanPosItem struct
(the struct used for so->currPos.items[] entries). That way, the patch
avoids the qsort entirely. The patch doesn't need to change the order
of anything (except, of course, for the order that _bt_readpage
initially saves posting list tuple TIDs in when scanning backwards).
The problem with so->killedItems[] is that it stores things in
whatever order successive btgettuple calls find most convenient in the
kill_prior_tuple path. It turns out that it's even more convenient for
btgettuple if its kill_prior_tuple path simply sets the relevant
(prior/just-returned) item's so->currPos.items[].itemDead field. There
is no need to allocate memory for so->killedItems[], and no need to
worry about running out of space in so->killedItems[] in cases
involving scrollable cursors that scroll back and forth, triggering
kill_prior_tuple for the same TID more than once. We'll simply set
that TID/item's so->currPos.items[].itemDead field a second or a third
time, which can't complicate things, since setting the field is
idempotent under my new approach.
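To illustrate, the kill_prior_tuple path in btgettuple reduces to
something like this (a sketch under the v2 scheme, not the literal
patch code):

```c
/*
 * btgettuple, kill_prior_tuple path (v2 sketch): flag the item that
 * the previous call returned as known-dead.  No palloc of
 * so->killedItems[], no "out of space" check, and re-setting the
 * flag for the same item is idempotent.
 */
if (scan->kill_prior_tuple)
{
    so->currPos.items[so->currPos.itemIndex].itemDead = true;
    so->numKilled++;    /* now just a "there's work to do" hint */
}
```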
Attached is v2 of the patch, which works that way. This seems like a
better direction to take within _bt_killitems.
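With killedItems[] gone, _bt_killitems just walks the saved items
directly. A simplified sketch of the reworked loop (ignoring the
pin/LSN revalidation that the real function also performs):

```c
/*
 * _bt_killitems (v2 sketch): visit so->currPos.items[] in array
 * order, which preserves the ASC ordering that the existing loop
 * invariant expects -- no qsort required.
 */
for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
{
    BTScanPosItem *kitem = &so->currPos.items[i];

    if (!kitem->itemDead)
        continue;           /* never flagged via kill_prior_tuple */

    /* match kitem->heapTid against the page starting at
     * kitem->indexOffset, and set LP_DEAD, as before */
}
```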
In general we're quite sensitive to the size of the BTScanPosItem
struct, since (at least for now) we routinely allocate
MaxTIDsPerBTreePage-many of them (i.e. 1,358 of them). 1,358 * 10
bytes is already too much. But there's no need to make the relevant
allocation even larger: I can steal a bit from
BTScanPosItem.tupleOffset for the new
so->currPos.items[].itemDead/BTScanPosItem.itemDead field. That's
safe, since we only really need 15 bits for each
BTScanPosItem.tupleOffset.
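Concretely, the struct might end up looking like this (a sketch; it
assumes the compiler packs uint16-typed bit-fields, a widespread
extension in the same spirit as the bit-stealing we already do in
ItemIdData):

```c
typedef struct BTScanPosItem    /* what we remember about each match */
{
    ItemPointerData heapTid;    /* TID of referenced heap item */
    OffsetNumber indexOffset;   /* index item's location within page */
    uint16      tupleOffset:15, /* IndexTuple's offset in workspace, if any */
                itemDead:1;     /* flagged by kill_prior_tuple? */
} BTScanPosItem;                /* still 10 bytes */
```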
--
Peter Geoghegan
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch | application/octet-stream | 14.4 KB |