Re: Add a berief general comment on BTScanInsertData's nextkey and backward

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add a berief general comment on BTScanInsertData's nextkey and backward
Date: 2026-03-24 10:25:37
Message-ID: 20260324192537.089a69fd6729326e6276441f@sraoss.co.jp
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter,

On Wed, 31 Dec 2025 19:21:24 +0900
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:

> > I wonder if we also do something about these existing _bt_binsrch comments:
> >
> > * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
> > * of the last key < given scankey, or last key <= given scankey if nextkey
> > * is true. (Since _bt_compare treats the first data key of such a page as
> > * minus infinity, there will be at least one key < scankey, so the result
> > * always points at one of the keys on the page.)
> >
> > Here we describe what happens on an internal page. This is correct,
> > but *seems* to contradict the higher level comments at the end of
> > _bt_first. There is actually no contradiction between the two -- not
> > when you understand that _bt_first describes the whole end-to-end
> > process of calling _bt_search and then calling _bt_binsrch on the leaf
> > page (not of calling _bt_binsrch once, against an internal page).
> >
> > One could think of this _bt_binsrch internal page behavior as
> > compensating for the fact that internal pages have pivot tuples that
> > consist of a separator key (except for the first key, which is just
> > has a -inf key/no key), plus a downlink that points to the *next* page
> > after that separator key one level down (except for the internal page
> > high key, which has no downlink at all). Might be useful to say
> > something like that instead.
> >
> > This is hard to explain without an example. Right now, an internal
> > page might have pivot tuples that look like this:
> >
> > Separator: -inf, Downlink: 1
> > Separator: 'a', Downlink: 2
> > Separator: 'c', Downlink: 3
> > Separator: 'f', Downlink: (none, this is the high key)
> >
> > But _bt_binsrch "pretends" that our internal page actually contains:
> >
> > Downlink: 1
> > Separator: 'a'
> > Downlink: 2
> > Separator: 'c'
> > Downlink: 3
> > Separator: 'f'
> >
> > So if our = scan key is (say) 'c', we should descend using the
> > downlink that points to block 2 (not the one that points to block 3,
> > as might be expected from looking at the real physical representation
> > consisting of pivot tuples). That is what the scan needs to do to get
> > to the page one level down whose high key is also 'c' -- that's where
> > the scan needs to look. (If the next level down is the leaf level,
> > then the leaf page we descend to might also contain a *non*-pivot
> > tuple with the key value 'c', "right before" the high key with 'c',
> > which the scan will need to return in _bt_readpage. Lehman & Yao allow
> > the key before a leaf page's high key to be equal to the high key, but
> > otherwise forbid all duplicate keys.)
> >
> > I find it very hard to know what explanation will be the least
> > confusing to other people, at least in this area. Since you're
> > interested in this area, I thought I'd ask what you think.
>
> I do not see any contradiction between the comment on _bt_binsrch and
> the comments at the end of _bt_first. The former explicitly states that
> it refers to internal (non-leaf) pages, and I understand the latter to
> describe loading data from a leaf page.
>
> That said, we could make this clearer by explicitly mentioning the leaf
> page in the first sentence, for example:
>
> * Now load data from the first leaf page of the scan (usually the
> *page currently in so->currPos.buf).

I've added a follow-up patch to clarify the _bt_binsrch comments a bit more,
distinguishing internal and leaf page behavior.

Does this help reduce your concern?

Regards,
Yugo Nagata

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>

Attachment Content-Type Size
v2-0002-Add-a-berief-general-comment-on-BTScanInsertData-.patch text/x-diff 1.1 KB
v2-0001-Clarify-_bt_binsrch-comments-for-internal-vs-leaf.patch text/x-diff 3.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2026-03-24 10:30:14 Re: BUG: test_ginpostinglist second itemptr check is a no-op due to copy-paste error
Previous Message Hayato Kuroda (Fujitsu) 2026-03-24 10:06:08 RE: [Proposal] Adding Log File Capability to pg_createsubscriber