| From: | Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> |
|---|---|
| To: | Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> |
| Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Add a berief general comment on BTScanInsertData's nextkey and backward |
| Date: | 2026-03-24 10:25:37 |
| Message-ID: | 20260324192537.089a69fd6729326e6276441f@sraoss.co.jp |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Peter,
On Wed, 31 Dec 2025 19:21:24 +0900
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
> > I wonder if we also do something about these existing _bt_binsrch comments:
> >
> > * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
> > * of the last key < given scankey, or last key <= given scankey if nextkey
> > * is true. (Since _bt_compare treats the first data key of such a page as
> > * minus infinity, there will be at least one key < scankey, so the result
> > * always points at one of the keys on the page.)
> >
> > Here we describe what happens on an internal page. This is correct,
> > but *seems* to contradict the higher level comments at the end of
> > _bt_first. There is actually no contradiction between the two -- not
> > when you understand that _bt_first describes the whole end-to-end
> > process of calling _bt_search and then calling _bt_binsrch on the leaf
> > page (not of calling _bt_binsrch once, against an internal page).
> >
> > One could think of this _bt_binsrch internal page behavior as
> > compensating for the fact that internal pages have pivot tuples that
> > consist of a separator key (except for the first key, which is just
> > has a -inf key/no key), plus a downlink that points to the *next* page
> > after that separator key one level down (except for the internal page
> > high key, which has no downlink at all). Might be useful to say
> > something like that instead.
> >
> > This is hard to explain without an example. Right now, an internal
> > page might have pivot tuples that look like this:
> >
> > Separator: -inf, Downlink: 1
> > Separator: 'a', Downlink: 2
> > Separator: 'c', Downlink: 3
> > Separator: 'f', Downlink: (none, this is the high key)
> >
> > But _bt_binsrch "pretends" that our internal page actually contains:
> >
> > Downlink: 1
> > Separator: 'a'
> > Downlink: 2
> > Separator: 'c'
> > Downlink: 3
> > Separator: 'f'
> >
> > So if our = scan key is (say) 'c', we should descend using the
> > downlink that points to block 2 (not the one that points to block 3,
> > as might be expected from looking at the real physical representation
> > consisting of pivot tuples). That is what the scan needs to do to get
> > to the page one level down whose high key is also 'c' -- that's where
> > the scan needs to look. (If the next level down is the leaf level,
> > then the leaf page we descend to might also contain a *non*-pivot
> > tuple with the key value 'c', "right before" the high key with 'c',
> > which the scan will need to return in _bt_readpage. Lehman & Yao allow
> > the key before a leaf page's high key to be equal to the high key, but
> > otherwise forbid all duplicate keys.)
> >
> > I find it very hard to know what explanation will be the least
> > confusing to other people, at least in this area. Since you're
> > interested in this area, I thought I'd ask what you think.
>
> I do not see any contradiction between the comment on _bt_binsrch and
> the comments at the end of _bt_first. The former explicitly states that
> it refers to internal (non-leaf) pages, and I understand the latter to
> describe loading data from a leaf page.
>
> That said, we could make this clearer by explicitly mentioning the leaf
> page in the first sentence, for example:
>
> * Now load data from the first leaf page of the scan (usually the
> *page currently in so->currPos.buf).
I've added a follow-up patch to clarify the _bt_binsrch comments a bit more,
distinguishing internal and leaf page behavior.
Does this help reduce your concern?
Regards,
Yugo Nagata
--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0002-Add-a-berief-general-comment-on-BTScanInsertData-.patch | text/x-diff | 1.1 KB |
| v2-0001-Clarify-_bt_binsrch-comments-for-internal-vs-leaf.patch | text/x-diff | 3.0 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | John Naylor | 2026-03-24 10:30:14 | Re: BUG: test_ginpostinglist second itemptr check is a no-op due to copy-paste error |
| Previous Message | Hayato Kuroda (Fujitsu) | 2026-03-24 10:06:08 | RE: [Proposal] Adding Log File Capability to pg_createsubscriber |