Re: Fully documenting the design of nbtree row comparison scan keys

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fully documenting the design of nbtree row comparison scan keys
Date: 2025-10-31 16:15:28
Message-ID: CAH2-Wzk5AQpYtDFHj4uQwon=_JJGpB4=pXGAfpmUX8PQch7nZA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 31, 2025 at 4:57 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> Personally, I feel that the new comment is even harder to understand than reading the code directly.

I don't disagree. But that's not the goal that I have in mind.

My goal is to make it clear when it is okay to treat row comparisons
just like scalar inequalities (on the same column as a row compare's
leading column), and when it is not okay. This matters to code in
places like _bt_advance_array_keys (which knows about inequalities but
not about row compare inequalities specifically), and in places like
_bt_set_startikey (which has to know about the difference between row
compares and simple scalar inequalities).

> ```
> + * Things get more complicated for our row compare with rows where "a = 7".
> + * Note that a row comparison isn't necessarily satisfied by _every_ row that
> + * appears after the first satisfied/matching row. A forwards scan that uses
> + * our example qual might first return a row "(a, b, c) = (7, 'zebra', 54)".
> + * But it must not return a row "(a, b, c) = (7, NULL, 1)" that'll appear to
> + * the right of the first match (assumes that "b" was declared NULLS LAST).
> + * The scan only returns additional matches upon reaching rows where "a > 7".
> + * If you rereview our example row comparison's logical expansion, you'll
> + * understand why this is so.
> + *
> ```
>
> This paragraph is actually describing how index data are stored, (in which order index data are loaded,) but I think that doesn’t matter to this function.

The order in which data is stored is obviously relevant. The main use
case for row compares is to implement keyset pagination, where each
scan returns only a subset of index tuples in index order. The next
scan starts to return tuples precisely after the end of matches from
the previous scan. This has to work in lexicographic/index order to
reliably avoid returning the same row twice (across each of the scans
performed by the application). That's the whole point.

In general, with both row compares and simple scalar inequalities, we
can end an index scan upon reaching a NULL tuple value that doesn't
satisfy a row compare qual according to NULL semantics (the result is
UNKNOWN, not false). This is safe only when we know that NULL is the
last value stored in the index column -- NULL doesn't satisfy the
inequality, *and* there can be no later value after NULL that will
satisfy the inequality for the entire qual, so we can just end the
scan right away. The nbtree code is directly exploiting what it knows
about the order that the data is stored (and NULL semantics).

> The corresponding code for this process is quite short. The big portion of the function are handling the situations when column data is NULL and when const data is NULL, where there are inline comments to describe the behaviors.

If it's so obvious, then why did 2016 bugfix commit a298a1e0 get it
wrong? It's easy to get confused about why and how these rules apply,
and the implications. See the last paragraph of my commit bd3f59fd for
full details.

It's not obvious that NULLs in lower-order index columns will make a >
row compare qual not return any matches at the very start of a
forwards scan (or at the very end of a backwards scan), without
affecting the results for the rest of the scan in any way. And so the
recently added row compare code in _bt_set_startikey must be extra
careful: it cannot just assume that every tuple on the page must
satisfy a row compare qual just because both the first and last index
tuple on the page satisfy the row compare qual. The funny rules about
NULLs mean that there could be a group of index tuples in the middle
of the page that don't satisfy the row compare (in spite of the first
and last tuples doing so). This is very much unlike simple scalar
inequalities, which will always return *all* tuples in the index after
the first matching tuple up to and including the last matching tuple,
without any "gaps" where index tuples are examined but not returned by
the scan (assuming that the inequality key is marked required to
continue the scan, which is typical).

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2025-10-31 16:24:17 Re: libpq: Bump protocol version to version 3.2 at least until the first/second beta
Previous Message Álvaro Herrera 2025-10-31 16:15:14 Re: List TAP test files in makefiles