| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | Tatsuo Ishii <ishii(at)postgresql(dot)org>, jian(dot)universality(at)gmail(dot)com |
| Cc: | pgsql-hackers(at)postgresql(dot)org, zsolt(dot)parragi(at)percona(dot)com, sjjang112233(at)gmail(dot)com, vik(at)postgresfriends(dot)org, er(at)xs4all(dot)nl, jacob(dot)champion(at)enterprisedb(dot)com, david(dot)g(dot)johnston(at)gmail(dot)com, peter(at)eisentraut(dot)org, li(dot)evan(dot)chao(at)gmail(dot)com |
| Subject: | Re: Row pattern recognition |
| Date: | 2026-07-04 16:39:43 |
| Message-ID: | CAAAe_zBMyW7=ZEgZ2=rgka328FKyBS6sw=xpWztv4AD8wskffw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi hackers,
This is a design note on the core data structures for CLASSIFIER and
row pattern variables, aimed at the next round of window-clause row
pattern recognition (RPR, R020) rather than the current patch. I'd
like to share it and open it up for discussion.
It opens with a reflection, because the reflection is the point. For
about six months I took it for granted that we must keep the full
per-row match history during matching. That premise was wrong. Once I
dropped it, the data-structure problem dissolved into a handful of
small, bounded pieces -- which is what the rest of the note works out:
what to store per case, how much, and how to encode it.
I believe the storage/representation side is settled. What I have
deliberately left open is the part that lives in the matching engine
-- the aggregate accumulator lifecycle (clone on branch, discard on
failure) and the winning-match replay. I'd welcome thoughts there,
particularly from Tatsuo and Jian, given their work on the current
patch.
Best regards,
Henson
----------------------------------------------------------------------
RPR CLASSIFIER / row pattern variable storage (window clause)
I. Reflection -- the "full match history" premise was wrong
Since January this year -- six months into starting RPR work in
earnest -- I kept reasoning under one premise:
To implement CLASSIFIER and row pattern variables, we must keep
the per-row sequence of variable labels (the match history) that
the match consumes, throughout matching.
For that half year I set the data-structure problem on top of it --
"where and how do we hold that whole sequence cheaply?" -- and kept
piling up storage / sharing / compression / lookup structures.
Yet no usable lookup structure ever came out. Worse: matchStates
branch and most of them vanish on match failure, so the cost of
building a lookup structure for each of those countless
(soon-discarded) states was exceeding the lookup efficiency it bought
-- paying up front, for the many that die, on behalf of the few that
survive.
The premise was wrong. RPR does not need to keep the full match
history during matching. The one place a full per-row sequence is
genuinely materialized is the universal-scope MEASURE aggregate (case
8, ARRAY_AGG(CLASSIFIER())). So that entire storage / sharing /
compression / lookup apparatus was solving a problem that did not need
solving.
II. The corrected picture -- what to store, per case
Here is how the storage requirement splits, case by case:
1. DEFINE navigation -- the reference range is bounded by the
navigation offset. The offset is a run-time constant, so the upper
bound to keep during execution is fixed and can be computed
directly, without any lookup. A nav with no row pattern variable
just derives the position from the offset and reads it from the
tuplestore. (How much to retain, per nav function, is detailed in
section III.)
2. bare row pattern variable reference (A.col) -- equivalent to
LAST(A.col, 0). The structure is a pos list for A -- since
everything in it is A, membership is implicit and no varId is
needed. As a bare reference this is the special case with length =
1 (only the last pos); the full list is needed only for an A
aggregate or a larger FIRST/LAST offset (section III).
3. SUBSET (union) variable -- for SUBSET U=(A,B), CLASSIFIER(U) and
U.col are merely derived from the primary classification (which row
is A/B); they are not new information. Because A and B are
interleaved in the list and we must tell them apart, the structure
is a list of (pos, varId) pairs.
4. bare CLASSIFIER() (current row) -- unless it is inside an aggregate
(ARRAY_AGG etc.), only the current row is needed (no list). The
classifier comes straight from the variable (its name) that the
pattern element currently being consumed by the matchState
designates.
5. aggregate in DEFINE -- it drives matching, so it is evaluated live
during matching under RUNNING semantics (no post-processing
possible). The aggregate inputs (which row is which variable, its
column, its classifier) are all already obtained from 1-4, so no
new structure is needed -- only one running accumulator folding
them on top. That is as far as the match-history view goes (no full
history needed). Separately, there are two costs: (a) computing
live for the many matchStates that mostly fail, and (b) cloning the
in-progress accumulator into children on a branch. But this is
aggregate accumulator management cost, not match history.
6-8 (MEASURE aggregates): since the match is already complete, it is
better to avoid paying for a live accumulator per failing candidate
(the cost in 5) and instead process only the winning match by post
replay. What each case must store/replay is below.
6. aggregate in MEASURE -- variable/SUBSET scope -- e.g. sum(A.price),
sum(U.price), ARRAY_AGG(CLASSIFIER(U)). Only membership is needed
(values in the tuplestore), compressed as a bitmap / run length
(alternating m A's and n non-A's; sequential, no lookup). But the
granularity differs: sum(A.price) / sum(U.price) need only is-A /
is-U, whereas ARRAY_AGG(CLASSIFIER(U)) must distinguish which
sub-variable, so it needs per-sub-variable membership (A and B
each). Those per-sub memberships are precisely the per-row member
label, and when several aggregates use the same sub-variable, these
memberships fuse into that single label. (For bare scalar
CLASSIFIER(U), see 3 -- just the last varId in the U list.)
7. aggregate in MEASURE -- CLASSIFIER(variable) inside an aggregate
(special case: optimized to the first-occurrence position) -- in
ARRAY_AGG(CLASSIFIER(A)), CLASSIFIER(A) is a running LAST, so its
value is null before the first A and the constant 'A' afterwards.
So instead of the full membership of 6, it optimizes to a single
first-occurrence position for A (the null->A boundary) -- a
single-variable special case. (For SUBSET CLASSIFIER(U) the member
varies, so this optimization does not apply; handle it in 6.) bare
scalar CLASSIFIER(A) is derived from 2 -- just whether A's list is
empty.
8. aggregate in MEASURE -- universal scope -- every match row is in
scope. For a column (e.g. sum(price)) it is all rows, so not even
membership is needed (values in the tuplestore). For
ARRAY_AGG(CLASSIFIER()), each row's value is the row pattern
variable name (the label), and the storage is a run-length encoding
of varId (a run of the same variable as (varId, run)) -- since all
rows are in order, no pos is needed; just pile up varId runs (order
implies pos). (For bare scalar CLASSIFIER(), see 4 -- just the last
row's varId.)
III. Retained length per nav function -- offset fixes the bound
If 1-2 above say what to hold, this pins down how much of it to hold
during matching, per nav function. The key: the offset is a run-time
constant (a literal, embedded variable, and the like -- not a column
or subquery), so the size of the retained window is fixed at plan
time. So you can pre-allocate a ring buffer of that size per variable,
and there is no lookup during matching.
Physical nav and logical nav split apart:
- physical nav (PREV/NEXT) -- the offset is a physical position in the
partition, so it does not grow the variable's history at all. The
anchor is just "that variable's running last row", and offset n
merely indexes the tuplestore from there. Even when NEXT looks
ahead, that row is already in the tuplestore (only its
classification is pending). So: retain = 1 anchor, extra +0.
- logical nav (FIRST/LAST) -- the offset is a position within the rows
mapped to that variable. Their retention costs are asymmetric.
FIRST(X.col, n) is the (n+1)-th from the front, and the front never
changes once passed, so you count X rows and latch only the single
row where the count reaches n+1 -- a counter + 1 holder, O(1)
regardless of n (later X rows are ignored). LAST(X.col, n) is the
(n+1)-th from the back, but since it is running the end keeps
moving, so you must hold X's last n+1 rows in a sliding window --
retain = n+1. bare X.col = LAST(X.col, 0) = 1.
- nested PREV/NEXT(FIRST/LAST(X.col, k), m) -- retention follows the
rule above for the inner (logical) part only: counter+holder for
FIRST, k+1 sliding for LAST. The physical part m just reads the
tuplestore from that row, so +0.
Boundary -- this bound applies to navigation references only. An
aggregate reference (sum(A.price) etc.) folds the entire prefix, so it
is absorbed into a running accumulator (5), not a window -- an O(1)
accumulated value, not a length. Do not mix the two into one
structure.
IV. Compressed storage -- one run-length spine
2-3 (during matching) and 6-8 (MEASURE replay) use the same
information under different access patterns, so their encodings
diverge:
- live / sparse (2-3) -- during matching you append only the rows
mapped to X, as they come. It is a subsequence of all rows, so to
know which row is which you must carry the pos too (2's pos list,
3's pos+varId).
- replay / dense (6-8) -- you walk the winning match from the start
over every match row, so no pos is needed -- order is the pos. This
is where compression pays off.
The replay-side structures are all one run-length spine, specialized
only by granularity:
- universal classifier (8) -- per-row varId as (varId, run) runs. No
pos; the cumulative run lengths give the position.
- per-variable membership (6) -- is-X as alternating runs: just a
repetition of (present length, absent length) pairs (alternative: a
1-bit-per-row bitmap). The case where is-A alone suffices, as in
sum(A.price).
- per-sub-member (6) -- (varId, run) within the union, when you must
tell A/B apart as in ARRAY_AGG(CLASSIFIER(U)). Rows outside U
(non-A/non-B) must be carried as runs of a "none" sentinel varId so
that order = pos is preserved.
- single-variable boundary (7) -- ARRAY_AGG(CLASSIFIER(A)) has only
the null->A boundary, so the run degenerates to essentially a single
item.
single varId RLE vs several per-variable memberships -- for the
per-sub-member case the two encodings trade off. (i) single varId RLE
(the bullet above): the per-row label comes out directly, but every
aggregate scans the whole span. (ii) several per-variable memberships:
is-A and is-B each as alternating runs (no sentinel needed) -- an
aggregate can pick just the sub-variable streams it uses, but a
CLASSIFIER(U) label must be read by merging several streams.
bitmap vs RLE -- a match is inherently runs (the greedy A^m B^n
shape). So RLE is natural, and since consumption is a sequential scan
no index is needed. A 1-bit-per-row bitmap (as long as the match) wins
only when the runs fragment finely or random access is needed.
Values are not stored -- the spine carries only
membership/label(varId); the actual column values live in the
tuplestore, and the replay scan fetches them by pos as it goes. This
is the principle behind 6-8 repeatedly noting that values stay in the
tuplestore.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | dinesh salve | 2026-07-04 17:07:19 | Re: explain plans for foreign servers |
| Previous Message | Rui Zhao | 2026-07-04 16:35:44 | Re: [PATCH] Add pg_get_table_ddl() to reconstruct CREATE TABLE statements |