Re: Row pattern recognition

From: Henson Choi <assam258(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>, jian he <jian(dot)universality(at)gmail(dot)com>
Cc: zsolt(dot)parragi(at)percona(dot)com, sjjang112233(at)gmail(dot)com, vik(at)postgresfriends(dot)org, er(at)xs4all(dot)nl, jacob(dot)champion(at)enterprisedb(dot)com, david(dot)g(dot)johnston(at)gmail(dot)com, peter(at)eisentraut(dot)org, li(dot)evan(dot)chao(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2026-06-02 23:50:28
Message-ID: CAAAe_zDz3z2Paidk3jHOm9S3eMVLoXRxK0Lyo=5i_9-EfSH7fA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

While going over the row pattern grammar I want to put on record why the
empty pattern -- PATTERN (()) and the like -- is left unsupported, and why
I think that is the right call for this series rather than an oversight.

Today it is simply a syntax error. row_pattern_primary is either a
variable (ColId) or a parenthesized group '(' row_pattern ')', and
row_pattern has no empty production, so '()' never parses. The standard's
pattern syntax does allow an empty row pattern -- it matches the empty
sequence, i.e. it produces an empty match -- so the question is whether we
should grow the grammar, plus an NFA "empty" element, to accept it.

My claim is that we do not need to: every way an empty pattern can appear
reduces to something we already handle, so a dedicated empty element in
the executor would be dead weight. There are two cases.

1. The empty pattern is the whole pattern: PATTERN (())

This pattern has zero pattern variables. But DEFINE is mandatory (per
ISO/IEC 19075-5, Table 18), an empty DEFINE list is rejected, and every
DEFINE variable must appear in PATTERN -- otherwise we already error with
"DEFINE variable \"%s\" is not used in PATTERN". A pattern with no
variables cannot satisfy any of that: there is nothing for DEFINE to
define, yet DEFINE can be neither omitted nor left empty. So an all-empty
pattern is rejected by the existing rules, with no new check needed.

(It is degenerate in any case. An empty match at every row maps to "no
reduced frame" -- row_is_in_reduced_frame() returns -1 for RF_EMPTY_MATCH
exactly as it does for RF_UNMATCHED -- so every row would simply be
unmatched.)

2. The empty pattern is embedded: A () B

Here '()' consumes no rows; it is the identity element under
concatenation, so A () B is equivalent to A B. This stays entirely on the
parser/optimizer side: the empty-pattern production would carry an empty
AST node, and an identity fold in the SEQ simplification (next to the
prefix/suffix and consecutive merges it already does) drops the '()'
before the executor ever sees it. The runtime is untouched -- which is
the whole point. And if the fold removes everything, the pattern has
collapsed to case 1 and is rejected there.

The AST rewrites it would add:

Concatenation (identity element):
A () B -> A B
() A -> A
A () -> A
A () () B -> A B

Group unwrap (single non-empty child):
(A ()) -> (A) -> A
(() A) -> (A) -> A

Quantified empty (empty repeated is still empty):
()* ()+ ()? (){3} -> (removed)
A ()* B -> A B

Collapses to case 1, then rejected:
(()) (() ()) ((())) -> () -> rejected as case 1

The one form that is not pure deletion is an empty alternative. Here the
empty branch is optionality, not nothing, but it is exactly the '?'
quantifier applied to the rest -- X | () is (X)?:

X | () -> (X)? (and () | X -> (X)??, reluctant)
A | () -> A? (single var: group is redundant)
A{2} | () -> (A{2})? (the group is required here -- A{2}?
would parse as a reluctant A{2}, not
an optional one)
A B | () -> (A B)? (X may be a sequence)
A | B | () -> (A | B)? (... or an alternation)
(A | ()) B -> A? B

The branch order maps to greediness ('X | ()' greedy, '() | X' reluctant).
Either way it rewrites to a group plus '?', both of which we already
support, so it still needs no new machinery.

The rewrites also compose across nesting. A nested empty alternation
applies the rule at each level, producing a nested nullable quantifier:

(A | ()) | ()
-> A? | () (inner A | () -> A?)
-> (A?)? (outer X | () -> (X)?, X = A?)

and (A?)? = A?, (A+)? = A*, (A*)? = A* are all ordinary nullable
constructs, never a new shape:

A? | () -> (A?)? (= A?)
A+ | () -> (A+)? (= A*)
A* | () -> (A*)? (= A*)

Whether the optimizer collapses these to a single quantifier or leaves the
group as-is doesn't matter here -- both are nullable groups the runtime
already handles, so still no executor-level element. (The optimizer
leaves these unfolded: a non-exact outer over a nullable child falls
outside tryMultiplyQuantifiers' conservative guard, which skips the whole
class that can have gaps (e.g. (A{2}){2,3} = {4,6}). That is an
over-rejection, not a correctness defect -- folding the nullable cases
would be a safe optional extension; see the aside below.)

(Duplicate empty branches would be merged first by the alternation dedup,
so A | () | () reduces through A | () to A? just the same.)

Putting it together: across the board an empty pattern is either

(a) rejected by the existing DEFINE rules (case 1),
(b) the concatenation identity, removable by the AST simplification
(case 2), or
(c) an existing nullable construct ('?', the empty alternative),

and none of these would ever need an executor-level empty element.

Given that, and that the feature has little practical use while the
current series is already sizable, I am leaving the empty pattern out of
scope: it stays a syntax error. Concretely, accepting it would take a
grammar production, an empty AST node, the concatenation-identity deletion
in the SEQ pass, and (optionally) an extension of the quantifier
composition to fold the nullable cases -- all new work, none of it in this
series. We can revisit it as that small follow-up if anyone actually
wants it.

Thanks,
Henson

------------------------------------------------------------------------
Aside (separable from the post above): quantifier-composition gaps,
unrelated to the empty pattern
------------------------------------------------------------------------

Independent of the empty pattern, the AST-level quantifier-composition
pass (tryMultiplyQuantifiers, rpr.c) over-rejects: it collapses a nested
quantifier (X{p,q}){m,n} into a single X{...} only when the outer is exact
(m == n), the child is {1,1}, or both are unbounded. Those are the
always-safe sufficient conditions. But the full foldable class is larger:
(X{p,q}){m,n} is foldable to X{m*p, n*q} whenever the reachable repetition
counts -- the union over k in [m,n] of [k*p, k*q] -- are contiguous (no
gaps). That holds for every nullable child (p == 0) and, more generally,
whenever the child span q-p is wide enough to bridge the jump between k and
k+1 iterations. The guard thus rejects a strict superset of the genuinely
unsafe (gap-producing) patterns: a class of provably-safe folds is left
undone, and those patterns reach the executor as nested groups -- handled
correctly, just not normalized. This is an optimization over-rejection,
not a correctness defect: the output is identical either way.

The two rules side by side (outer {m,n} over child {p,q}, INF = unbounded):

current guard -- folds iff:
m = n (outer exact)
OR (p = 1 AND q = 1) (child {1,1})
OR (q = INF AND n = INF AND NOT(m = 0 AND p >= 2)) (both unbounded)

true safe set -- foldable iff the reachable counts are contiguous:
m = n (single interval)
OR p = 0 (nullable child)
OR ( p <= max(m,1)*(q - p) + 1 (child span bridges the
AND (m >= 1 OR p <= 1) ) k -> k+1 jump; child
{1,1}
is the p=q=1 instance)

over-rejected = safe AND NOT current:
the non-exact-outer (m != n) folds over a gap-free child -- p = 0
(nullable) or a span-bridging child. e.g. (A?)?, (A*)?, (A+)?,
(A?){2,3}, (A+){0,2}, (A{1,3}){2,4} (all in the table below).
The boundaries (A{2}){2,3} and (A{2,})* are NOT in the safe set
(p = 2 fails p <= max(m,1)*(q-p)+1), so both rules reject them --
correctly; widening the guard must keep doing so.

Verified on the branch -- EXPLAIN's "Pattern:" line shows the optimized
form (a nested group means it was not folded; a flat quantifier means it
was):

folds today (correct):
(A){2,3} -> a{2,3} (child {1,1})
(A{2,3}){2} -> a{4,6} (outer exact)
(A*)* -> a* (both unbounded)

gap-free but NOT folded (stays nested; "want" is the safe normal form):
(A?)? -> (a?)? want a?
(A*)? -> (a*)? want a*
(A+)? -> (a+)? want a*
(A?){2,3} -> (a?){2,3} want a{0,3}
(A+){0,2} -> (a+){0,2} want a*
(A{1,3}){2,4} -> (a{1,3}){2,4} want a{2,12}

correctly NOT foldable (real gap -- guard is right here):
(A{2}){2,3} -> (a{2}){2,3} reaches {4,6}, not 4..6

The first three "not folded" rows are the nullable forms the empty pattern
would lean on; the next three show the gap is broader (a range outer over
a non-{1,1} child can still be gap-free). The win is only normalization /
a smaller NFA, so this is low priority. And this exact pass has a history
of subtle gap bugs (the (A{2,})* over-flatten and the INF-sum merge, both
recently fixed), so widening the guard toward the safe set above should be
its own patch, with boundary tests covering the foldable cases and the two
gaps -- (A{2}){2,3} and (A{2,})* -- that must stay un-folded.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2026-06-02 23:53:43 Re: Improve pg_stat_statements scalability
Previous Message Sami Imseih 2026-06-02 23:34:32 Re: Report oldest xmin source when autovacuum cannot remove tuples