Re: [SQL/PGQ] Early pruning for GRAPH_TABLE path generation

From: Junwang Zhao <zhjwpku(at)gmail(dot)com>
To: assam258(at)gmail(dot)com
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Subject: Re: [SQL/PGQ] Early pruning for GRAPH_TABLE path generation
Date: 2026-05-12 00:57:09
Message-ID: CAEG8a3KUd1eMDsa2TP=gYyPWnMjD5HLgkrgp+yC_wV0RSm+sXg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Henson,

On Sat, May 9, 2026 at 10:52 PM Henson Choi <assam258(at)gmail(dot)com> wrote:
>
> Hi Junwang,
>
> Thanks for an interesting thread, and for picking this up — it has
> been a useful occasion to re-examine assumptions I had carried over
> from a different graph model.
>
> My background is on AgensGraph, where labels form an inheritance
> hierarchy (rather than Neo4j-style multi-label tag sets), so a path
> pattern rewrites against a single parent table per position and the
> fan-out across labels is left to the planner's existing inheritance
> machinery. The rewriter never enumerates per-element combinations, so
> an N^K blow-up like Satya's was simply not on the radar there.

Thanks for the input, good to know.

>
> SQL/PGQ instead designates graph identity through the schema itself,
> so each pattern position carries a set of candidate element tables
> and the rewriter has to materialize each schema-feasible join
> combination — a (vertex, edge, vertex, ...) sequence of element
> tables — as its own SELECT branch, with all such branches joined by
> UNION ALL. The number of *feasible* join combinations is bounded by
> the schema graph's connectivity and is usually small; the current
> rewriter's naive shape, however, is to enumerate the full N^K
> candidate space first and reject infeasible combinations only
> afterwards. The patch tackles exactly that part.
>
> To be clear, I do not mean this as a criticism of the original
> implementer. For sizeable features I tend to prefer a deliberately
> naive implementation first — getting the full functionality working
> end-to-end before optimizing — and that approach is even more
> valuable in team settings, since a working baseline lets additional
> contributors join much earlier. The current shape of
> generate_queries_for_path_pattern_recurse() reads exactly like such a
> baseline, and this patch is precisely the kind of focused follow-up
> improvement it invites.
>
> When I thought about this independently, I had reached for the dual
> formulation — "extend the prefix only along feasible edges" rather
> than "enumerate everything and prune" — and my intuition was that the
> two should produce the same surviving set, though I have not worked
> that through rigorously. Either way, your "early pruning" framing is
> the better one for the patch itself: it stays close to the existing
> code structure and lets the change read as a small refinement of
> generate_queries_for_path_pattern_recurse() rather than a
> reorganization of how candidates are walked. That makes review and
> incremental landing materially easier.
>
> The same reframing also explains, after the fact, why the discrepancy
> between Ashutosh's quick runs and Satya's 81 GB case was so large:
> the cost being measured is in large part infeasible enumeration,
> which a forward-moved check eliminates regardless of pattern length
> or schema size in realistic graphs.
>
> I have not gone through the patch at the code level yet, but the
> shape of the problem and of your fix is clear enough from the
> description. My personal take, with that caveat, is that early
> pruning, together with the CHECK_FOR_INTERRUPTS that Satya already
> proposed in the earlier thread [1], feels like the right pair to
> land first; the explicit hard limit, in turn, is probably best
> revisited only if a real case turns up where even those two combined
> are not enough. That keeps the immediate change narrow and
> semantics-preserving while leaving the policy question open for
> evidence rather than upfront estimation.

Agreed, I add a CHECK_FOR_INTERRUPTS to the patch now, see
the attached.

>
> There is a fairly recent precedent in PostgreSQL itself that I think
> supports this ordering. I have been following — and participating in
> — Tatsuo Ishii's Row Pattern Recognition thread [2] since earlier
> this year, and that work ran into a structurally similar concern: in
> its earlier implementation the engine could undergo a combinatorial
> explosion of pattern-candidate states. A memory-based limit was
> floated as one mitigation, but it never actually converged into a
> design — the conversation around an explicit limit simply faded out
> as a layered set of pruning and early-termination rules accumulated
> incrementally and brought the state space back under control. CFI
> and stack-depth checks, by contrast, were treated as independently
> necessary throughout and stayed in the design.
>
> The shape of this thread feels analogous, and the same staged
> approach (structural fix + CFI first, hard limit only if a residual
> case genuinely demands it) is, I suspect, what will hold up
> here as well.

Thanks for the insightful analysis, it's extremely helpful.

>
> I'll set aside time separately to do a proper line-by-line review and
> follow up.
>
> Regards,
> Henson
>
> [1] https://www.postgresql.org/message-id/CAHg%2BQDe8JU%2BERqA2xwjrg_ZptvH_v0T6PS9_P_ZgyYzD5h-Grw%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/flat/20230625.210509.1276733411677577841.t-ishii%40sranhm.sra.co.jp

--
Regards
Junwang Zhao

Attachment Content-Type Size
v1-0001-Prune-non-matching-graph-path-prefixes-during-DFS.patch application/octet-stream 5.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2026-05-12 01:18:19 Re: 2026-05-14 release announcement draft
Previous Message Hayato Kuroda (Fujitsu) 2026-05-12 00:42:28 RE: Patch for migration of the pg_commit_ts directory