Re: generic plans and "initial" pruning

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tender Wang <tndrwang(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Daniel Gustafsson <daniel(at)yesql(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: generic plans and "initial" pruning
Date: 2025-11-12 14:17:43
Message-ID: CA+HiwqEF9SgKyQ1HrYOURpv8DGRGHDNwBT9Y6yEBVCW+=kh_=w@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Jul 22, 2025 at 3:43 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> On Thu, Jul 17, 2025 at 9:11 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > The refinements I described in my email above might help mitigate some
> > of those executor-related issues. However, I'm starting to wonder if
> > it's worth reconsidering our decision to handle pruning, locking, and
> > validation entirely at executor startup, which was the approach taken
> > in the reverted patch.
> >
> > The alternative approach, doing initial pruning and locking within
> > plancache.c itself (which I floated a while ago), might be worth
> > revisiting. It avoids the complications we've discussed around the
> > executor API and preserves the clear separation of concerns that
> > plancache.c provides, though it does introduce some new layering
> > concerns, which I describe further below.
> >
> > To support this, we'd need a mechanism to pass pruning results to the
> > executor alongside each PlannedStmt. For each PartitionPruneInfo in
> > the plan, that would include the corresponding PartitionPruneState and
> > the bitmapset of surviving relids determined by initial pruning. Given
> > that a CachedPlan can contain multiple PlannedStmts, this would
> > effectively be a list of pruning results, one per statement. One
> > reasonable way to handle that might be to define a parallel data
> > structure, separate from PlannedStmt, constructed by plancache.c and
> > carried via QueryDesc. The memory and lifetime management would mirror
> > how ParamListInfo is handled today, leaving the executor API unchanged
> > and avoiding intrusive changes to PlannedStmt.
> >
> > However, one potentially problematic aspect of this design is managing
> > the lifecycle of the relations referenced by PartitionPruneState.
> > Currently, partitioned table relations are opened by the executor
> > after entering ExecutorStart() and closed automatically by
> > ExecEndPlan(), allowing cleanup of pruning states implicitly. If we
> > perform initial pruning earlier, we'd need to keep these relations
> > open longer, necessitating explicit cleanup calls (e.g., a new
> > FinishPartitionPruneState()) invoked by the caller of the executor,
> > such as from ExecutorEnd() or even higher-level callers. This
> > introduces some questionable layering by shifting responsibility for
> > relation management tasks, which ideally belong within the executor,
> > into its callers.
> >
> > My sense is that the complexity involved in carrying pruning results
> > via this parallel data structure was one of the concerns Tom raised
> > previously, alongside the significant pruning code refactoring that
> > the earlier patch required. The latter, at least, should no longer be
> > necessary given recent code improvements.
>
> One point I forgot to mention about this approach is that we'd also
> need to ensure permissions on parent relations are checked before
> performing initial pruning in plancache.c, since pruning may involve
> evaluating user-provided expressions. So in effect, we'd need to
> invoke not just ExecDoInitialPruning(), but also
> ExecCheckPermissions(), or some variant of it, prior to executor
> startup. While manageable, it does add slightly to the complexity.

Sorry for the absence. I've now implemented the approach mentioned
above and split it into a series of reasonably isolated patches.

The key idea is to avoid taking unnecessary locks when reusing a
cached plan. To achieve that, we need to perform initial partition
pruning during cached plan reuse in plancache.c so that only surviving
partitions are locked. This requires some plumbing to reuse the result
of this "early" pruning during executor startup, because repeating the
pruning logic would be both inefficient and potentially inconsistent
-- what if you get different results the second time? (I don't have
proof that this can happen, but some earlier emails mention the
theoretical risk, so better to be safe.)

So this patch introduces ExecutorPrep(), which allows executor
metadata, such as initial pruning results (valid subplan indexes) and
the full set of unpruned_relids, to be computed ahead of execution and
reused later, both by ExecutorStart() and during QueryDesc setup in
parallel workers using the results shared by the leader. The parallel
query bit was discussed previously at [1], though I didn’t have a
solution I liked then.
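
To make the intended flow concrete, here is a hypothetical sketch of
the call ordering once ExecutorPrep() exists; the actual function
signatures in the patches may differ, this only shows where the new
step would sit relative to ExecutorStart():

#include "postgres.h"

#include "executor/executor.h"

/* Hypothetical caller-side ordering; signatures are illustrative. */
static void
StartQueryWithPrep(QueryDesc *queryDesc)
{
    /*
     * New step: run initial pruning and collect executor metadata (valid
     * subplan indexes, unpruned_relids) into the EState.
     */
    ExecutorPrep(queryDesc);

    /* ExecutorStart() consumes the prep state instead of re-pruning. */
    ExecutorStart(queryDesc, 0);

    /*
     * ExecutorRun(), ExecutorFinish(), and ExecutorEnd() then proceed as
     * before; ExecPrepCleanup() would only be needed if the plan were
     * invalidated during prep and never executed.
     */
}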

This revives an idea that was last implemented in the patch (v30)
posted on Dec 16, 2022. In retrospect, I understand the hesitation Tom
might have had about the patch at the time -- its changes to enable
early pruning and then feed the results into ExecutorStart() were less
than pretty. Thanks to the initial pruning code refactoring that I
committed in Postgres 18, those changes now seem much more principled
and modular IMO.

The patch set is structured as follows:

* Refactor partition pruning initialization (0001): separates the
setup of the pruning state from its execution by introducing
ExecCreatePartitionPruneStates(). This makes the pruning logic easier
to reuse and makes it possible to perform only the setup, skipping the
pruning step itself, in some cases.

* Introduce ExecutorPrep infrastructure (0002): adds ExecutorPrep()
and ExecPrep as a formal way to perform executor setup ahead of
execution. This enables caching or transferring pruning results and
other metadata without triggering execution. ExecutorStart() can now
consume precomputed prep state from the EState created during
ExecutorPrep(). ExecPrepCleanup() handles cleanup when the plan is
invalidated during prep and therefore never executed; otherwise the
state is cleaned up in the regular ExecutorEnd() path.

* Allow parallel workers to reuse leader pruning results (0003): lets
workers reuse the leader’s initial pruning results (valid subplan
indexes) and unpruned_relids via ExecutorPrep(). This adds a
verification step to check that the leader's and the worker's decisions
match, raising an error if they don't -- so the worker still computes
its own result to cross-check against the leader's rather than trusting
it blindly (a minimal sketch of that check follows this list). Perhaps
the check should be debug-only, though I'm currently inclined to keep
it always on. As mentioned above, this was previously discussed at [1].

* Enable pruning-aware locking in cached / generic plan reuse (0004):
extends GetCachedPlan() and CheckCachedPlan() to call ExecutorPrep()
on each PlannedStmt in the CachedPlan, locking only surviving
partitions. Adds CachedPlanPrepData to pass this through plan cache
APIs and down to execution via QueryDesc. Also reinstates the
firstResultRel locking rule added in 28317de72 but later lost when the
earlier pruning patch was reverted, to ensure correctness when all
target partitions are pruned.
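
For 0003, the cross-check mentioned above is conceptually just a
bitmapset comparison; a minimal sketch with made-up names (the actual
code and error message in the patch differ):

#include "postgres.h"

#include "nodes/bitmapset.h"

/*
 * Sketch only: the worker computes its own initial pruning result and
 * compares it with what the leader shared, erroring out on a mismatch.
 */
static void
VerifyLeaderPruningResult(const Bitmapset *leader_valid_subplans,
                          const Bitmapset *worker_valid_subplans)
{
    if (!bms_equal(leader_valid_subplans, worker_valid_subplans))
        elog(ERROR, "worker's initial pruning result differs from the leader's");
}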

This approach keeps plan caching and validation logic self-contained
in plancache.c and avoids invasive changes to the executor API.

Benchmark results:

echo "plan_cache_mode = force_generic_plan" >> $PGDATA/postgresql.conf
for p in 32 64 128 256 512 1024; do
  pgbench -i --partitions=$p > /dev/null 2>&1
  echo -ne "$p\t"
  pgbench -n -S -T10 -Mprepared | grep tps
done

Master

32 tps = 23841.822407 (without initial connection time)
64 tps = 21578.619816 (without initial connection time)
128 tps = 18090.500707 (without initial connection time)
256 tps = 14152.248201 (without initial connection time)
512 tps = 9432.708423 (without initial connection time)
1024 tps = 5873.696475 (without initial connection time)

Patched

32 tps = 24724.245798 (without initial connection time)
64 tps = 24858.206407 (without initial connection time)
128 tps = 24652.655269 (without initial connection time)
256 tps = 23656.756615 (without initial connection time)
512 tps = 22299.865769 (without initial connection time)
1024 tps = 21911.704317 (without initial connection time)

Comments welcome.

[1] https://www.postgresql.org/message-id/CA%2BHiwqFA%3DswkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw%40mail.gmail.com

--
Thanks, Amit Langote

Attachment Content-Type Size
v1-0001-Refactor-partition-pruning-initialization-for-cla.patch application/octet-stream 7.7 KB
v1-0002-Introduce-ExecutorPrep-infrastructure-for-pre-exe.patch application/octet-stream 29.9 KB
v1-0003-Reuse-partition-pruning-results-in-parallel-worke.patch application/octet-stream 9.0 KB
v1-0004-Use-pruning-aware-locking-in-cached-plans.patch application/octet-stream 25.0 KB
