From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: initial pruning in parallel append
Date: 2023-06-27 13:22:33
Message-ID: CA+HiwqFA=swkzgGK8AmXUNFtLeEXFJwFyY3E7cTxvL46aa1OTw@mail.gmail.com

Hi,

In an off-list chat, Robert suggested that it might be a good idea to
look more closely into $subject, especially in the context of the
project of moving the locking of child tables / partitions to the
ExecInitNode() phase when executing cached generic plans [1].

Robert's point is that the set of child subplans (of a parallel-aware
Append or MergeAppend) that a worker's initial pruning deems valid for
execution may not match the set computed by the leader or by other
workers. If that does indeed happen, it may confuse the Append's
parallel-execution code and possibly even cause crashes, because the
ParallelAppendState set up by the leader assumes a certain number and
identity (?) of valid-for-execution subplans.
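
To make the scenario concrete, here's a sketch (untested; the table,
data, and cost settings are made up for illustration) of how one might
get a generic plan containing a Parallel Append whose subplans are
subject to initial pruning:

create table pt (a int, b text) partition by range (a);
create table pt1 partition of pt for values from (0) to (1000);
create table pt2 partition of pt for values from (1000) to (2000);
create table pt3 partition of pt for values from (2000) to (3000);
insert into pt select i, md5(i::text) from generate_series(0, 2999) i;
analyze pt;

-- Make a parallel plan and a generic (run-time-pruned) plan likely.
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set plan_cache_mode = force_generic_plan;

prepare q(int) as select count(*) from pt where a < $1;

-- "Subplans Removed" in the EXPLAIN output indicates initial pruning;
-- with a Parallel Append, the leader and each worker currently perform
-- that pruning independently.
explain (costs off) execute q(1000);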

So he suggests that initial pruning should be done only once, in the
leader, and the result of that put into the EState for
ExecInitParallelPlan() to serialize and pass down to the workers.
Workers would simply consume that as-is to set the valid-for-execution
child subplans in their copies of AppendState, instead of doing the
initial pruning again. Actually, earlier patches at [1] had implemented
that mechanism (remembering the result of initial pruning and using it
at a later time and place), because the earlier design there was to
move the initial pruning of the nodes in a cached generic plan tree
from ExecInitNode() to GetCachedPlan(). The result of initial pruning
done in the latter would be passed down to and consumed by the former
using what were called PartitionPruneResult nodes.

Maybe that stuff could be resurrected, though I was wondering if the
risk of the same initial pruning steps returning different results
when performed repeatedly in *one query lifetime* isn't pretty
minimal, or maybe even non-existent? I think that's because performing
initial pruning steps entails computing constant and/or stable
expressions and comparing them with an unchanging set of partition
bound values, using comparison functions whose results are also
presumed to be stable. Then there's also the step of mapping the
partition indexes as they appear in the PartitionDesc to the indexes
of their subplans under Append/MergeAppend, using the information
contained in PartitionPruneInfo (subplan_map); the result of that
mapping should be immutable too.
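
As a simple example of what initial pruning computes (sketch; the
partition layout is made up), a stable expression like now() can't be
folded at plan time, but within one execution its value is fixed and
is compared against unchanging partition bounds:

create table events (ts timestamptz, payload text)
  partition by range (ts);
create table events_old partition of events
  for values from ('-infinity') to ('2023-01-01');
create table events_new partition of events
  for values from ('2023-01-01') to ('infinity');

-- now() is stable: the planner can't prune with it, so the executor
-- does, as seen in the "Subplans Removed: 1" line; repeating that
-- computation within the same execution should give the same answer.
explain (costs off) select * from events where ts >= now();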

I considered that the comparison functions that
match_clause_to_partition_key() obtains by calling get_opfamily_proc()
may in fact not be stable, though that doesn't seem to be a worry at
least with the out-of-the-box pg_amproc collection:

select amproc, p.provolatile
  from pg_amproc, pg_proc p
  where amproc = p.oid and p.provolatile <> 'i';
          amproc           | provolatile
---------------------------+-------------
 date_cmp_timestamptz      | s
 timestamp_cmp_timestamptz | s
 timestamptz_cmp_date      | s
 timestamptz_cmp_timestamp | s
 pg_catalog.in_range       | s
(5 rows)
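
(Here's a variant of the same check that also shows which operator
family and AM each of these belongs to, which may be handy on a
database with user-defined operator classes; the joins are just the
obvious catalog ones:)

select am.amname, opf.opfname, amp.amproc, p.provolatile
  from pg_amproc amp
  join pg_opfamily opf on amp.amprocfamily = opf.oid
  join pg_am am on opf.opfmethod = am.oid
  join pg_proc p on amp.amproc = p.oid
  where p.provolatile <> 'i';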

Is it possible for a user to add a volatile procedure to pg_amproc?
If it is, match_clause_to_partition_key() may pick such a procedure as
a comparison function for pruning, because it doesn't actually check
the procedure's provolatile before doing so. I'd hope not, though I'd
like to be sure of that to support what I wrote above.
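
As far as I can see, nothing in the CREATE OPERATOR CLASS path
validates volatility, so something like the following sketch
(untested; the names are made up, and creating an operator class
requires superuser) would seemingly get a volatile function into
pg_amproc:

create function vol_int4cmp(int4, int4) returns int4
  language sql volatile
  as $$ select btint4cmp($1, $2) $$;

create operator class vol_int4_ops for type int4 using btree as
  operator 1 <,
  operator 2 <=,
  operator 3 =,
  operator 4 >=,
  operator 5 >,
  function 1 vol_int4cmp(int4, int4);

-- A table partitioned with this opclass would then have its bounds
-- compared using a volatile support function:
--   create table vp (a int4) partition by range (a vol_int4_ops);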

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/43/3478/
