From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Richard Guo <guofenglinux(at)gmail(dot)com>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "bruce(at)momjian(dot)us" <bruce(at)momjian(dot)us>, lepihov(at)gmail(dot)com |
Subject: | Re: plan shape work |
Date: | 2025-09-23 15:34:45 |
Message-ID: | CA+TgmoY39tYBVFHL4C9E3_7kPFQHJei502FhkN8qNSTXiMrk9g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Sep 22, 2025 at 2:15 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I found time finally to look through all of these. Some notes:
Thanks. Committed 0001 and 0002 together. I agree with you that 0002
is a bit weird, but I think it is less weird than the status quo ante,
and it seems you agree. I think it's best to regard a Result node that
replaces a scan or join as being a form of scan or join that just so
happens to be able to be optimized to a degree not normally possible.
There's a good argument for calling something like this a dummy
scan/join, as I mentioned before. There's no compelling reason we have
to make that terminology change, as we noted in the earlier
discussion, but it's a useful and not incorrect way of thinking about
it.
> 0003: doesn't feel ready for commit. In the first place, you've used
> the "completed an outer join" terminology in the commit message and
> a couple of comments, but defined it nowhere. I think a para or
> two in optimizer/README to define "starting" and "completing"
> outer joins is essential. (I'm also still quite bemused by marking
> nodes that complete outer joins but not those that start them.)
> In the second place, we should not need to add two hundred lines
> of new code to createplan.c to accomplish this. Why not simply
> bms_difference the joinrel's relids from the union of the inputs'
> relids? (And no, I do not believe that computing the value two
> different ways so you can assert they're the same is adding anything
> whatsoever except wasted cycles.)
So, let me just back up a minute here and talk about overall goals.
What I originally wanted to do with this patch is ensure that the
non-RTE_JOIN RTIs from the joinrel are all mentioned in the final
plan. Without the changes to the Result node, that's not the case;
with the changes to the Result node, that is, AFAICT, now the case. It
can be argued that what has been numbered 0003 up to now is not really
necessary at all, since all it does is make it less likely that we
will introduce cases with similar problems to the Result-node case in
the future, and that could be judged unlikely enough to make 0003 not
worth committing. However, it seems like you might be proposing
keeping the patch in some form but throwing out the part of the code
that walks the plan tree to collect RTIs, and that doesn't make a
whole lot of sense to me. I have some confidence that the joinrel's
RTI set will include all of the RTIs from the input rels; what I'm
worried about is whether those RTIs survive into the final Plan.
On the other hand, I'm not sure that I'm interpreting your remarks
correctly. When you say "bms_difference the joinrel's relids from the
union of the inputs' relids" maybe you're specifically talking about
the handling of the RTE_JOIN relids, and I don't care very much how we
account for those. So I guess I need some clarification here as to
what your thinking is.
With regard to your other comments, I'm not opposed to updating the
README, or alternatively, I'm also not opposed to adjusting the
wording of the comments and commit message to avoid that particular
terminology. As far as your parenthetical comment about not marking
outer joins started, I don't actually care what we do, but as I said
before, I don't see an efficient way to identify outer joins started
at a particular level, and it isn't worth adding planner cycles for
information for which I have no concrete need.
> 0004: commit msg is not very convincing about why this is a good idea.
> It really looks like change for the sake of change, so you need to
> make a better case for it.
You're right. That commit message presupposes that the goal is already
understood, instead of explaining it. See the first message on this
thread, in the paragraph that begins "Now let's talk about problem
#2," for the justification. Quoting the most relevant part:
Subqueries sort of have names right
now, at least some of them, but it's an odd system: a CTE subquery,
for example, has the name mentioned by the user, but other kinds of
subplans just get names like "InitPlan 3" or "SubPlan 2". The real
problem, though, is that those names are only assigned after we've
FINISHED planned the subquery. If we begin planning our very first
subquery, it might turn out to be InitPlan 1 or SubPlan 1, or if while
planning it we recurse into some further subquery then *that* subquery
might become InitPlan 1 or SubPlan 1 and OUR subquery might become
InitPlan 2 or SubPlan 2 (or higher, if we find more subqueries and
recurse into them too). Thus, being given some information about how
the user wants, say, SubPlan 2 to be planned is completely useless
because we won't know whether that is us until after we've done the
planning that the user is trying to influence.
I need to figure out some good way of explaining this (hopefully not
too verbosely) in the commit message.
> Also, this output seems inconsistent:
>
> Function Scan on pg_catalog.generate_series x
> - Output: ARRAY(SubPlan 1)
> + Output: ARRAY(array_1)
> Function Call: generate_series(1, 3)
> - SubPlan 1
> + SubPlan array_1
>
> Why isn't it now "ARRAY(SubPlan array_1)"? The previous notation was
> chosen to be not-confusable with a plain function call, but you've
> lost that. Likewise for cases like
> - Group Key: (InitPlan 1).col1
> + Group Key: (minmax_1).col1
> which now looks exactly like an ordinary Var.
I don't think there's anything keeping me from making that print
InitPlan/SubPlan there. I just thought it looked a little verbose that
way. I thought that the "InitPlan 1" notation was probably due to the
fact that "1" is not a proper name, rather than (as you suggest here)
to avoid confusion with other syntax. Compare CTEs, which are also
subplans, but for which we simply print the CTE's alias name, rather
than "CTE alias_name".
> nitpick: sublinktype_to_string() should return const char *.
This bleeds into a bunch of other things. I'll poke at it and try to
figure something out. I may need help from someone with superior const
skills, but I'll give it a go.
> 0005: I'm even less convinced about there being a need for this.
> It's not that hard to see which RTIs are in which subplan,
> especially after the other changes in this patchset.
Let me just say that everything in this patch set is the result of
experimentation -- trying to write code that attempts to interpret a
Plan tree and discovering problems doing so along the way. I'm not
altogether convinced that patches 0005-0008 attack the problems in the
best possible way and I'm happy to hear suggestions for how to do it
better, but I'm pretty confident that they're all trying to tackle
real problems. The way to think about 0004 and 0005, IMHO, is this:
suppose we look at the final Plan, and we see that RTI 17 was planned
in such-and-such a way. If we want to reproduce that planning decision
in the next cycle -- or change that planning decision in the next
planning cycle -- we need to be able to figure out which subroot RTI
17 came from and what RTI it had within that subroot. With 0004 and
0005 applied, you can figure out that RTI 17 was originally RTI 5
within a subroot that was called expr_1, or whatever, and when we
replan the same query we will assign the name expr_1 to the same
subroot before we begin planning it and RTI 5 within that subroot will
refer to the same thing it did before. This assumes, of course, that
the query is the same and that inlining decisions and so forth haven't
changed and that no objects have been swapped out for objects with the
same name; but the point is that even if none of that has happened we
can't match things up in an automated way without this infrastructure,
or at least I don't know how to do it. If you do, I'm all ears. I
would much rather minimize the amount of information that we have to
store in the Plan tree or PlannedStmt and just have code to do the
necessary interpretation based on the data that's already available,
but I could not see how to make it work.
> 0006: I agree with the need for this, but the details seem messy,
> and it's not very clear that the proposed data structure would
> be convenient to use. Do we really need to rely on plan_node_id?
> Why haven't you integrated record_elided_node into
> clean_up_removed_plan_level?
>
> An idea maybe worth thinking about is that instead of completely
> eliding the plan node, we could replace it with a "no-op" plan node
> that both EXPLAIN and the executor will look right through.
> That node could carry the relid(s) that we lost. Not sure how
> messy this'd be to integrate, though.
There's certainly nothing that requires us to use this particular
scheme for identifying the locations where certain plan nodes were
elided. For example, we could put an additional List * in each plan
node and stash a list of elided nodes there rather than indexing by
plan_node_id. The disadvantage of that is simply that it increases the
size of the Plan struct for every node, and most nodes will end up
with an empty list. Indexing by plan_node_id was just my way of
getting the information that I wanted to store out of the tree that
the executor examines. We could potentially also do as you say here
and keep the Plan nodes around and then somehow arrange to ignore them
at explain and execution time, but I don't have a good idea of how we
would do that without adding overhead. In terms of the usability of
the data structure, experimentation has shown it to be serviceable.
There is obviously a risk that with many elided nodes, having to
grovel through the list of nodes looking for a certain plan ID over
and over again could be inefficient -- but large numbers of elided
nodes don't seem all that likely, so maybe it's OK; and I suppose any
code that uses this could always build a hash table if warranted. But
I'm not saying it's a perfect solution.
In terms of why record_elided_node is not integrated into
clean_up_removed_plan_level, that was just a stylistic choice.
clean_up_removed_plan_level() could instead call record_elided_node(),
or record_elided_node() could cease to exist and the logic be inlined
into clean_up_removed_plan_level().
> 0007: not sure about this either. Why not simply add those
> relid(s) to the surviving node's apprelids? Again, I don't
> love the amount of code and data structure that's being added
> for a hypothetical use-case. It's not even clear that this
> form of the data structure would be helpful to anyone.
I'd say that is the patch I'm least sure about. Note that my goal for
this commitfest was to get 0001-0004 committed, partly because I
wasn't too sure whether the later patches might need some adjustment.
My intuition is that flattening the relid sets together loses
important information, but I don't have a test case proving that near
at hand, so maybe it doesn't. However, I'm fairly certain that it is
at least a lot more convenient to have the information in this form.
In general, if we see an Append or MergeAppend node generated from a
RelOptInfo with a single RTI, that's a partition-wise scan of a
partitioned relation, and if we see an Append or MergeAppend node
generated from a RelOptInfo with more than one RTI, that's a
partition-wise join of all those relations. So, very naively, if we
just combine the relid sets for a bunch of partitionwise scans into a
single RTI set, it looks like we've got a partitionwise join, but we
don't. Now, if it's only set operations that result in multiple levels
of Append/MergeAppend nodes getting collapsed, it might be possible to
disentangle what actually happened by partitioning the final set of
RTIs by the subroot from which they originate. I'm not sure that's how
it works, though. At any rate, I agree with you that this is not
adequately motivated at present. Having said that, at ten thousand
feet, the motivation here is to be able to figure out from the final
plan tree where exactly we switched to partitionwise operation --
whether that was done for some joinrel or only when we got down to the
underlying baserel.
--
Robert Haas
EDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2025-09-23 15:34:56 | Re: Fix overflow of nbatch |
Previous Message | Srirama Kucherlapati | 2025-09-23 15:28:58 | RE: AIX support |