Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Jian Guo <gjian(at)vmware(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Hans Buschmann <buschmann(at)nidsa(dot)net>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Zhenghua Lyu <zlyu(at)vmware(dot)com>
Subject: Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
Date: 2023-11-17 03:38:19
Message-ID: 482957.1700192299@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Richard Guo <guofenglinux(at)gmail(dot)com> writes:
> On Fri, Nov 17, 2023 at 2:16 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So you could argue that there's more to do here, but I'm hesitant
>> to go further. Part of the point of MATERIALIZED is to be an
>> optimization fence, so breaking down that fence is something to be
>> wary of. Maybe we shouldn't even take this patch --- but on
>> balance I think it's an OK compromise.

> Agreed. I think the patch is still valuable on its own, although it
> does not go down into MATERIALIZED case for further optimization.

Right. My earlier response was rather rushed, so let me explain
my thinking a bit more.

When I realized that the discrepancy between MATERIALIZED-and-not
plans was due to the upper planner not seeing the pathkeys for the
CTE scan, my first thought was to try to export those pathkeys.
And my second thought was that the CTE should return multiple
potential paths, much as we do for sub-SELECT-in-FROM subqueries,
with the upper planner eventually choosing one of those paths.
But that second idea would break down the optimization fence
almost completely, because the opinions of the upper planner would
influence which plan we choose for the CTE query. I think we
shouldn't go there, at least not for a CTE explicitly marked
MATERIALIZED. (Maybe if it's not marked MATERIALIZED, but we
chose not to flatten it for some other reason, we could think
about that differently? Not sure.)

I think that when we say that MATERIALIZED is meant as an optimization
fence, what we mostly mean is that the upper query shouldn't influence
the choice of plan for the sub-query. However, we surely allow our
statistics or guesses for the sub-query to subsequently influence what
the upper planner does. If that weren't true, we shouldn't even
expose any non-default rowcount guess to the upper planner --- but
that would lead to really horrid results, so we allow that information
to percolate up from the sub-query. It seems like exposing column
statistics to the upper planner, as the proposed patch does, isn't
fundamentally different from exposing rowcount estimates.

That line of argument also leads to the conclusion that it'd be
okay to expose info about the ordering of the CTE result to the
upper planner. This patch doesn't do that, and I'm not sufficiently
excited about the issue to go write some code. But if someone else
does, I think we shouldn't exclude doing it on the grounds of wanting
to preserve an optimization fence. The fence is sort of one-way
in this line of thinking: information can propagate up to the outer
planner level, but not down into the CTE plan.

Thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2023-11-17 03:45:44 Re: Wrong rows estimations with joins of CTEs slows queries by more than factor 500
Previous Message Richard Guo 2023-11-17 03:25:11 Re: Wrong results with grouping sets