Re: make add_paths_to_append_rel aware of startup cost

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zhang Mingli <zmlpostgres(at)gmail(dot)com>
Subject: Re: make add_paths_to_append_rel aware of startup cost
Date: 2023-09-17 13:42:05
Message-ID: CAKU4AWoksFSs_6K5ZCzz68Oq+3uZb5g6sb8-dXJ6cChgYQm-yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David,
Thanks for taking a look at this!

On Fri, Sep 15, 2023 at 3:15 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Thu, 7 Sept 2023 at 04:37, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > Currently add_paths_to_append_rel overlooked the startup cost for
> creating
> > append path, so it may have lost some optimization chances. After a
> glance,
> > the following 4 identifiers can be impacted.
>
> > - We shouldn't do the optimization if there are still more tables to
> join,
> > the reason is similar to has_multiple_baserels(root) in
> > set_subquery_pathlist. But the trouble here is we may inherit multiple
> > levels to build an appendrel, so I have to keep the 'top_relids' all
> the
> > time and compare it with PlannerInfo.all_baserels. If they are the
> same,
> > then it is the case we want to optimize.
>
> I think you've likely gone to the trouble of trying to determine if
> there are any joins pending because you're considering using a cheap
> startup path *instead* of the cheapest total path and you don't want
> to do that when some join will cause all the rows to be read thus
> making the plan more expensive if a cheap startup path was picked.
>

Yes, that's true. However it is not something we can't resolve, one
of the solutions is just like what I did in the patch. but currently the
main stuff which confuses me is if it is the right thing to give up the
optimization if it has more tables to join (just like set_subquery_pathlist
did).

> Instead of doing that, why don't you just create a completely new
> AppendPath containing all the cheapest_startup_paths and add that to
> the append rel. You can optimise this and only do it when
> rel->consider_startup is true.
>
> Does the attached do anything less than what your patch does?
>

We can work like this, but there is one difference from what
my current patch does. It is cheapest_startup_path vs cheapest
fraction path. For example if we have the following 3 paths with
all of the estimated rows is 100 and the tuple_fraction is 10.

Path 1: startup_cost = 60, total_cost = 80 -- cheapest total cost.
Path 2: startup_cost = 10, total_cost = 1000 -- cheapest startup cost
Path 3: startup_cost = 20, total_cost = 90 -- cheapest fractional cost

So with the patch you propose, Path 1 & Path 3 are chosen to build
append path. but with my current patch, Only path 3 is kept. IIUC,
path 3 should be the best one in this case.

We might also argue why Path 3 is kept in the first place (the children
level), I think pathkey might be one option. and even path 3 is
discarded somehow, I think only if it is the best one, we should
keep it ideally.

Another tiny factor of this is your propose isn't consistent with
what set_subquery_pathlist which uses cheapest fractional cost
and my proposal isn't consistent plain rel which uses cheapest
startup cost. We can't say which one is better, though.

If my above analysis is correct, I think the best way to handle this
is if there is no more tables to join, we use cheapest fraction cost
for all the kinds of relations, including plain relation, append rel,
subquery and so on. If we have more tables to join, we use
cheapest startup cost. On the implementation side, I want to use
RelOptInfo.tuple_fraction instead of RelOptInfo.consider_startup.
tuple_fraction = -1 means startup cost should not be considered.
tuple_fraction = 0 means cheapest startup cost should be used.
tuple_franction > 0 means cheapest fraction cost should be used.

I still don't pay enough attention to consider_param_startup in
RelOptInfo, I'm feeling the above strategy will not generate
too much overhead to the planner for now while it can provides
a better plan sometimes.

--
Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-09-17 16:20:50 Re: JSON Path and GIN Questions
Previous Message Yogesh Sharma 2023-09-17 12:36:06 Re: Have better wording for snapshot file reading failure