Re: planner chooses incremental but not the best one

From: ywgrit <yw987194828(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: planner chooses incremental but not the best one
Date: 2023-12-22 08:20:43
Message-ID: CAPt2h2YMkSOKUJH0DtPWNOa0ywY3=nVEqbFv0yqruT0YmeqqhA@mail.gmail.com
Lists: pgsql-hackers

One scenario I can come up with a possible solution for so far is where the
query's predicate columns and grouping columns all belong to one table.

For a query that contains a WHERE clause, perhaps num_groups could be
estimated according to the following formula.

num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause
             * ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2, ... sort_col_m)
             / ndistinct(pred_col_1, pred_col_2, ... pred_col_n)

ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause =
ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel.

Here pred_col_1 ... pred_col_n are the columns involved in the WHERE clause,
and sort_col_1 ... sort_col_m are the columns involved in the GROUP BY clause.

The reason for multiplying directly by the selectivity of rel is that the
selectivity of rel depends only on the pred_cols, not on the sort_cols. So
the above formula can be simplified as follows.

num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
sort_col_2, ... sort_col_m) * selectivity of rel.
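
The cancellation behind this simplification can be sketched with a few lines
of Python; all the statistics below are invented numbers, not output from any
real planner.

```python
def num_groups_full(nd_pred: float, nd_pred_sort: float,
                    selectivity: float) -> float:
    # Full formula: ndistinct(pred cols) with WHERE clause
    #   * ndistinct(pred cols, sort cols) / ndistinct(pred cols),
    # where the filtered ndistinct is nd_pred * selectivity.
    nd_pred_filtered = nd_pred * selectivity
    return nd_pred_filtered * nd_pred_sort / nd_pred

def num_groups_simplified(nd_pred_sort: float, selectivity: float) -> float:
    # The ndistinct(pred cols) terms cancel, leaving
    # ndistinct(pred cols, sort cols) * selectivity.
    return nd_pred_sort * selectivity

# Invented statistics: ndistinct(pred) = 10,
# ndistinct(pred, sort) = 40, WHERE-clause selectivity = 0.5.
print(num_groups_full(10, 40, 0.5))    # 20.0
print(num_groups_simplified(40, 0.5))  # 20.0
```

Both forms agree whenever the filtered ndistinct really is
ndistinct(pred cols) * selectivity, which is exactly the uniformity
assumption listed below.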

The correctness of the above formula depends on the following conditions.

- The ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and
  ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1, sort_col_2,
  ... sort_col_m) statistics already exist, and they need to be accurate.

- Both the pred_cols and the sort_cols are uniformly distributed; if not,
  statistics such as MCV lists are needed for correction.

- The tuples of rel are the total number of tuples of the table, not the
  number of filtered tuples.

After experimentation in the scenario mentioned in the previous thread, the
estimated num_groups is 3. The accuracy of the result strongly relies on the
uniform distribution of b, which allows ndistinct(pred_col_1, pred_col_2,
... pred_col_n) with where clause to be estimated accurately.
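
As a toy recomputation of how an estimate of 3 could arise (the statistics
below are invented for illustration, not the actual stats from that thread):

```python
def estimate_num_groups(nd_pred_sort: float, selectivity: float) -> float:
    """Simplified formula:
    ndistinct(pred cols, sort cols) * selectivity of rel."""
    return nd_pred_sort * selectivity

# E.g. 1000 distinct (pred, sort) combinations and a WHERE clause that
# keeps 0.3% of the rows would give an estimate of about 3 groups.
print(round(estimate_num_groups(1000, 0.003)))  # 3
```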

I'd like to hear your opinions.

Regards.

ywgrit.

Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> 于2023年12月18日周一 20:53写道:

>
>
> On 12/18/23 11:40, Richard Guo wrote:
> >
> > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>>
> > wrote:
> >
> > Oh! Now I see what you meant by using the new formula in 84f9a35e3
> > depending on how we sum tuples. I agree that seems like the right
> thing.
> >
> > I'm not sure it'll actually help with the issue, though - if I apply
> the
> > patch, the plan does not actually change (and the cost changes just a
> > little bit).
> >
> > I looked at this only very briefly, but I believe it's due to the
> > assumption of independence I mentioned earlier - we end up using the
> new
> > formula introduced in 84f9a35e3, but it assumes the
> > selectivity and number of groups are independent. But that's not the
> > case here, because the groups are very clearly correlated (with the
> > condition on "b").
> >
> >
> > You're right. The patch allows us to adjust the estimate of distinct
> > values for appendrels using the new formula introduced in 84f9a35e3.
> > But if the restrictions are correlated with the grouping expressions,
> > the new formula does not behave well. That's why the patch does not
> > help in case [1], where 'b' and 'c' are correlated.
> >
> > OTOH, if the restrictions are not correlated with the grouping
> > expressions, the new formula would perform quite well. And in this case
> > the patch would help a lot, as shown in [2] where estimate_num_groups()
> > gives a much more accurate estimate with the help of this patch.
> >
> > So this patch could be useful in certain situations. I'm wondering if
> > we should at least have this patch (if it is right).
> >
>
> I do agree the patch seems to do the right thing, and it's worth pushing
> on its own.
>
> >
> > If that's the case, I'm not sure how to fix this :-(
> >
> >
> > The commit message of 84f9a35e3 says
> >
> > This could possibly be improved upon in the future by identifying
> > correlated restrictions and using a hybrid of the old and new
> > formulae.
> >
> > Maybe this is something we can consider trying. But anyhow this is not
> > an easy task I suppose.
>
> Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)
>
> The challenge is where to get usable information about correlation
> between columns. I only have a couple of very rough ideas of what we might
> try. For example, if we have multi-column ndistinct statistics, we might
> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>
> ndistinct(b,c,d) / ndistinct(b,c)
>
> If we know how many distinct values we have for the predicate column, we
> could then estimate the number of groups. I mean, we know that for the
> restriction "WHERE b = 3" we only have 1 distinct value, so we could
> estimate the number of groups as
>
> 1 * ndistinct(b,c)
>
> I'm well aware this is only a very trivial example, and for more complex
> examples it's likely way more complicated. But hopefully it illustrates
> the general idea.
>
> The other idea would be to maybe look at multi-column MCV, and try using
> it to deduce cross-column correlation too (it could be more accurate for
> arbitrary predicates).
>
> And finally, we might try being a bit more pessimistic and look at what
> the "worst case" behavior could be. So for example instead of trying to
> estimate the real number of groups, we'd ask "What's the minimum number
> of groups we're likely to get?". And use that to cost the incremental
> sort. But I don't think we do this elsewhere, and I'm not sure we want
> to start with this risk-based approach here. It might be OK, because we
> usually expect the incremental sort to be much cheaper, ...
>
> Is this something you would be interested in exploring? I don't have
> capacity to work on this myself, but I'd be available for reviews,
> feedback and so on.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
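
The multi-column-ndistinct idea in the quoted mail can be sketched
numerically. This is only one possible reading of that sketch, with invented
statistics rather than real catalog lookups; the function name and numbers
are hypothetical.

```python
def groups_from_ndistinct_ratio(nd_pred_surviving: float,
                                nd_group_and_pred: float,
                                nd_pred: float) -> float:
    # Scale the combined ndistinct over predicate + grouping columns by
    # the fraction of distinct predicate values that survive the filter,
    # e.g. 1 / ndistinct(b) for an equality predicate like "b = 3".
    return nd_pred_surviving * nd_group_and_pred / nd_pred

# "WHERE b = 3" leaves 1 distinct value of b; with invented statistics
# ndistinct(b, c) = 100 and ndistinct(b) = 10, GROUP BY b, c would be
# estimated at about 10 groups.
print(groups_from_ndistinct_ratio(1, 100, 10))  # 10.0
```

Unlike the independence-based formula, this ratio captures the correlation
between b and c directly, because ndistinct(b, c) is measured rather than
derived from per-column ndistincts.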
