Re: Implement missing join selectivity estimation for range types

From: Mahmoud Sakr <mahmoud(dot)sakr(at)ulb(dot)be>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SCHOEMANS Maxime <Maxime(dot)Schoemans(at)ulb(dot)be>, Diogo Repas <diogo(dot)repas(at)gmail(dot)com>, Luo Zhicheng <zhicheng(dot)luo(at)ulb(dot)be>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Implement missing join selectivity estimation for range types
Date: 2023-01-18 19:23:20
Message-ID: CAB4o4aud47V_iRyWtA8+ZAmdXDjCF165R73AeCjx2RL0nzQzHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tomas,
Thanks for picking up the patch and for the interesting discussions that
you bring !

> Interesting. Are there any particular differences compared to how we
> estimate for example range clauses on regular columns?
The theory is the same for scalar types. Yet, the statistics that are currently
collected for scalar types include other synopsis than the histogram, such
as MCV, which should be incorporated in the estimation. The theory for using
the additional statistics is ready in the paper, but we didn't yet implement it.
We thought of sharing the ready part, till the time allows us to implement the
rest, or other developers continue it.

> Right. I think 0.5% is roughly expected for the default statistics
> target, which creates 100 histogram bins, each representing ~1% of the
> values. Which on average means ~0.5% error.
Since this patch deals with join selectivity, we are then crossing 100*100 bins.
The ~0.5% error estimation comes from our experiments, rather than a
mathematical analysis.

> independence means we take (P1*P2). But maybe we should be very
> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>
> Another option is to use the length histogram, right? I mean, we know
> what the average length is, and it should be possible to use that to
> calculate how "far" ranges in a histogram can overlap.
The independence assumption exists if we use the lower and upper
histograms. It equally exists if we use the lower and length histograms.
In both cases, the link between the two histograms is lost during their
construction.
You discussion brings an interesting trade-off of optimistic v.s. pessimistic
estimations. A typical way to deal with such a trade-off is to average the
two, for example is model validation in machine learning, Do you think we
should implement something like
average( (P1*P2), Min(P1,P2) )?

> OK. But ideally we'd cross-check elements of the two multiranges, no?

> So if the column(s) contain a couple very common (multi)ranges that make
> it into an MCV, we'll ignore those? That's a bit unfortunate, because
> those MCV elements are potentially the main contributors to selectivity.
Both ideas would require collecting more detailed statistics, for
example similar
to arrays. In this patch, we restricted ourselves to the existing statistics.

> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
Right, the comment should elaborate more that the collected statistics are
currently that same as rangetypes but may potentially deviate.

> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
Indeed, it is possible that the two functions will deviate if that statistics
of multirange types will be refined.

--
Best regards
Mahmoud SAKR

On Wed, Jan 18, 2023 at 7:07 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
> a proper comment, not just "this is a copy from rangetypes".
>
> However, it seems the two functions are exactly the same. Would the
> functions diverge in the future? If not, maybe there should be just a
> single shared function?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2023-01-18 19:26:10 Re: document the need to analyze partitioned tables
Previous Message Tomas Vondra 2023-01-18 19:09:21 Re: Parallelize correlated subqueries that execute within each worker