From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mahmoud Sakr <mahmoud(dot)sakr(at)ulb(dot)be>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SCHOEMANS Maxime <Maxime(dot)Schoemans(at)ulb(dot)be>, Diogo Repas <diogo(dot)repas(at)gmail(dot)com>, Luo Zhicheng <zhicheng(dot)luo(at)ulb(dot)be>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Implement missing join selectivity estimation for range types
Date: 2023-01-20 14:25:11
Message-ID: 84ffb566-8038-ab35-c841-7a5e5728a247@enterprisedb.com
Lists: pgsql-hackers

On 1/18/23 20:23, Mahmoud Sakr wrote:
> Hi Tomas,
> Thanks for picking up the patch, and for the interesting discussion
> you bring!
>
>> Interesting. Are there any particular differences compared to how we
>> estimate for example range clauses on regular columns?
>
> The theory is the same for scalar types. Yet, the statistics currently
> collected for scalar types include other synopses besides the histogram,
> such as the MCV list, which should be incorporated into the estimation.
> The theory for using the additional statistics is ready in the paper, but
> we didn't implement it yet. We thought of sharing the part that is ready,
> until time allows us to implement the rest, or other developers continue it.
>

I see. We don't have MCV stats for range types, so the algorithms don't
include them. But we do have them for scalars, so the code would need to
be modified to consider them too.

However, I wonder how much that could improve the estimates for range
queries on scalar types. We already get pretty good estimates for those,
so I guess there isn't much to gain.
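
FWIW here's a toy sketch of roughly what the scalar code does for a
clause like "x < c" - the MCVs contribute their exact frequencies, and
the histogram (which covers only the non-MCV rows) is scaled by the
remaining fraction. All numbers are made up, and this is not the actual
scalarineqsel code:

    #include <stdio.h>

    /* fraction of MCV rows with value < c; each MCV contributes its
     * exact frequency */
    static double
    mcv_fraction_below(const double *vals, const double *freqs, int n,
                       double c)
    {
        double  sel = 0.0;

        for (int i = 0; i < n; i++)
            if (vals[i] < c)
                sel += freqs[i];
        return sel;
    }

    /* equi-depth histogram: each of the (nbounds-1) bins holds the same
     * fraction of rows; interpolate linearly within the bin containing c */
    static double
    hist_fraction_below(const double *bounds, int nbounds, double c)
    {
        int     nbins = nbounds - 1;

        if (c <= bounds[0])
            return 0.0;
        if (c >= bounds[nbounds - 1])
            return 1.0;
        for (int i = 0; i < nbins; i++)
            if (c < bounds[i + 1])
                return (i + (c - bounds[i]) /
                        (bounds[i + 1] - bounds[i])) / nbins;
        return 1.0;
    }

    int
    main(void)
    {
        double  mcv_vals[] = {10, 20, 30};
        double  mcv_freqs[] = {0.10, 0.05, 0.05};   /* MCVs cover 20% of rows */
        double  bounds[] = {0, 25, 50, 75, 100};    /* histogram of the rest */
        double  c = 40.0;
        double  sel;

        sel = mcv_fraction_below(mcv_vals, mcv_freqs, 3, c)
            + (1.0 - 0.20) * hist_fraction_below(bounds, 5, c);

        printf("estimated selectivity of x < %.0f: %.3f\n", c, sel);
        return 0;
    }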

>> Right. I think 0.5% is roughly expected for the default statistics
>> target, which creates 100 histogram bins, each representing ~1% of the
>> values. Which on average means ~0.5% error.
> Since this patch deals with join selectivity, we are then crossing
> 100*100 bins. The ~0.5% error estimate comes from our experiments,
> rather than from a mathematical analysis.
>

Ah, understood. Even for joins there's probably a fairly close
relationship between the bin size and estimation error, but it's
certainly more complex.
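
To spell out what the crossing computes, take "<<" (strictly left of):
a << b means upper(a) < lower(b), so we can cross the upper-bound
histogram of one side with the lower-bound histogram of the other,
assuming uniformity within each bin. A toy sketch (not the patch's
code, and ignoring empty ranges and NULLs):

    #include <stdio.h>

    /* P(X < Y) for X ~ Uniform[a1,a2], Y ~ Uniform[b1,b2] */
    static double
    prob_x_less_y(double a1, double a2, double b1, double b2)
    {
        double  lo,
                hi,
                p = 0.0;

        if (a2 <= b1)
            return 1.0;
        if (b2 <= a1)
            return 0.0;
        lo = (a1 > b1) ? a1 : b1;
        hi = (a2 < b2) ? a2 : b2;
        /* part of Y's range entirely above X's range: X < Y for sure */
        if (b2 > a2)
            p += (b2 - a2) / (b2 - b1);
        /* overlapping part: average of X's linear CDF over [lo, hi] */
        if (hi > lo)
            p += ((hi - lo) / (b2 - b1)) *
                (((lo - a1) + (hi - a1)) / (2.0 * (a2 - a1)));
        return p;
    }

    /* cross every bin pair of two equi-depth histograms, each bin equally
     * likely - with the default statistics target that's 100 x 100 pairs */
    static double
    join_sel_strictly_left(const double *h1, int n1, const double *h2, int n2)
    {
        double  sum = 0.0;

        for (int i = 0; i < n1 - 1; i++)
            for (int j = 0; j < n2 - 1; j++)
                sum += prob_x_less_y(h1[i], h1[i + 1], h2[j], h2[j + 1]);
        return sum / ((n1 - 1) * (n2 - 1));
    }

    int
    main(void)
    {
        double  upper_a[] = {0, 10, 20, 30, 40};    /* upper bounds of a */
        double  lower_b[] = {15, 25, 35, 45, 55};   /* lower bounds of b */

        printf("P(a << b) ~ %.3f\n",
               join_sel_strictly_left(upper_a, 5, lower_b, 5));
        return 0;
    }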

BTW the experiments are those described in section 6 of the paper,
correct? I wonder how uniform (or skewed) the data was - in terms of
range length, etc. Or how it works for other operators, not just for
"<<" as in the paper.

>> independence means we take (P1*P2). But maybe we should be very
>> pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?
>>
>> Another option is to use the length histogram, right? I mean, we know
>> what the average length is, and it should be possible to use that to
>> calculate how "far" ranges in a histogram can overlap.
> The independence assumption exists if we use the lower and upper
> histograms. It equally exists if we use the lower and length histograms.
> In both cases, the link between the two histograms is lost during their
> construction.
> Your discussion brings up an interesting trade-off of optimistic vs.
> pessimistic estimation. A typical way to deal with such a trade-off is
> to average the two, as is done for example in model validation in
> machine learning. Do you think we should implement something like
> average( (P1*P2), Min(P1,P2) )?
>

I don't know.

AFAICS the independence assumption is used not only because it's very
cheap/simple to implement, but also because it actually is a reasonable
assumption for a fair number of data sets (particularly in OLTP).

You're right that it's an optimistic estimate, but for many data sets it
holds up quite well.

I'm not sure that applies to range boundaries - the upper/lower bounds
seem pretty strongly correlated. So maybe using a more pessimistic
formula would be appropriate.
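
The spread between the two can be quite large, though. With P1 = P2 = 0.1,
independence gives 0.01 while Min gives 0.1, i.e. a 10x difference, and
the proposed average lands at 0.055:

    #include <stdio.h>

    /* toy comparison of combinators for the two bound probabilities */
    static double comb_indep(double p1, double p2) { return p1 * p2; }
    static double comb_min(double p1, double p2) { return (p1 < p2) ? p1 : p2; }

    int
    main(void)
    {
        double  p1 = 0.1,
                p2 = 0.1;

        printf("indep = %.3f, min = %.3f, avg = %.3f\n",
               comb_indep(p1, p2), comb_min(p1, p2),
               0.5 * (comb_indep(p1, p2) + comb_min(p1, p2)));
        return 0;
    }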

I was thinking the length histogram might allow an alternative approach,
because it says what fraction of ranges has what length. So for a "fixed"
lower boundary, we may check each of those fractions. Of course, this
assumes a consistent range length distribution (so if ranges at one end
are much longer, that won't work).
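
Roughly something like this (invented code, assuming the length histogram
is reduced to a few representative lengths with fractions, and that length
is independent of the lower bound): a range [L, L+len] overlaps a query
range [ql, qu] iff L < qu and L + len > ql, i.e. L falls in (ql - len, qu),
so we can weight the lower-bound histogram fraction by the length fractions:

    #include <stdio.h>

    /* equi-depth histogram of lower bounds; linear interpolation in bins */
    static double
    lower_cdf(const double *h, int n, double x)
    {
        int     nbins = n - 1;

        if (x <= h[0])
            return 0.0;
        if (x >= h[n - 1])
            return 1.0;
        for (int i = 0; i < nbins; i++)
            if (x < h[i + 1])
                return (i + (x - h[i]) / (h[i + 1] - h[i])) / nbins;
        return 1.0;
    }

    int
    main(void)
    {
        double  lower[] = {0, 25, 50, 75, 100};   /* lower-bound histogram */
        double  len[] = {5, 20, 60};              /* representative lengths */
        double  frac[] = {0.5, 0.3, 0.2};         /* fraction per length */
        double  ql = 40,
                qu = 60;                          /* query range */
        double  sel = 0.0;

        /* P(range && [ql,qu]) = sum over lengths of P(L in (ql - len, qu)) */
        for (int k = 0; k < 3; k++)
            sel += frac[k] * (lower_cdf(lower, 5, qu) -
                              lower_cdf(lower, 5, ql - len[k]));

        printf("P(range && [%.0f,%.0f]) ~ %.3f\n", ql, qu, sel);
        return 0;
    }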

>> OK. But ideally we'd cross-check elements of the two multiranges, no?
>
>> So if the column(s) contain a couple very common (multi)ranges that make
>> it into an MCV, we'll ignore those? That's a bit unfortunate, because
>> those MCV elements are potentially the main contributors to selectivity.
> Both ideas would require collecting more detailed statistics, for
> example similar to what is done for arrays. In this patch, we
> restricted ourselves to the existing statistics.
>

Ah, I didn't realize we don't actually build MCV for range types. In
that case the current behavior makes perfect sense.

>
>> Also, calc_hist_selectivity_contains in multirangetypes_selfuncs.c needs
>> a proper comment, not just "this is a copy from rangetypes".
> Right, the comment should elaborate that the collected statistics are
> currently the same as for rangetypes, but may potentially deviate.
>
>> However, it seems the two functions are exactly the same. Would the
>> functions diverge in the future? If not, maybe there should be just a
>> single shared function?
> Indeed, it is possible that the two functions will diverge if the
> statistics for multirange types are refined.
>

Right, but are there any such plans? Also, what's the likelihood we'll
add new statistics to only one of the places (e.g. for multiranges but
not plain ranges)?

I'd keep a single function until we actually need two. That's also
easier for maintenance - with two it's easy to fix a bug in one place
and forget about the other, etc.
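
I mean something like this pattern (invented names, just to illustrate) -
one static worker, two thin wrappers, so a fix in the worker covers both:

    #include <stdio.h>

    /* stand-in for the shared estimation logic: here just the fraction
     * of histogram bounds falling inside [lo, hi] */
    static double
    hist_contained_sel_common(const double *bounds, int n,
                              double lo, double hi)
    {
        int     count = 0;

        for (int i = 0; i < n; i++)
            if (bounds[i] >= lo && bounds[i] <= hi)
                count++;
        return (double) count / n;
    }

    /* thin wrappers - split them only if multirange stats ever diverge */
    double
    range_contained_sel(const double *b, int n, double lo, double hi)
    {
        return hist_contained_sel_common(b, n, lo, hi);
    }

    double
    multirange_contained_sel(const double *b, int n, double lo, double hi)
    {
        return hist_contained_sel_common(b, n, lo, hi);
    }

    int
    main(void)
    {
        double  b[] = {0, 10, 20, 30, 40};

        printf("range: %.2f, multirange: %.2f\n",
               range_contained_sel(b, 5, 5, 35),
               multirange_contained_sel(b, 5, 5, 35));
        return 0;
    }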

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
