Re: Implement missing join selectivity estimation for range types

From: Mahmoud Sakr <mahmoud(dot)sakr(at)ulb(dot)be>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SCHOEMANS Maxime <Maxime(dot)Schoemans(at)ulb(dot)be>, Diogo Repas <diogo(dot)repas(at)gmail(dot)com>, Luo Zhicheng <zhicheng(dot)luo(at)ulb(dot)be>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Implement missing join selectivity estimation for range types
Date: 2023-01-25 19:11:33
Message-ID: CAB4o4asvPN=NT7KvS9zVQjZbdsiRW5t8aQctEkW7mxc4hbBxVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tomas,

> I finally had time to properly read the paper today - the general
> approach mostly matches how I imagined the estimation would work for
> inequalities, but it's definitely nice to see the algorithm properly
> formalized and analyzed.

Awesome, thanks for this interest!

> What seems a bit strange to me is that the patch only deals with range
> types, leaving the scalar cases unchanged. I understand why (not having
> a MCV simplifies it a lot), but I'd bet joins on range types are waaaay
> less common than inequality joins on scalar types. I don't even remember
> seeing inequality join on a range column, TBH.
>
> That doesn't mean the patch is wrong, of course. But I'd expect users to
> be surprised we handle range types better than "old" scalar types (which
> range types build on, in some sense).
>
> Did you have any plans to work on improving estimates for the scalar
> case too? Or did you do the patch needed for the paper, and have no
> plans to continue working on this?

I fully agree. Scalars are way more important.
I join you in the call for Diogo and Zhicheng to continue this implementation,
specially after the interest you show towards this patch. The current patch
was a course project (taught by me and Maxime), which was specific to range
types. But indeed the solution generalizes well to scalars. Hopefully after the
current exam session (Feb), there will be time to continue the implementation.
Nevertheless, it makes sense to do two separate patches: this one, and
the scalars. The code of the two patches is located in different files, and
the estimation algorithms have slight differences.

> I'm also wondering about not having MCV for ranges. I was a bit
> surprised we don't build MCV in compute_range_stats(), and perhaps we
> should start building those - if there are common ranges, this might
> significantly improve some of the estimates (just like for scalar
> columns). Which would mean the estimates for range types are just as
> complex as for scalars. Of course, we don't do that now.

Good question. Our intuition is that MCV will not be useful for ranges.
Maxime has done an experiment and confirmed this intuition. Here is his
experiment and explanation:
Create a table with 126000 int4ranges.
All ranges have their middle between 0 and 1000 and a length between 90
and 110.
The ranges are created in the following way:
- 10 different ranges, each duplicated 1000 times
- 20 different ranges, each duplicated 500 times
- 40 different ranges, each duplicated 100 times
- 200 different ranges, each duplicated 10 times
- 100000 different ranges, not duplicated
Two such tables (t1 and t2) were created in the same way but with different
data. Then he ran the following query:

EXPLAIN ANALYZE SELECT count(*)
FROM t, t2 WHERE t && t2

The results (using our patch) where the following:
Plan rows: 2991415662
Actual rows: 2981335423

So the estimation accuracy is still fairly good for such data with a lot
of repeated values, even without having MCV statistics.
The only error that can happen in our algorithms comes from the last bin in
which we assume uniform distribution of values. Duplicate values
(end bound, not ranges) might make this assumption be wrong, which would
create inaccurate estimations. However, this is still only incorrect
for the last
bin and all the others are correct.
MCV's are mainly useful for equality, which is not an operation we cover, and
probably not an important predicate for ranges. What do you think?

Best regards,
Mahmoud

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Chudnovsky 2023-01-25 19:14:18 Re: Proposal: Support custom authentication methods using hooks
Previous Message Peter Geoghegan 2023-01-25 19:08:04 Re: [PATCH] Make ON CONFLICT DO NOTHING and ON CONFLICT DO UPDATE consistent