Re: BUG #16759: Estimation of the planner is wrong for hash join

From: Bertrand Guillaumin <bertrand(dot)guillaumin(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16759: Estimation of the planner is wrong for hash join
Date: 2020-12-16 15:27:19
Message-ID: CAC-tRewjTRf7y0iBjWK7MgM5bsgB0qCJuFXBLAKE73+dCPT=iA@mail.gmail.com

CC'ing the bug mailing list.

In short, the number of distinct values used in the calculation should not be
the one stored in the statistics; it should be derived from the estimated
size of the sample when constant filter values are applied.
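
A compact way to state that, as a psql expression (this is just my shorthand
for the formula worked out in the quoted mail below, with nr = table rows,
nd = distinct values from the statistics, and s = estimated rows surviving
the constant filter):

SELECT nd * (1 - power(1 - s::numeric/nr, nr::numeric/nd)) AS adjusted_nd
FROM (VALUES (1000, 1000, 1)) AS stats(nr, nd, s);  -- returns 1 for this bug's case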

On Wed, Dec 16, 2020 at 16:18, Bertrand Guillaumin <
bertrand(dot)guillaumin(at)gmail(dot)com> wrote:

> Hello, I've done some research on how this case is handled in Oracle and
> found the following document:
>
> https://www.doag.org/formes/pubfiles/6315126/2014-DB-Jonathan_Lewis-Calculating_Join_Selectivity-Manuskript.pdf
>
> The issue is the value used for nd2: we know that the number of distinct
> values for b.id is going to be 1 (because of the filter on attrib), not
> 1000.
>
> In the document, they explain the following calculation of the number of
> distinct values for a column of a table involved in a join.
>
> In a query like this:
> select * from t1, t2
> where t1.mod_300 = t2.mod_200
> and t1.date_1000 = <constant>
>
>
> the calculation works like this:
> t1 has 1,000,000 rows (number of rows); call this nr
> mod_300 has 300 distinct values (number of distinct values); call this nd
> date_1000 = <constant> returns a sample of 1,000 rows; call this s
> The expected number of distinct values for mod_300 in the sample will be:
> nd * (1 - power(1 - s/nr, nr/nd)). In our case, 300 * (1 - power(1 -
> 1000/1000000, 1000000/300)) = 289.3156
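>
> (Just to make that arithmetic easy to re-check, here is the same formula
> as a plain psql expression; this snippet is mine, not something from the
> paper:)
>
> SELECT 300 * (1 - power(1 - 1000.0/1000000, 1000000/300.0)) AS expected_nd;
> -- returns roughly 289.31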
>
> If we apply the same reasoning to the query I posted:
> SELECT * FROM T1,T2
> where T1.id = T2.parent_id
> and T1.attrib='BEL'
> we get:
> nr = 1000
> nd = 1000
> s = 1
> so the expected number of distinct values is
> 1000 * (1 - (1 - 1/1000)^(1000/1000)) = 1000 * (1 - 0.999) = 1
> With nd2 corrected like this the result should be better: the estimated
> row count becomes 1*1000*(1/18), about 56, which is far closer to reality
> than before.
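>
> (Again spelling the arithmetic out as runnable psql expressions; the 1/18
> uses the 18 distinct values of parent_id mentioned in the quoted analysis
> below, the rest comes from the numbers above:)
>
> SELECT 1000 * (1 - power(1 - 1.0/1000, 1000/1000.0)) AS corrected_nd2;  -- 1
> SELECT 1 * 1000 * (1.0/18) AS estimated_rows;   -- about 55.6, i.e. 56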
>
> Hope this can help,
> best regards,
>
>
>
> On Wed, Dec 2, 2020 at 23:33, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> [ please keep the list cc'd ]
>>
>> Bertrand Guillaumin <bertrand(dot)guillaumin(at)gmail(dot)com> writes:
>> > I've managed to reproduce the bug, but only in the join case, not with
>> > the IN subquery (it uses a hash semi join, which works), with the
>> > following test case:
>> > create table work.test_bug_hash as
>> >   select id,
>> >          case when id <= 600 then 100 when id <= 800 then 200
>> >               when id <= 875 then 300 when id <= 950 then 400
>> >               when id <= 960 then 500 when id <= 970 then 600
>> >               when id <= 980 then 700 when id <= 990 then 800
>> >               when id = 991 then 900 when id = 992 then 910
>> >               when id = 993 then 920 when id = 994 then 930
>> >               when id = 995 then 940 when id = 996 then 950
>> >               when id = 997 then 960 when id = 998 then 970
>> >               when id = 999 then 980 else 990
>> >          end as parent_id,
>> >          null as attrib
>> >   from (select generate_series(1,1000) as id) alias0;
>>
>> > update work.test_bug_hash set attrib='BEL' where id=300;
>>
>> > analyze work.test_bug_hash;
>>
>> > explain select * from work.test_bug_hash a, work.test_bug_hash b where
>> > a.parent_id=b.id and b.attrib='BEL';
>>
>> Hmm. Trying this on HEAD, the join and IN forms both estimate rows=1,
>> which is no doubt because recent versions of eqjoinsel() clamp the
>> semijoin selectivity estimate to be not more than the plain join
>> selectivity estimate ... which is logically correct, but in this case
>> it replaces a somewhat-okay estimate with a not-very-good one.
>>
>> Anyway, the estimate you're getting from the "a = (select ...)" form
>> is the responsibility of var_eq_non_const, which has no idea what value
>> might come out of the sub-select, so it falls back to this logic:
>>
>> /*
>>  * Search is for a value that we do not know a priori, but we will
>>  * assume it is not NULL.  Estimate the selectivity as non-null
>>  * fraction divided by number of distinct values, so that we get a
>>  * result averaged over all possible values whether common or
>>  * uncommon.  (Essentially, we are assuming that the not-yet-known
>>  * comparison value is equally likely to be any of the possible
>>  * values, regardless of their frequency in the table.  Is that a good
>>  * idea?)
>>  */
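>>
>> (A back-of-the-envelope illustration of that rule of thumb for the
>> "a = (select ...)" form -- assuming nullfrac = 0 and n_distinct = 18 for
>> a.parent_id; just a sketch, not actual planner code:)
>>
>> SELECT 1000 * ((1 - 0.0) / 18) AS estimated_rows;   -- about 56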
>>
>> Meanwhile, in the join or semijoin cases, the issue is that
>> eqjoinsel_inner has an MCV list for the parent_id side, but not
>> for the id side (because the latter is unique so it has no MCVs).
>> So it falls back on this logic:
>>
>> /*
>>  * We do not have MCV lists for both sides.  Estimate the join
>>  * selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2).  This
>>  * is plausible if we assume that the join operator is strict and the
>>  * non-null values are about equally distributed: a given non-null
>>  * tuple of rel1 will join to either zero or N2*(1-nullfrac2)/nd2 rows
>>  * of rel2, so total join rows are at most
>>  * N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join selectivity of
>>  * not more than (1-nullfrac1)*(1-nullfrac2)/nd2.  By the same logic it
>>  * is not more than (1-nullfrac1)*(1-nullfrac2)/nd1, so the expression
>>  * with MIN() is an upper bound.  Using the MIN() means we estimate
>>  * from the point of view of the relation with smaller nd (since the
>>  * larger nd is determining the MIN).  It is reasonable to assume that
>>  * most tuples in this rel will have join partners, so the bound is
>>  * probably reasonably tight and should be taken as-is.
>>  *
>>  * XXX Can we be smarter if we have an MCV list for just one side?  It
>>  * seems that if we assume equal distribution for the other side, we
>>  * end up with the same answer anyway.
>>  */
>>
>> In the case at hand, with nd1=18, nd2=1000, we'll come out with a
>> selectivity of 1/1000 which results in nrows = 1.
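>>
>> (The same arithmetic as psql expressions, with the 1-row filtered side
>> and the 1000-row side from the test case -- just an illustrative sketch:)
>>
>> SELECT least(1.0/18, 1.0/1000) AS join_selectivity;           -- 0.001
>> SELECT 1 * 1000 * least(1.0/18, 1.0/1000) AS estimated_rows;  -- 1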
>>
>> Maybe it'd be better to do something else here, but I'm not sure what.
>> All of these stats-free estimates are just rules of thumb and sometimes
>> go wrong. Still, the case of one side of the join being unique and
>> the other not has to be pretty common, so it'd be nice to make it better.
>>
>> regards, tom lane
>>
>
