Re: Implement missing join selectivity estimation for range types

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Schoemans Maxime <maxime(dot)schoemans(at)ulb(dot)be>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SAKR Mahmoud <mahmoud(dot)sakr(at)ulb(dot)be>, Diogo Repas <diogo(dot)repas(at)gmail(dot)com>, LUO Zhicheng <zhicheng(dot)luo(at)ulb(dot)be>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Implement missing join selectivity estimation for range types
Date: 2024-01-22 08:10:37
Message-ID: CACJufxFhES7untAPmxJDrB-y_pNgah_1HNbXjcOj6g=yxh3kdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I cannot figure out why it aborts.

as Tom mentioned in upthread about the test cases.
similar to src/test/regress/sql/stats_ext.sql check_estimated_rows function.
we can test it by something:

create or replace function check_estimated_rows(text) returns table (ok bool)
language plpgsql as
$$
declare
ln text;
tmp text[];
first_row bool := true;
begin
for ln in
execute format('explain analyze %s', $1)
loop
if first_row then
first_row := false;
tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
return query select 0.2 < tmp[1]::float8 / tmp[2]::float8
and tmp[1]::float8 / tmp[2]::float8 < 5;
end if;
end loop;
end;
$$;

select * from check_estimated_rows($$select * from test_range_join_1,
test_range_join_2 where ir1 && ir2$$);
select * from check_estimated_rows($$select * from test_range_join_1,
test_range_join_2 where ir1 << ir2$$);
select * from check_estimated_rows($$select * from test_range_join_1,
test_range_join_2 where ir1 >> ir2$$);

Do you need 3 tables to do the test? because we need to actually run
the query then compare the estimated row
and actually returned rows.
If you really execute the query with 3 table joins, it will take a lot of time.
So two tables join with where quql should be fine?

/* Fast-forwards i and j to start of iteration */
+ for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+ for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);
+
+ /* Do the estimation on overlapping regions */
+ while (i < nhist1 && j < nhist2)
+ {
+ double cur_sel1,
+ cur_sel2;
+ RangeBound cur_sync;
+
+ if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
+ cur_sync = hist1[i++];
+ else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
+ cur_sync = hist2[j++];
+ else
+ {
+ /* If equal, skip one */
+ cur_sync = hist1[i];
+

this part range_cmp_bound_values "if else if" part computed twice, you
can just do
`
int cmp;
cmp = range_cmp_bound_values(typcache, &hist1[i], &hist2[j]);
if cmp <0 then
else if cmp > 0 then
else then
`

also. I think you can put the following into main while loop.
+ for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+ for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);

split range and multirange into 2 patches might be a good idea.
seems: same function (calc_hist_join_selectivity) with same function
signature in src/backend/utils/adt/multirangetypes_selfuncs.c
and src/backend/utils/adt/rangetypes_selfuncs.c,
previously mail complaints not resolved.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-01-22 08:18:03 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Michael Paquier 2024-01-22 07:52:52 Re: introduce dynamic shared memory registry