Re: Implement missing join selectivity estimation for range types

From: Haibo Yan <tristan(dot)yim(at)gmail(dot)com>
To: jian he <jian(dot)universality(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Schoemans Maxime <maxime(dot)schoemans(at)ulb(dot)be>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SAKR Mahmoud <mahmoud(dot)sakr(at)ulb(dot)be>, Diogo Repas <diogo(dot)repas(at)gmail(dot)com>, LUO Zhicheng <zhicheng(dot)luo(at)ulb(dot)be>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Implement missing join selectivity estimation for range types
Date: 2026-04-06 23:32:02
Message-ID: CABXr29HUT2m7YPpB4gOZHphYmwpwuEWFZ_p6oeyouhWw8Zc1Tw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 6, 2026 at 3:12 PM jian he <jian(dot)universality(at)gmail(dot)com> wrote:

> I cannot figure out why it aborts.
>
> as Tom mentioned in upthread about the test cases.
> similar to src/test/regress/sql/stats_ext.sql check_estimated_rows
> function.
> we can test it by something:
>
> create or replace function check_estimated_rows(text) returns table (ok
> bool)
> language plpgsql as
> $$
> declare
> ln text;
> tmp text[];
> first_row bool := true;
> begin
> for ln in
> execute format('explain analyze %s', $1)
> loop
> if first_row then
> first_row := false;
> tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
> return query select 0.2 < tmp[1]::float8 / tmp[2]::float8
> and tmp[1]::float8 / tmp[2]::float8 < 5;
> end if;
> end loop;
> end;
> $$;
>
> select * from check_estimated_rows($$select * from test_range_join_1,
> test_range_join_2 where ir1 && ir2$$);
> select * from check_estimated_rows($$select * from test_range_join_1,
> test_range_join_2 where ir1 << ir2$$);
> select * from check_estimated_rows($$select * from test_range_join_1,
> test_range_join_2 where ir1 >> ir2$$);
>
> Do you need 3 tables to do the test? because we need to actually run
> the query then compare the estimated row
> and actually returned rows.
> If you really execute the query with 3 table joins, it will take a lot of
> time.
> So two tables join with where quql should be fine?
>
> /* Fast-forwards i and j to start of iteration */
> + for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0;
> i++);
> + for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0;
> j++);
> +
> + /* Do the estimation on overlapping regions */
> + while (i < nhist1 && j < nhist2)
> + {
> + double cur_sel1,
> + cur_sel2;
> + RangeBound cur_sync;
> +
> + if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
> + cur_sync = hist1[i++];
> + else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
> + cur_sync = hist2[j++];
> + else
> + {
> + /* If equal, skip one */
> + cur_sync = hist1[i];
> +
>
> this part range_cmp_bound_values "if else if" part computed twice, you
> can just do
> `
> int cmp;
> cmp = range_cmp_bound_values(typcache, &hist1[i], &hist2[j]);
> if cmp <0 then
> else if cmp > 0 then
> else then
> `
>
> also. I think you can put the following into main while loop.
> + for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0;
> i++);
> + for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0;
> j++);
>
> split range and multirange into 2 patches might be a good idea.
> seems: same function (calc_hist_join_selectivity) with same function
> signature in src/backend/utils/adt/multirangetypes_selfuncs.c
> and src/backend/utils/adt/rangetypes_selfuncs.c,
> previously mail complaints not resolved.
>
>
>
> Hi,all
I'd like to revive the discussion on improving selectivity estimation for
range/range joins.
Attached is the v5 patch which teaches rangejoinsel to use range bound
histograms for estimating the <<, >>, and && operators. Currently, the
planner often falls back to a hardcoded default (like 0.005), which can
lead to poor join ordering in complex queries.
In this version, I have intentionally excluded &< and &>.
a &< b essentially maps to upper(a) <= upper(b).
a &> b essentially maps to lower(a) >= lower(b).
Since these operators include equality (<= / >=) rather than strict
inequality (< / >), their estimation is slightly more nuanced. I believe
focusing on the strict inequality and overlap operators first allows us to
deliver a clean, converged, and significantly beneficial improvement. We
can discuss the best approach for the remaining operators once this
foundation is in place.
Test Results
My local tests show that the planner now correctly identifies cases with
zero or full selectivity, which were previously misestimated.
----------------------------------------------------------------------
CREATE TABLE t1 (id int, r int4range);
CREATE TABLE t2 (id int, r int4range);
INSERT INTO t1 SELECT g, int4range(g * 2 - 1, g * 2) FROM
generate_series(1, 1000) g;
INSERT INTO t2 SELECT g, int4range(10000 + g * 200, 10000 + g * 200 + 100)
FROM generate_series(1, 1000) g;
ANALYZE t1, t2;
----------------------------------------------------------------------

Real Selectivity:
----------------------------------------------------------------------
SELECT
avg((t1.r << t2.r)::int) AS p_ll, -- Expected: 1.0
avg((t1.r >> t2.r)::int) AS p_rr, -- Expected: 0.0
avg((t1.r && t2.r)::int) AS p_ov -- Expected: 0.0
FROM t1, t2;
-- Result: p_ll = 1.0, p_rr = 0.0, p_ov = 0.0
----------------------------------------------------------------------

Planner Improvements:
With the patch, the EXPLAIN output reflects these probabilities accurately:
For << (High Selectivity):
The planner correctly estimates rows=1000000 (1000 * 1000 * 1.0).
----------------------------------------------------------------------
Nested Loop (cost=29.50..15080.75 rows=1000000 width=61)
Join Filter: (t1.r << t2.r)
----------------------------------------------------------------------
For && and >> (Near-Zero Selectivity):
The planner now correctly predicts rows=1 instead of using the default
multiplier.
----------------------------------------------------------------------
Nested Loop (cost=0.00..15036.50 rows=1 width=36)
Join Filter: (t1.r && t2.r)
This improved estimation allows the optimizer to make much better decisions
regarding join order and nesting when range columns are involved.
I look forward to your feedback.
Regards,
Haibo

Attachment Content-Type Size
v5-0001-Improve-range-range-join-selectivity-estimation.patch application/octet-stream 20.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zsolt Parragi 2026-04-06 23:36:27 Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Previous Message Zsolt Parragi 2026-04-06 23:09:33 Re: SLOPE - Planner optimizations on monotonic expressions.