Re: COALESCE patch

From: prankware <esavelievcode(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: COALESCE patch
Date: 2026-06-30 13:48:26
Message-ID: CAF=hKRCX9sf5Occjdft51sWwztW_MyRN-bnVvs05yeCbBLMPNQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

rebase and refactor format for cfbot

вт, 30 июн. 2026 г. в 14:26, prankware <esavelievcode(at)gmail(dot)com>:
>
> Hi everyone,
>
> The planner ignores column statistics when an equality has a COALESCE
> expression on one side. For a clause like COALESCE(a, b) = $1, or a join on
> COALESCE(t1.a, t1.b) = COALESCE(t2.c, t2.d), there are no statistics on the
> COALESCE node itself, so eqsel() and eqjoinsel() return the generic 0.005
> estimate while the per-column stats for a, b, c and d sit unused. The only way
> around this today is an expression index or extended statistics on that exact
> expression, which doesn't scale across many different COALESCE clauses.
> estimate_hash_bucket_stats() has the same gap: a COALESCE hash key gets a
> default ndistinct and therefore a default bucket size. Since these expressions
> are common in joins and filters over nullable or fallback columns, the default
> estimate can be far enough off to flip the join order or join method.
>
> The idea is to estimate straight from the existing per-column stats, with no
> extra statistics object. COALESCE(arg_1, ..., arg_n) returns arg_i only when
> arg_1 .. arg_{i-1} are all NULL, so the chance of reaching branch i is the
> product of stanullfrac over the earlier branches. Selectivity of
> COALESCE(l_1..l_M) = COALESCE(r_1..r_N) is then the sum over branch pairs of
> P(reach l_i) * P(reach r_j) * sel(l_i = r_j), and each sel(l_i = r_j) is a
> recursive call back into eqsel()/eqjoinsel(). A non-COALESCE side is treated as
> a one-branch list, so scalar COALESCE(a, b) = const falls out of the same code,
> and the same decomposition feeds estimate_hash_bucket_stats(). If any branch is
> missing stats, the code bails and today's behavior is unchanged.
>
> A minimal example:
>
> CREATE TABLE t (a int, b int);
> INSERT INTO t
> SELECT CASE WHEN i % 5 < 2 THEN NULL ELSE i END, i
> FROM generate_series(1, 1000) i;
> ANALYZE t;
>
> EXPLAIN SELECT * FROM t WHERE COALESCE(a, b) = 42;
>
> After ANALYZE, a is NULL in 400 of the 1000 rows (stanullfrac 0.4, ndistinct
> 600) and b has no NULLs (stanullfrac 0, ndistinct 1000). COALESCE(a, b) is
> unique, so exactly one row matches. Each branch is weighted by the probability
> of reaching it, the product of stanullfrac over the branches before it:
>
> branch a: reach 1.0, sel(a = 42) = (1 - 0.4) / 600 = 0.001
> branch b: reach 0.4, sel(b = 42) = (1 - 0.0) / 1000 = 0.001
>
> selectivity = 1.0 * 0.001 + 0.4 * 0.001 = 0.0014 -> ~1 row out of 1000
>
> The patch lands on ~1 row, matching reality. The 0.005 default (5 rows) is not
> derived from the table at all: it is the 1/DEFAULT_NUM_DISTINCT constant the
> planner falls back to with no statistics on the expression, so it stays 5 rows
> regardless of the null fraction or the ndistinct of a and b.
>
> Feedback is welcome.
>
> Regards,
> Egor Savelev,
> Tantor Labs LLC,
> https://tantorlabs.com/

Attachment Content-Type Size
v1-0001-Coalesce-eqsel-eqjoinsel.patch text/x-patch 12.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Japin Li 2026-06-30 13:48:50 Re: WAL compression setting after PostgreSQL LZ4 default change
Previous Message Alberto Piai 2026-06-30 13:44:28 Re: Adding a stored generated column without long-lived locks