Re: Improve row estimation with multi-column unique indexes

From: Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>
To: Anthonin Bonnefoy <anthonin(dot)bonnefoy(at)datadoghq(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve row estimation with multi-column unique indexes
Date: 2026-06-28 05:13:47
Message-ID: 0AF0607F-5FC9-46BD-B593-DE78BEBAAA36@outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

> On Jun 24, 2026, at 16:24, Anthonin Bonnefoy <anthonin(dot)bonnefoy(at)datadoghq(dot)com> wrote:
>
> Hi,
>
> Currently, the row estimation of a query matching a multi-column
> unique index only relies on combining columns' selectivity. This leads
> to a row estimation >1 despite matching the unique index, and possibly
> avoiding using said index.
>
> This issue is visible with the following queries:
>
> CREATE TABLE visitor (id BIGINT GENERATED ALWAYS AS IDENTITY, org INT,
> name TEXT);
> CREATE UNIQUE INDEX ON visitor(org, name);
> INSERT INTO visitor(org, name) SELECT *, 'Support' FROM
> generate_series(1, 500000);
> INSERT INTO visitor(org, name) SELECT 500000, 'User_' || i FROM
> generate_series(1, 2000) as g(i);
> -- Test with only multi column index
> ANALYZE visitor;
> EXPLAIN SELECT * FROM visitor WHERE visitor.org = 500000 AND
> visitor.name = 'Support' ORDER BY visitor.id LIMIT 1;
> -- Test with index on id
> CREATE UNIQUE INDEX ON visitor(id);
> EXPLAIN SELECT * FROM visitor WHERE visitor.org = 500000 AND
> visitor.name = 'Support' ORDER BY visitor.id LIMIT 1;
>
> Which gives the following plans:
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------
> Limit (cost=2861.98..2861.98 rows=1 width=20)
> -> Sort (cost=2861.98..2867.35 rows=2149 width=20)
> Sort Key: id
> -> Index Scan using visitor_org_name_idx on visitor
> (cost=0.42..2851.24 rows=2149 width=20)
> Index Cond: ((org = 500000) AND (name = 'Support'::text))
>
> QUERY PLAN
> --------------------------------------------------------------------------------------------
> Limit (cost=0.42..9.15 rows=1 width=20)
> -> Index Scan using visitor_id_idx on visitor
> (cost=0.42..18753.42 rows=2149 width=20)
> Filter: ((org = 500000) AND (name = 'Support'::text))
>
> Despite matching the unique index, the planner estimates 2149 rows.
> Adding an index on id, the planner sees the new index as cheaper and
> switches to it.
>
> There's already an unique index matching done in btcostestimate, but
> it only adjusts numIndexTuples to 1. The attached patch improves this
> behavior by:
> - Adjusting the index's selectivity to match the numIndexTuples=1 value
> - cost->indexSelectivity is used to propagate this new selectivity,
> which is then used in genericcostestimate, similar to what's done with
> numIndexTuples
> - Index's path row is also adjusted in cost_index if index selectivity
> is better than the initial row estimation
>
> With those changes, the query plan becomes:
> Limit (cost=8.45..8.46 rows=1 width=20)
> -> Sort (cost=8.45..8.46 rows=1 width=20)
> Sort Key: id
> -> Index Scan using visitor_org_name_idx on visitor
> (cost=0.42..8.44 rows=1 width=20)
> Index Cond: ((org = 500000) AND (name = 'Support'::text))
>
> Regards,
> Anthonin Bonnefoy
> <v1-0001-Adjust-row-estimation-and-index-selectivity-with-.patch>

Thanks for working on this. I agree this is a real planner estimation
problem: when equality clauses constrain all key columns of a
multi-column unique index, the planner has a much stronger upper bound
than what independent per-column selectivity can provide.

I think the v1 patch identifies the right fact, but applies it at the
wrong level. The optimizer README expects paths for the same relation
and parameterization to have the same rowcount estimate. v1 instead
lowers only the B-tree `IndexPath`'s `Path.rows`, using
`indexSelectivity` after the AM cost hook has run. That makes the
one-row estimate depend on choosing that access path. A full unique
equality match is a relation-level one-row proof; it should update the
relation row estimate, not use `indexSelectivity` as a rowcount channel.

I put together a v2 that keeps those concerns separate:

- row estimates are clamped where they are produced:
`set_baserel_size_estimates()` for base restrictions, and
`get_parameterized_baserel_size()` when outer-parameter clauses are
needed;
- B-tree costing still caps `indexSelectivity` for the matching lookup,
so heap fetch costing also sees the one-tuple bound;
- the proof reuses the existing `relation_has_unique_index_for()`
machinery, via a small wrapper for base and parameterized clauses.

I also added regression coverage for the positive base and parameterized
cases, a missing-key case that only constrains part of a multi-column
unique key, and the two dangerous negative cases: `IS NULL` with a
regular unique index, and deferrable unique constraints. The latter two
must not be treated as one-row proofs. I also made the B-tree unique
lookup shortcut require immediate uniqueness for the same reason.

I attached the patch. make check passed here. I'd be interested in
your thoughts on whether this split addresses the issue you were
targeting in v1.

--
Best regards,
Chengpeng Yan

Attachment Content-Type Size
v2-0001-Improve-estimates-for-multicolumn-unique-indexes.patch application/octet-stream 16.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bill Kim 2026-06-28 05:36:29 doc: fix two id/xreflabel inconsistencies in config.sgml
Previous Message Xuneng Zhou 2026-06-28 02:10:11 Re: [PATCH] Prevent repeated deadlock-check signals in standby buffer pin waits