Re: Extended statistics improvement: multi-column MCV missing values

From: Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Extended statistics improvement: multi-column MCV missing values
Date: 2026-06-01 22:18:20
Message-ID: CAN4CZFPySmmiepwG4+MH8nyDJb1xHK2HFcQfu4ZaEOO65jWKTg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello

> I am also not sure IN/ANY should be included in the first version.

I also only focused on testing the "basic" functionality.

> The strongest
> case is a top-level AND condition with equality clauses for all columns

I see one issue related to that with the current patch, having the
same amount of conditions as the table doesn't mean that we have
conditions for all columns.

See the following script that showcases a significant underestimate
when we use one column twice, and another 0 times in the conditions:

CREATE TABLE u (a int, b int, c int);
INSERT INTO u SELECT 0, (g % 5) + 1, g FROM generate_series(1, 50000) g;
INSERT INTO u SELECT (g % 30) + 10, (g % 30) + 10, (g % 30) + 10
FROM generate_series(1, 30000) g;
CREATE STATISTICS su (mcv) ON a, b, c FROM u;
ANALYZE u;
SELECT count(*) FROM u WHERE a = 0 AND b = ANY(ARRAY[1,2,3]); -- 30000
EXPLAIN SELECT * FROM u WHERE a = 0 AND b = ANY(ARRAY[1,2,3]);
-- ~18500, same as master
EXPLAIN SELECT * FROM u WHERE a = 0 AND a = ANY(ARRAY[0,9]) AND b =
ANY(ARRAY[1,2,3]); -- ~5200 vs ~11500 on master

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Andres Freund 2026-06-01 21:57:15 Re: Heads Up: cirrus-ci is shutting down June 1st