Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)

From: Danny Shemesh <dany74q(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
Date: 2022-06-01 16:28:58
Message-ID: CAFZC=QozuGJ3HZVh-vBJJvQbZBOmOJ6Cb-ZKCk3E98Enpu=SYg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hey everyone,

I'm working on improving gathered statistics on several large tables (2TB /
500M records);
I've created extended stats on correlated columns, ran analyze, compared
the pre and post explains, and it seems as though the extended statistics
are ignored -
the estimation doesn't change much, and I can accurately derive the est.
rows from multiplying the single-column freqs from pg_stats with the
supplied values.

Context:
- pg 12.8
- relation w/ three columns of interest - user (text), row_type (text),
deleted_at (timestamp).
- different users have different distributions for row_type
- deleted_at is set if said record is marked for deletion - stays marked
for a long retention period, around 95% of the rows have a value, and
around 5% have null
- we query and operate on the 5% of records with deleted_at = null, most of
our indexes are partial on that as well

I'll give a simplified example with two users, one being extremely dominant
w/ 99.9% of the data; the next holds much less, but still accounts to
around 200k rows.

From pg_stats:
attname | user
n_distinct | 2.0
most_common_vals | {A,B}
most_common_freqs | [0.9996333, 0.00036666667]

attname | row_type
n_distinct | 4.0
most_common_vals | {A,B}
most_common_freqs | [0.9968, 0.0025333334]

attname | deleted_at
n_distinct | 20761.0
null_frac | 0.043133333
<values / bounds are timestamps, omitted for brevity>

Querying before extended stats:

*$> explain select 1 from my_rel where user ='B and row_type = 'A' and
deleted_at is null;Index Only Scan using
idx_user_row_type_where_deleted_at_is_null on my_rel (cost=0.69..2851.78
rows=8213 width=4)Index Cond: ((user = 'B') AND (row_type = 'B'))*

The number is derived from:
reltuples*user_b_freq*row_type_a_freq*deleted_at_null_frac
= 520982816*0.00036666667*0.9968*0.043133333 = 8213.26586

When explain-analyzing, I get - *(actual time=0.051..72.503 rows=174954
loops=1)*
So in this specific case, the estimation is off by around 20x (note that
this is a simplified case just to showcase the symptom).

I then create extended stats - I've tried to add them on all three columns
in all combinations, and in pairs in all combinations - all leading to the
same result,
I'll only showcase the three column variant for brevity:

*$> create statistics s1 on user, row_type, deleted_at from my_rel;*
*$> analyze my_rel;*
*$> explain select 1 from my_rel where user ='B' and row_type = 'A' and
deleted_at is null;*
*Index Only Scan using **idx_user_row_type_where_deleted_at_is_null** on
my_rel (cost=0.69..1560.01 rows=4491 width=4)*

After analyzing, pg_stats contain different values, as the large table is
sampled - I have user_b_freq = 0.0002, row_type_a_freq = 0.99686664,
deleted_at_null_frac = 0.043233335,
thus the new calculation is: 520982816*0.0002*0.99686664*0.043233335
= 4490.64987.

Now it's off by around 40x - and it seems to still only consider the single
column distributions.

Is there anything I'm missing ? I thought that maybe in my case, due to the
high null fractions of deleted_at, the extended stats aren't used, but
couldn't find an obvious hint from the code that would suggest that.

Appreciate your time !
Danny

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2022-06-01 20:08:38 Re: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
Previous Message Tom Lane 2022-06-01 13:46:17 Re: function currtid2() in SQL and ESQL/C to get the new CTID of a row