From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Andreas Joseph Krogh <andreas(at)visena(dot)com>
Subject: Bogus ANALYZE results for an otherwise-unique column with many nulls
Date: 2016-08-04 22:39:31
Message-ID: 16143.1470350371@sss.pgh.pa.us
Lists: pgsql-hackers

I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so. If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.
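
For reference, here is roughly how that stadistinct value gets turned into a
distinct-value count.  This is a simplified sketch of the convention documented
in pg_statistic.h, not the actual get_variable_numdistinct() code:

#include <stdio.h>

/*
 * Sketch of the stadistinct convention: > 0 is an absolute count of
 * distinct values, < 0 is minus a fraction of the table's row count
 * (so -1 claims every row yields a distinct value -- nulls included),
 * and 0 means "unknown", for which the real code falls back to defaults.
 */
static double
sketch_numdistinct(double stadistinct, double ntuples)
{
    if (stadistinct > 0.0)
        return stadistinct;
    if (stadistinct < 0.0)
        return -stadistinct * ntuples;
    return 0.0;                 /* unknown */
}

int
main(void)
{
    /* -1 on a 10000-row table claims 10000 distinct values, even though
     * roughly 90% of the rows in the example below are NULL */
    printf("%.0f\n", sketch_numdistinct(-1.0, 10000.0));
    return 0;
}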

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then unique1 else null end as unique1 from tenk1;
SELECT 10000
regression=# analyze manynulls;
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.015..2.683 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.553..4.553 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.019..2.668 rows=10000 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |         -1
(1 row)
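
Working through the arithmetic with those stats makes the rows=1 estimate
unsurprising.  A back-of-the-envelope version (not the planner's actual join
selectivity code, though it lands on the same numbers for this example):

#include <stdio.h>

/*
 * Crude antijoin size estimate: outer rows whose join value finds no
 * partner among the inner side's distinct values.  tenk1.unique1 has
 * 10000 distinct values; the question is how many distinct values the
 * planner thinks manynulls.unique1 has.
 */
static double
sketch_antijoin_rows(double outer_rows, double outer_nd, double inner_nd)
{
    double matched = inner_nd / outer_nd;

    return outer_rows * (1.0 - matched);
}

int
main(void)
{
    /* stadistinct = -1: all 10000 inner rows "distinct" => estimate ~0,
     * which the planner clamps to rows=1 */
    printf("bogus: %.0f\n", sketch_antijoin_rows(10000, 10000, 10000));
    /* reality: only 1027 non-null values => ~8973 rows survive the antijoin */
    printf("sane:  %.0f\n", sketch_antijoin_rows(10000, 10000, 1027));
    return 0;
}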

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename = 'manynulls';
 attname | null_frac | n_distinct
---------+-----------+------------
 unique1 |    0.8973 |    -0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 from manynulls m where m.unique1 = t.unique1);
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.031..4.959 rows=10000 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=10000 width=4) (actual time=4.404..4.404 rows=1027 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 165kB
         ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=10000 width=4) (actual time=0.034..2.576 rows=10000 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found. An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong. The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.
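
In code terms the change amounts to something like this (an illustrative
helper, not the patch text itself; the patch applies the adjustment inside
ANALYZE's stats-computation code):

#include <stdio.h>

/*
 * When ANALYZE finds no duplicate non-null values, report "unique among
 * the non-nulls" by discounting the null fraction, instead of a flat -1.0.
 */
static double
unique_column_stadistinct(double null_frac)
{
    /* previously this was simply -1.0 */
    return -1.0 * (1.0 - null_frac);
}

int
main(void)
{
    /* with null_frac = 0.8973 this gives -0.1027, matching pg_stats above */
    printf("%.4f\n", unique_column_stadistinct(0.8973));
    return 0;
}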

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values. We didn't back-patch
that fix, so probably we shouldn't back-patch this either. On the other
hand, it is far more open-and-shut that this is wrong. Thoughts?

regards, tom lane

Attachment Content-Type Size
another-analyze-with-nulls-fix.patch text/x-diff 1.9 KB
