join estimate of subqueries with range conditions and constraint exclusion

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: join estimate of subqueries with range conditions and constraint exclusion
Date: 2017-05-24 21:17:30
Message-ID: 20170524211730.GM31097@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

We got bitten again by what appears to be the same issue I reported (perhaps
poorly) here:
https://www.postgresql.org/message-id/20170326193344.GS31628%40telsasoft.com

We have PG9.6.3 table heirarchies partitioned by time. Our reports use
subqueries each with their own copies of a range clauses on time column, as
needed to get constraint exclusion reference:
https://www.postgresql.org/message-id/25076.1366321335%40sss.pgh.pa.us

SELECT * FROM
(SELECT * FROM t WHERE col>const) a JOIN
(SELECT * FROM t WHERE col>const) b USING (col)

I'm diagnosing a bad estimate/plan due to excessively high n_distinct leading
to underestimated rowcount when selecting from a small fraction of the table
heirarchy. This leads intermittently to bad things, specifically a cascade of
misestimates and associated nested loops around millions of rows.

Artificial/generated/contrived test case, involving table with 99 instances
each of 99 values:

postgres=# CREATE TABLE t(i INT);
postgres=# TRUNCATE t;INSERT INTO t SELECT i FROM generate_series(1,99) i,generate_series(1,99);ANALYZE t;
postgres=# SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, n_distinct, array_length(most_common_vals,1) n_mcv, array_length(histogram_bounds,1) n_hist, (SELECT MAX(x) FROM unnest(most_common_vals::text::text[]) x) maxmcv, (histogram_bounds::text::text[])[array_length(histogram_bounds,1)] maxhist FROM pg_stats WHERE attname~'i' AND tablename='t' GROUP BY 1,2,3,4,5,6,7,8 ORDER BY 1 DESC;
-[ RECORD 1 ]--
frac_mcv | 1
tablename | t
attname | i
n_distinct | 99
n_mcv | 99
n_hist |
maxmcv | 99
maxhist |

range query (which could use constraint exclusion), but bad estimate:
postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t WHERE i<2) AS a JOIN (SELECT * FROM t WHERE i<2) AS b USING (i);
Merge Join (cost=339.59..341.57 rows=99 width=4) (actual time=8.272..16.892 rows=9801 loops=1)

range query which could NOT use constraint exclusion, good estimate:
postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t) AS a JOIN (SELECT * FROM t) AS b USING (i) WHERE i<2;
Hash Join (cost=264.52..541.54 rows=9801 width=4) (actual time=12.688..22.325 rows=9801 loops=1)

non-range query, good estimate:
postgres=# explain ANALYZE SELECT * FROM (SELECT * FROM t WHERE i=3) AS a JOIN (SELECT * FROM t WHERE i=3) AS b USING (i);
Nested Loop (cost=0.00..455.78 rows=9801 width=4) (actual time=0.482..15.820 rows=9801 loops=1)

My understanding:
Postgres estimates join selectivity using number of distinct values of
underlying. For the subqueries "a" and "b", the estimate is same as for
underlying table "t", even when selecting only a small fraction of the table...
This is adt/selfuncs:eqjoinsel_inner().

Note, in my tests, report queries on the child table have correct estimates;
and, queries with only "push down" WHERE clause outside the subquery have
correct estimate (but not constraint exclusion), apparently due to
calc_joinrel_size_estimate() returning the size of the parent table, planning
an join without restriction clause, following by filtering the join result, at
which point I guess the MCV list becomes useful and estimate is perfect..

SELECT * FROM
(SELECT * FROM t)a JOIN(SELECT * FROM t)b
USING (col) WHERE col>const

So my original question is basically still opened ... is it possible to get
both good estimates/plans AND constraint exclusion ??

Thanks
Justin

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Daulat Ram 2017-05-25 05:13:26 Query is very much slow
Previous Message Tomas Vondra 2017-05-24 17:25:54 Re: Query is running very slow......