From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: limit in subquery causes poor selectivity estimation |
Date: | 2011-09-02 16:45:19 |
Message-ID: | 19020.1314981919@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs. It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
> It turns out that this is a regression introduced in 8.4.8;
Well, the fact that examine_variable punts on subqueries is certainly
not a "regression introduced in 8.4.8"; it's always been like that.
I think your observation that 8.4.8 is worse has to be blamed on
commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a
fallback rule of assuming 0.5 selectivity for a semijoin if we didn't
have non-default ndistinct estimates on both sides. Before that, 8.4.x
would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1),
even when one or both ndistinct values were completely made-up.
I'm not sure what we could do instead. Perhaps you could argue that
we should just revert that commit on the grounds that it's doing more
harm than good, but I don't really believe that --- I think reverting
would just move the pain points around. It's pure luck that 8.4.8
is worse rather than better on the particular example you cite.
On a longer-term basis, I'm looking into what we could do with
extracting stats from subqueries, but that doesn't seem like material
for a backpatch. I have a draft patch that I've been playing with
(attached). The main thing I feel unsure about is whether it's
reasonable to extract stats in this way from a subquery that has GROUP
BY or DISTINCT. ISTM it's probably okay to ignore joining, sorting, or
limiting in the subquery: those might affect the stats of the subquery
output, but this is no worse than using the unmodified column statistics
for any other join-level selectivity estimate (where we already ignore
the effects of scan-level restriction clauses that will filter the
column values). But GROUP BY or DISTINCT would entirely invalidate the
column frequency statistics, which makes me think that ignoring the
pg_statistic entry might be the thing to do. Comments?
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
wip-subquery-stats.patch | text/x-patch | 6.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2011-09-02 16:52:05 | Re: pgsql: Remove "fmgr.h" include in cube contrib --- caused crash on a Ge |
Previous Message | Bruce Momjian | 2011-09-02 16:44:28 | Re: pgsql: Remove "fmgr.h" include in cube contrib --- caused crash on a Ge |