From: Timothy Garnett <tgarnett(at)panjiva(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance Anomaly with "col in (A,B)" vs. "col = A OR col = B" ver. 9.0.3
Date: 2011-09-26 21:11:33
Message-ID: CAPcyiQ20JUh=jGaM95rjHZ0yG5VNYsXNjvskZovFXHp7yqNgSQ@mail.gmail.com
Lists: pgsql-performance

> Well, the reason it likes the first plan is that that has a smaller
> estimated cost ;-). Basically this is a startup-time-versus-total-time
> issue: the cost of the seqscan+limit is estimated to be about 1/8447'th
> of the time to read the whole table, since it's estimating 8447
> candidate matches and assuming that those are uniformly distributed in
> the table. Meanwhile, the bitmap scan has a significant startup cost
> because the entire indexscan is completed before we start to do any
> fetching from the heap. The overestimate of the number of matching
> rows contributes directly to overestimating the cost of the indexscan,
> too. It ends up being a near thing --- 158 vs 166 cost units --- but
> on the basis of these estimates the planner did the right thing.
>
> So, what you need to do to make this better is to get it to have a
> better idea of how many rows match the query condition; the overestimate
> is both making the expensive plan look cheap, and making the cheaper
> plan look expensive. Cranking up the statistics target for the
> hts_code_id column (and re-ANALYZEing) ought to fix it. If all your
> tables are this large you might want to just increase
> default_statistics_target across the board.
>
> regards, tom lane
>

Thanks for the great description of what's happening. Very informative.
Upping the statistics target to the max of 10000 (from the default 100) makes my
example query use a faster plan, but not other problematic queries we have in
the same vein (just add a few more values to the IN clause). For example (this
is with the statistics target set to 10000 and the table re-ANALYZEd):

=> explain analyze SELECT "exp_detls".id FROM "exp_detls"
   WHERE ("exp_detls"."hts_code_id" IN (8127,8377,8374,10744,11125,8375,8344,9127,9345))
   LIMIT 1;

                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..152.94 rows=1 width=4) (actual time=12057.637..12057.637 rows=0 loops=1)
   ->  Seq Scan on exp_detls  (cost=0.00..1651399.83 rows=10798 width=4) (actual time=12057.635..12057.635 rows=0 loops=1)
         Filter: (hts_code_id = ANY ('{8127,8377,8374,10744,11125,8375,8344,9127,9345}'::integer[]))
 Total runtime: 12057.678 ms
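
For reference, here is roughly how the statistics bump above was applied, using
the per-column form Tom suggested with the table and column from the schema in
my original e-mail (a sketch, not an exact transcript):

=> ALTER TABLE exp_detls ALTER COLUMN hts_code_id SET STATISTICS 10000;
=> ANALYZE exp_detls;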

From your description I think the planner is making two problematic
assumptions that are leading to our issue:

The first is that the potential matches are uniformly distributed across the
physical table. While there are a number of reasons this may not be the case
(correlated insertion order, update patterns, etc.), in this case there's a
very clear reason: the table is clustered on an index that leads with the
column we're querying against ('hts_code_id'), and nothing has been inserted
or updated in the table since the last time we ran CLUSTER on it (see the
schema in the original e-mail).
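
A quick way to see what the planner knows about the physical ordering is the
correlation column in pg_stats; this is just a sketch against our table and
column names:

=> SELECT attname, correlation
     FROM pg_stats
    WHERE tablename = 'exp_detls' AND attname = 'hts_code_id';

Since the table was just clustered on an index leading with hts_code_id, the
correlation there should be at or near 1, but as I understand it that only
feeds into the index-scan cost estimate, not into the seqscan+limit assumption
that matching rows are spread evenly through the heap.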

The second is that it's assuming the IN clause values are actually present in
the table (and at some reasonably large frequency); in the worst cases (like
above) they aren't present at all, forcing a full table scan. I don't know how
the expected frequency of values not present in the most-common-values list is
estimated, but I'm guessing it's something based on the residual probability
left after the most common values and the estimated number of distinct values?
If so, that assumes the supplied values actually are among the distinct values
in the table, which seems unjustified.
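
The per-column numbers I'm talking about are the ones exposed in pg_stats; a
sketch of what I've been looking at:

=> SELECT n_distinct, most_common_vals, most_common_freqs
     FROM pg_stats
    WHERE tablename = 'exp_detls' AND attname = 'hts_code_id';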

Perhaps there should be some estimate of whether a supplied value is one of the
distinct values or not (you could probably do something pretty statistically
solid with a not-too-large Bloom filter), or some scaling by the width of the
histogram bucket the value falls in (i.e. assuming an even distribution across
the bucket). Then again, in a lot of common usage people only supply values
that are known to be present, so maybe this would make things worse more often
than better (perhaps LIMIT 1, or better yet EXISTS, could be a hint that the
value is not known to be present). Experimenting a bit, it doesn't seem to
matter which values are selected, so the estimate isn't taking into account any
kind of distribution over the histogram boundaries. For example, this query:

=> explain analyze SELECT "exp_detls".id FROM "exp_detls"
   WHERE ("exp_detls"."hts_code_id" IN (20000,20001,200002,20003,20004,20005,20006,20007,20008))
   LIMIT 1;

                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..152.94 rows=1 width=4) (actual time=12925.646..12925.646 rows=0 loops=1)
   ->  Seq Scan on exp_detls  (cost=0.00..1651399.83 rows=10798 width=4) (actual time=12925.644..12925.644 rows=0 loops=1)
         Filter: (hts_code_id = ANY ('{20000,20001,200002,20003,20004,20005,20006,20007,20008}'::integer[]))
 Total runtime: 12925.677 ms

It has the same expected row count as the earlier query with the same number of
not-present IN values, but looking at the histogram boundaries these values are
much, much less likely to find rows (assuming an even distribution within each
bucket) than the values in the earlier query (the histogram boundaries are
something like 1, 5, 10, 15, ..., 11995, 12000, 80000, 80005, ...). I can send
the actual stats if interested (the output is large, though). Feels like this
should factor into the estimates somehow.
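
(The histogram boundaries quoted above come straight from pg_stats as well; a
sketch of the query I used to look at them:)

=> SELECT histogram_bounds
     FROM pg_stats
    WHERE tablename = 'exp_detls' AND attname = 'hts_code_id';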

Also, we occasionally run into surprises like this, especially as tables grow
(fast query plans hit some planning tipping point and turn into very slow
plans, or some new combination of values is requested, etc.). It would be nice
if there were some straightforward way to tweak the planner toward less risky
plans. Not necessarily full worst-case planning, but perhaps picking the plan
based on expected cost + n% * 'worst case' cost for some reasonable estimate of
the worst case, or using some probability distribution around the expected cost
and ranking plans by the nth percentile time. Just thinking out loud.

Btw, thanks for all your work on what is a really great and useful tool!

Tim
