
Re: Performance Anomaly with "col in (A,B)" vs. "col = A OR col = B" ver. 9.0.3

From: Timothy Garnett <tgarnett(at)panjiva(dot)com>
To: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance Anomaly with "col in (A,B)" vs. "col = A OR col = B" ver. 9.0.3
Date: 2011-09-27 12:57:12
Message-ID: CAPcyiQ1cyRxKCAhsoj=4zs34CmAykDUaUk=5ca5oeAEawRsrcA@mail.gmail.com
Lists: pgsql-performance
Actually, thinking about this a little more, what we really want the planner
to do is consider the codes one at a time until it finds one that exists.
If we write that out explicitly, we get a good plan whether the ids select
many rows or none.

=> explain analyze
select 1 from (
  select * from (select 1 from exp_detls where hts_code_id in (469169) limit 1) a
 union all
  select * from (select 1 from exp_detls where hts_code_id in (15289) limit 1) b
 union all
  select * from (select 1 from exp_detls where hts_code_id in (468137) limit 1) c
 union all
  select * from (select 1 from exp_detls where hts_code_id in (14655) limit 1) d
 union all
  select * from (select 1 from exp_detls where hts_code_id in (14670) limit 1) e
 union all
  select * from (select 1 from exp_detls where hts_code_id in (15291) limit 1) f
) dummy limit 1;

                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.75 rows=1 width=0) (actual time=0.031..0.032 rows=1 loops=1)
   ->  Result  (cost=0.00..10.52 rows=6 width=0) (actual time=0.029..0.029 rows=1 loops=1)
         ->  Append  (cost=0.00..10.52 rows=6 width=0) (actual time=0.029..0.029 rows=1 loops=1)
               ->  Subquery Scan on a  (cost=0.00..1.72 rows=1 width=0) (actual time=0.027..0.027 rows=1 loops=1)
                     ->  Limit  (cost=0.00..1.71 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
                           ->  Index Scan using index_exp_detls_on_hts_code_id_and_data_month on exp_detls  (cost=0.00..144890.56 rows=84657 width=0) (actual time=0.025..0.025 rows=1 loops=1)
                                 Index Cond: (hts_code_id = 469169)
               ->  Subquery Scan on b  (cost=0.00..1.74 rows=1 width=0) (never executed)
                     ->  Limit  (cost=0.00..1.73 rows=1 width=0) (never executed)
                           ->  Index Scan using index_exp_detls_on_hts_code_id_and_data_month on exp_detls  (cost=0.00..118206.85 rows=68477 width=0) (never executed)
                                 Index Cond: (hts_code_id = 15289)
               ->  Subquery Scan on c  (cost=0.00..1.74 rows=1 width=0) (never executed)
                     ->  Limit  (cost=0.00..1.73 rows=1 width=0) (never executed)
                           ->  Index Scan using index_exp_detls_on_hts_code_id_and_data_month on exp_detls  (cost=0.00..102645.38 rows=59168 width=0) (never executed)
                                 Index Cond: (hts_code_id = 468137)
               ->  Subquery Scan on d  (cost=0.00..1.81 rows=1 width=0) (never executed)
                     ->  Limit  (cost=0.00..1.80 rows=1 width=0) (never executed)
                           ->  Index Scan using index_exp_detls_on_hts_code_id_and_data_month on exp_detls  (cost=0.00..2155.38 rows=1200 width=0) (never executed)
                                 Index Cond: (hts_code_id = 14655)
               ->  Subquery Scan on e  (cost=0.00..1.75 rows=1 width=0) (never executed)
                     ->  Limit  (cost=0.00..1.74 rows=1 width=0) (never executed)
                           ->  Index Scan using index_exp_detls_on_hts_code_id_and_data_month on exp_detls  (cost=0.00..90309.37 rows=51853 width=0) (never executed)
                                 Index Cond: (hts_code_id = 14670)
               ->  Subquery Scan on f  (cost=0.00..1.75 rows=1 width=0) (never executed)
                     ->  Limit  (cost=0.00..1.74 rows=1 width=0) (never executed)
                           ->  Index Scan using index_exp_detls_on_hts_code_id_and_data_month on exp_detls  (cost=0.00..84767.69 rows=48586 width=0) (never executed)
                                 Index Cond: (hts_code_id = 15291)
 Total runtime: 0.089 ms
(28 rows)

This is sub-millisecond for all the combinations of ids (present or not)
that we've tried, so we'll definitely go with this.  Thanks for the help and
explanations!
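
Since this query is framework-generated anyway, the rewrite is easy to
mechanize.  A minimal sketch of a builder for the UNION ALL probe (hypothetical
Python, not from the thread; an in-memory SQLite table stands in for
exp_detls, but the generated SQL has the same shape as the query above):

```python
import sqlite3

def exists_any_sql(n_codes):
    """Build the UNION ALL existence probe for n_codes placeholders.

    Each branch is an independent LIMIT 1 probe on one id, so the
    executor can stop at the first id that has at least one row; the
    remaining branches are never executed.
    """
    branch = ("SELECT * FROM (SELECT 1 AS one FROM exp_detls "
              "WHERE hts_code_id = ? LIMIT 1)")
    return ("SELECT 1 FROM ("
            + " UNION ALL ".join(branch for _ in range(n_codes))
            + ") LIMIT 1")

# Demo: a stand-in exp_detls with two ids present.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exp_detls (hts_code_id INTEGER)")
conn.executemany("INSERT INTO exp_detls VALUES (?)",
                 [(469169,), (14655,)])

codes = [469169, 15289, 468137]
row = conn.execute(exists_any_sql(len(codes)), codes).fetchone()
print(row)  # -> (1,): at least one of the codes is present
```

A miss (all codes absent) returns no row at all, which is the
"does at least one exist" answer the framework is after.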

Tim

On Tue, Sep 27, 2011 at 8:40 AM, Timothy Garnett <tgarnett(at)panjiva(dot)com> wrote:

> Hi Sam,
>
> The purpose of this (framework-generated) code is to find out if there is
> at least one row that has one of the selected hts_code_ids.  We don't care
> about anything that's returned other than whether at least one row exists
> (rewriting the query with EXISTS generates that same plan).  The actual
> selectivity can vary greatly, anywhere from 0 to > 500k rows, depending on
> the codes chosen.  On the higher end of the range a select count(*) takes
> ~14 times longer than a limit 1 query that does an index scan (~475ms vs. 36ms).
>
> What we're doing for the moment is putting a straitjacket on the planner:
>
> begin ;
> set local enable_seqscan = off ;
> SELECT "exp_detls".id FROM "exp_detls"
> WHERE ("exp_detls"."hts_code_id" IN (...)) LIMIT 1;
> commit ;
>
> Even when the where clause selects a good fraction of the table, a seq scan
> has highly variable performance because of the clustered index (it depends
> on what was last selected out of the table), so we pretty much never want
> to use it (maybe the planner should be taking the clustering into
> consideration?).  What we'll probably move to is maintaining a table of the
> ids that are present in that column and running the query against that, as
> the above, while acceptable, can still be on the slow side when the where
> clause is not very selective and many rows are scanned in the index before
> the limit can be applied.  It would be nice if the bitmap index scan could
> be done piecemeal, alternating with the heap scan when a tight limit is
> present, so that less work has to be done, but I could see that being
> really problematic to implement and use.
>
> Tim
>
>
> On Tue, Sep 27, 2011 at 1:06 AM, Samuel Gendler <sgendler(at)ideasculptor(dot)com
> > wrote:
>
>>
>>
>> On Mon, Sep 26, 2011 at 2:11 PM, Timothy Garnett <tgarnett(at)panjiva(dot)com> wrote:
>>
>>>
>>> Though maybe in a lot of common use situations people only supply values
>>> that are known to be present, so maybe this would make things worse more
>>> often than better (maybe limit 1, or better EXISTS, would be a hint that
>>> the value is not known to be present).  Experimenting a bit, it doesn't
>>> seem to matter which values are selected, so it's not taking into account
>>> any kind of distribution over the histogram boundaries.
>>
>>
>> If I'm not mistaken, the problem here is actually the LIMIT 1, yes?  The
>> planner is opting for the sequential scan because it assumes it will
>> interrupt the scan relatively quickly when a row is matched?  So in the case
>> where you are really looking for existence, perhaps the better solution is
>> to select a count of the number of matching rows (perhaps grouped by id so
>> you know which ones match)? That would emulate the behaviour of select
>> without a limit, which would be more likely to use the index. It all depends
>> on just what you are actually doing with the row you are returning, of
>> course, and if there is some other way to get it once you know of its
>> existence.
>>
>> SELECT count(1), exp_detls.id FROM exp_detls WHERE (exp_detls.hts_code_id
>> IN (12,654)) GROUP BY exp_detls.id
>>
>> might work, depending upon how many different values of exp_detls.id you
>> are likely to see for any given set of hts_code_ids.  Actually, I know
>> little about the query planner, but it seems to me that the aggregation and
>> grouping might be sufficient to force it away from preferring the sequential
>> scan, even if you leave a 'limit 1' on the query, since it will have to find
>> more than 1 row in order to return a single row, since that single row
>> contains an aggregate.  So if your concern is about the potential of
>> transferring millions of rows across the network, I think that might fix it,
>> though it is definitely a kludge.  Of course, the downside is that the index
>> won't be as fast as a sequential scan in the cases where the scan does get
>> interrupted quickly, but you've clearly already considered that for your use
>> patterns.
>>
>
>
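
As a footnote to the id-presence side table mentioned upthread, a minimal
sketch of that idea (hypothetical, not from the thread; SQLite stands in for
PostgreSQL, where the `INSERT OR IGNORE` would instead be
`INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exp_detls (hts_code_id INTEGER);
-- Small side table of distinct ids known to be present in exp_detls,
-- maintained alongside writes to the big table.
CREATE TABLE present_hts_code_ids (hts_code_id INTEGER PRIMARY KEY);
""")

def insert_detl(conn, code):
    """Write to the big table and keep the presence table in sync."""
    conn.execute("INSERT INTO exp_detls VALUES (?)", (code,))
    conn.execute("INSERT OR IGNORE INTO present_hts_code_ids VALUES (?)",
                 (code,))

for c in (469169, 469169, 14655):
    insert_detl(conn, c)

def any_code_present(conn, codes):
    """Existence check against the tiny side table instead of exp_detls."""
    qmarks = ",".join("?" for _ in codes)
    sql = ("SELECT 1 FROM present_hts_code_ids "
           "WHERE hts_code_id IN (%s) LIMIT 1" % qmarks)
    return conn.execute(sql, codes).fetchone() is not None

print(any_code_present(conn, [15289, 469169]))  # True
print(any_code_present(conn, [15289]))          # False
```

The probe then touches only as many rows as there are distinct ids, so its
cost no longer depends on how selective the codes are in the big table.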
