Re: Performance Anomaly with "col in (A,B)" vs. "col = A OR col = B" ver. 9.0.3

From: Samuel Gendler <sgendler(at)ideasculptor(dot)com>
To: Timothy Garnett <tgarnett(at)panjiva(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance Anomaly with "col in (A,B)" vs. "col = A OR col = B" ver. 9.0.3
Date: 2011-09-27 05:06:26
Message-ID: CAEV0TzBwL1A6yeXNqQis3bs_V1bTyQwAjtOhfh8yzbGBB-Qzng@mail.gmail.com
Lists: pgsql-performance

On Mon, Sep 26, 2011 at 2:11 PM, Timothy Garnett <tgarnett(at)panjiva(dot)com> wrote:

>
> Though maybe in a lot of common use situations people only supply values
> that are known present, so maybe this would make things worse more often than
> better (maybe LIMIT 1, or better EXISTS, would be a hint that the value is not
> known present). Experimenting a bit, it doesn't seem to matter which values
> are selected, so it's not taking into account any kind of distribution over
> the histogram boundaries.

If I'm not mistaken, the problem here is actually the LIMIT 1, yes? The
planner is opting for the sequential scan because it assumes it will
interrupt the scan relatively quickly when a row is matched. So in the case
where you are really looking for existence, perhaps the better solution is
to select a count of the number of matching rows (perhaps grouped by id so
you know which ones match). That would emulate the behaviour of a select
without a limit, which would be more likely to use the index. It all depends
on just what you are actually doing with the row you are returning, of
course, and whether there is some other way to get it once you know of its
existence.

SELECT count(1), exp_detls.id FROM exp_detls WHERE (exp_detls.hts_code_id IN
(12,654)) GROUP BY exp_detls.id

might work, depending upon how many different values of exp_detls.id you are
likely to see for any given set of hts_code_ids. Actually, I know little
about the query planner, but it seems to me that the aggregation and
grouping might be sufficient to steer it away from preferring the sequential
scan, even if you leave a 'LIMIT 1' on the query, since it may have to scan
many rows in order to return even a single row, because that row contains an
aggregate. So if your concern is about the potential of transferring
millions of rows across the network, I think that might fix it, though it is
definitely a kludge. Of course, the downside is that the index won't be as
fast as a sequential scan in the cases where the scan does get interrupted
quickly, but you've clearly already considered that for your use patterns.
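For comparison, a sketch of the EXISTS form the quoted message hints at
(table and column names are the ones from the thread; this is illustrative
only, not a tested fix for the planner behaviour being discussed):

```sql
-- Pure existence check: EXISTS stops at the first matching row, and
-- because it returns a boolean rather than a limited row set, it avoids
-- the LIMIT 1 pattern that seems to be tipping the planner toward a
-- sequential scan here. Table/columns (exp_detls, hts_code_id) are
-- assumed from the thread.
SELECT EXISTS (
    SELECT 1
    FROM exp_detls
    WHERE hts_code_id IN (12, 654)
);
```

Note this only answers "is there at least one matching row"; if you need to
know which ids match, the grouped count above is still the closer fit.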
