Re: Planning large IN lists

From: "Atul Deopujari" <atul(dot)deopujari(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Neil Conway" <neilc(at)samurai(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planning large IN lists
Date: 2007-05-17 09:13:04
Message-ID: 464C1CA0.1060705@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Tom Lane wrote:
> Neil Conway <neilc(at)samurai(dot)com> writes:
>> When planning queries with a large IN expression in the WHERE clause,
>> the planner transforms the IN list into a scalar array expression. In
>> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
>> by calling scalararraysel(), which in turn estimates the selectivity of
>> *each* array element in order to determine the selectivity of the array
>> expression as a whole.
>
>> This is quite inefficient when the IN list is large.
>
> That's the least of the problems. We really ought to convert such cases
> into an IN (VALUES(...)) type of query, since often repeated indexscans
> aren't the best implementation.
>
I thought of giving this a shot and while I was working on it, it
occurred to me that we need to decide on a threshold value of the IN
list size above which such transformation should take place. For small
sizes of the IN list, scalararraysel() of IN list wins over the hash
join involved in IN (VALUES(...)). But for larger sizes of the IN list,
IN (VALUES(...)) comes out to be a clear winner.
I would like to know what does the community think should be a heuristic
value of the IN list size beyond which this transformation should take
place.
I was thinking of a GUC variable (or a hard coded value) which defaults
to say 30. This is based on numbers from the following test:

postgres=# create table w (w text);
CREATE TABLE

postgres=# \copy w from '/usr/share/dict/words'

And run the following query with different IN list sizes
explain analyze select * from w where w in ('one', 'two', ...);

I got the following runtimes:
------------------------------------
IN list IN (VALUES(...)) IN
size
------------------------------------
150 ~2000 ms ~5500 ms
100 ~1500 ms ~4000 ms
80 ~1400 ms ~3000 ms
50 ~1400 ms ~2500 ms
30 ~1500 ms ~1500 ms
20 ~1400 ms ~1200 ms
10 ~1400 ms ~1200 ms
------------------------------------

The IN (VALUES(...)) gives an almost steady state behavior, while the IN
runtimes deteriorate with growing list size.

There would obviously be different conditions on which to base this
value. I seek community opinion on this.

--
Atul

EnterpriseDB
www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Atul Deopujari 2007-05-17 09:19:26 Re: Planning large IN lists
Previous Message Dave Page 2007-05-17 09:04:28 Re: Lack of urgency in 8.3 reviewing