Re: Planning large IN lists

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Atul Deopujari <atul(dot)deopujari(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Neil Conway <neilc(at)samurai(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Planning large IN lists
Date: 2008-03-11 18:15:05
Message-ID: 200803111815.m2BIF5e17619@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Added to TODO:

* Consider using a hash for joining to a large IN (VALUES ...) list

http://archives.postgresql.org/pgsql-hackers/2007-05/msg00450.php

---------------------------------------------------------------------------

Atul Deopujari wrote:
> Hi,
>
> Tom Lane wrote:
> > Neil Conway <neilc(at)samurai(dot)com> writes:
> >> When planning queries with a large IN expression in the WHERE clause,
> >> the planner transforms the IN list into a scalar array expression. In
> >> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
> >> by calling scalararraysel(), which in turn estimates the selectivity of
> >> *each* array element in order to determine the selectivity of the array
> >> expression as a whole.
> >
> >> This is quite inefficient when the IN list is large.
> >
> > That's the least of the problems. We really ought to convert such cases
> > into an IN (VALUES(...)) type of query, since often repeated indexscans
> > aren't the best implementation.
> >
> I thought of giving this a shot and while I was working on it, it
> occurred to me that we need to decide on a threshold value of the IN
> list size above which such transformation should take place. For small
> sizes of the IN list, scalararraysel() of IN list wins over the hash
> join involved in IN (VALUES(...)). But for larger sizes of the IN list,
> IN (VALUES(...)) comes out to be a clear winner.
> I would like to know what does the community think should be a heuristic
> value of the IN list size beyond which this transformation should take
> place.
> I was thinking of a GUC variable (or a hard coded value) which defaults
> to say 30. This is based on numbers from the following test:
>
> postgres=# create table w (w text);
> CREATE TABLE
>
> postgres=# \copy w from '/usr/share/dict/words'
>
> And run the following query with different IN list sizes
> explain analyze select * from w where w in ('one', 'two', ...);
>
> I got the following runtimes:
> ------------------------------------
> IN list IN (VALUES(...)) IN
> size
> ------------------------------------
> 150 ~2000 ms ~5500 ms
> 100 ~1500 ms ~4000 ms
> 80 ~1400 ms ~3000 ms
> 50 ~1400 ms ~2500 ms
> 30 ~1500 ms ~1500 ms
> 20 ~1400 ms ~1200 ms
> 10 ~1400 ms ~1200 ms
> ------------------------------------
>
> The IN (VALUES(...)) gives an almost steady state behavior, while the IN
> runtimes deteriorate with growing list size.
>
> There would obviously be different conditions on which to base this
> value. I seek community opinion on this.
>
> --
> Atul
>
> EnterpriseDB
> www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Cliff Nieuwenhuis 2008-03-11 19:07:40 Re: encoding problems
Previous Message Joshua D. Drake 2008-03-11 18:07:27 Re: Patch for Prevent pg_dump/pg_restore from being affected by statement_timeout