Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Thomas Kellerer <shammat(at)gmx(dot)net>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: How bad is using queries with thousands of values for operators IN or ANY?
Date: 2020-09-01 16:46:01
Message-ID: CAFj8pRAVJZypehmFiskRk=AOFbupWMCZRLRp7ocTnXoM8rVFCQ@mail.gmail.com
Lists: pgsql-general

On Tue, Sep 1, 2020 at 9:22 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
wrote:

> At Mon, 31 Aug 2020 16:04:43 +0200, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote in
> > On Mon, Aug 31, 2020 at 1:29 PM Thomas Kellerer <shammat(at)gmx(dot)net>
> > wrote:
> >
> > > Thorsten Schöning wrote on 31.08.2020 at 12:37:
> > > > So for what query size or number of IDs to compare in IN would you
> > > > consider a different approach at all?
> > >
> > >
> > > In my experience "hundreds" of IDs tend to be quite slow if used with
> an
> > > IN clause.
> > >
> > > Rewriting the IN to a JOIN against a VALUES clause is very often
> faster:
> > >
> > > So instead of:
> > >
> > > select *
> > > from t
> > > where id in (1,2,3, .... ,500);
> > >
> > > using this:
> > >
> > > select *
> > > from t
> > > join (
> > > values (1),(2),(3),...(500)
> > > ) as x(id) on x.id = t.id
> > >
> > > more often than not produces a more efficient execution plan (assuming
> > > no values are duplicated in the IN list).
> > >
> > > Obviously I don't know if such a re-write is even feasible though.
> > >
> >
> > yes - this query will probably have a slow start, but execution will be
> > fast. Unfortunately, no statistics are available for the VALUES list.
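
The IN vs. JOIN (VALUES ...) rewrite discussed above can be sketched with an
in-memory SQLite database (SQLite also accepts VALUES in FROM, with columns
named column1, column2, ...; the table t1 and the id list here are made up
for illustration). It also shows the duplicate caveat: the JOIN form
multiplies rows when the list contains duplicates, while IN does not.

```python
import sqlite3

# Minimal sketch, assuming nothing beyond the rewrite quoted above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (a INTEGER PRIMARY KEY, b INTEGER)")
con.executemany("INSERT INTO t1 VALUES (?, ?)",
                [(i, i * 2) for i in range(1000)])

ids = [i // 10 + 100 for i in range(500)]  # 50 distinct values, each 10 times

# IN form: duplicates in the list are effectively ignored.
in_sql = f"SELECT b FROM t1 WHERE a IN ({','.join('?' * len(ids))})"
rows_in = con.execute(in_sql, ids).fetchall()

# JOIN VALUES form: each VALUES row joins, so duplicates multiply the result,
# which is why the rewrite assumes the list is duplicate-free.
values = ",".join("(?)" for _ in ids)
join_sql = f"SELECT b FROM t1 JOIN (VALUES {values}) AS v ON v.column1 = t1.a"
rows_join = con.execute(join_sql, ids).fetchall()

print(len(rows_in), len(rows_join))  # prints: 50 500
```

The two forms are only equivalent when the value list is duplicate-free (or
after deduplicating it client-side), matching the caveat in the quoted reply.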
>
> FWIW, attached is a dusted-off version of part of a stalled development
> of mine, which unconditionally(!) creates on-the-fly statistics for the
> VALUES list. It seems to work for certain cases, although the planning
> time increases significantly.
>
> =$ CREATE TABLE t1 AS SELECT a, a * 2 AS b FROM generate_series(0, 99999)
> a;
> =$ CREATE INDEX ON t1 (a);
> > perl q.pl(*) | psql
>
> *: q.pl:
> > print "explain analyze select b from t1 join (values ";
> > foreach $i (0..10000) {
> >     print ", " if ($i > 0);
> >     printf("(%d)", $i/10 + 1000);
> > }
> > print ") as v(v) on (v.v = t1.a);";
>
>
> patched:
>
> Merge Join  (cost=824.25..1005.19 rows=10001 width=4) (actual time=13.513..24.285 rows=10001 loops=1)
>   Merge Cond: (t1.a = "*VALUES*".column1)
>   ->  Index Scan using t1_a_idx on t1  (cost=0.29..3050.29 rows=100000 width=8) (actual time=0.033..1.629 rows=2002 loops=1)
>   ->  Sort  (cost=789.47..814.47 rows=10001 width=4) (actual time=12.557..14.546 rows=10001 loops=1)
>         Sort Key: "*VALUES*".column1
>         Sort Method: quicksort  Memory: 931kB
>         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.271 rows=10001 loops=1)
> Planning Time: 17.290 ms
> Execution Time: 26.344 ms
> (9 rows)
>
> master:
> Hash Join  (cost=250.03..2168.03 rows=10001 width=4) (actual time=14.482..77.205 rows=10001 loops=1)
>   Hash Cond: (t1.a = "*VALUES*".column1)
>   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.017..23.540 rows=100000 loops=1)
>   ->  Hash  (cost=125.01..125.01 rows=10001 width=4) (actual time=13.786..13.788 rows=10001 loops=1)
>         Buckets: 16384  Batches: 1  Memory Usage: 480kB
>         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.503 rows=10001 loops=1)
> Planning Time: 12.365 ms
> Execution Time: 78.567 ms
> (8 rows)
>
> regards.
>
>
nice :)

Pavel
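
The q.pl generator quoted above could equally be written in Python (a direct
translation sketch of that one-liner; same value pattern, i/10 + 1000 over
0..10000 inclusive, producing 10001 values to pipe into psql):

```python
# Python equivalent of the quoted q.pl: build the
# "explain analyze ... join (values ...)" statement.
parts = ", ".join(f"({i // 10 + 1000})" for i in range(10001))
query = (f"explain analyze select b from t1 join (values {parts}) "
         f"as v(v) on (v.v = t1.a);")
print(query)
```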

--
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
