Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Michael Lewis <mlewis(at)entrata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Thomas Kellerer <shammat(at)gmx(dot)net>, PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: How bad is using queries with thousands of values for operators IN or ANY?
Date: 2020-09-01 16:35:51
Message-ID: CAHOFxGq+1OdxQZ0prKEgoGYTL6sfrz3aHB6xdQKFOxeJsDRA2w@mail.gmail.com
Lists: pgsql-general

On Tue, Sep 1, 2020 at 1:22 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:

> FWIW, the attached is a dusted-off version of part of a stalled
> development of mine, which unconditionally(!) creates on-the-fly
> statistics on VALUES lists. It seems to work for certain cases,
> although the planning time increases significantly.
>
> =$ CREATE TABLE t1 AS SELECT a, a * 2 AS b FROM generate_series(0, 99999) a;
> =$ CREATE INDEX ON t1 (a);
> > perl q.pl(*) | psql
>
> *: q.pl:
> > print "explain analyze select b from t1 join (values ";
> > foreach $i (0..10000) {
> >     print ", " if ($i > 0);
> >     printf("(%d)", $i/10 + 1000);
> > }
> > print ") as v(v) on (v.v = t1.a);";
>
>
> patched:
>
> Merge Join  (cost=824.25..1005.19 rows=10001 width=4) (actual time=13.513..24.285 rows=10001 loops=1)
>   Merge Cond: (t1.a = "*VALUES*".column1)
>   ->  Index Scan using t1_a_idx on t1  (cost=0.29..3050.29 rows=100000 width=8) (actual time=0.033..1.629 rows=2002 loops=1)
>   ->  Sort  (cost=789.47..814.47 rows=10001 width=4) (actual time=12.557..14.546 rows=10001 loops=1)
>         Sort Key: "*VALUES*".column1
>         Sort Method: quicksort  Memory: 931kB
>         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.271 rows=10001 loops=1)
> Planning Time: 17.290 ms
> Execution Time: 26.344 ms
> (9 rows)
>
> master:
> Hash Join  (cost=250.03..2168.03 rows=10001 width=4) (actual time=14.482..77.205 rows=10001 loops=1)
>   Hash Cond: (t1.a = "*VALUES*".column1)
>   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.017..23.540 rows=100000 loops=1)
>   ->  Hash  (cost=125.01..125.01 rows=10001 width=4) (actual time=13.786..13.788 rows=10001 loops=1)
>         Buckets: 16384  Batches: 1  Memory Usage: 480kB
>         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.503 rows=10001 loops=1)
> Planning Time: 12.365 ms
> Execution Time: 78.567 ms
> (8 rows)
>

We end up abusing temp tables, creating and analyzing them, to get around
the pain of queries going off the rails because of bad stats, or a lack of
stats, on a long list of values (a sketch of that workaround is below). I
believe most, if not all, of the core team, and perhaps most contributors,
are against query hints in general (with some very good reasons), but it
would be great to at least have the option of incurring the extra planning
time cost in cases like this.
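
In case it's useful to anyone hitting the same wall, here is a minimal
sketch of that workaround, reusing the t1 table from the example above
(the id values are made up; the real lists run to thousands of entries):

CREATE TEMP TABLE tmp_ids (id int PRIMARY KEY);
-- In practice the application batch-inserts thousands of values here.
INSERT INTO tmp_ids (id) VALUES (1001), (1002), (1003);
-- ANALYZE gives the planner real row counts and ndistinct for the list,
-- rather than the default guesses it makes for a bare VALUES list.
ANALYZE tmp_ids;
-- Join against the analyzed temp table instead of using IN (...) or = ANY(...).
SELECT t1.b FROM t1 JOIN tmp_ids ON tmp_ids.id = t1.a;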

For my case, I think the changes coming in v13, or maybe v14, that allow
joins to be helped by multivariate statistics will mitigate the pain point.
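
For reference, an extended statistics object of that kind is created like
this; the statistics name, table, and columns here are hypothetical
stand-ins for our schema:

-- Tell the planner these columns are correlated, so it stops treating
-- their selectivities as independent when estimating row counts.
CREATE STATISTICS orders_customer_status_stats (ndistinct, dependencies)
    ON customer_id, status FROM orders;
-- Extended statistics are only populated when the table is analyzed.
ANALYZE orders;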
