WHERE IN for JOIN subquery?

From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: WHERE IN for JOIN subquery?
Date: 2017-12-19 00:00:05
Message-ID: CAAcYxUfsY+t49bDVT4um2-E4qUb6iyEudMeLR0n8t9zTwEvX0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I want to do a JOIN against a subquery that is doing an aggregation. The
query itself is relatively straightforward, but has poor performance.

Here it is:
SELECT a.*, b.*
FROM base AS a
LEFT OUTER JOIN
(SELECT other, COUNT(value), COUNT(DISTINCT value) FROM other GROUP
BY other) AS b
USING (other)
WHERE id IN (4, 56, 102);

It's significantly faster, but more complicated (and repetitive), if I add
the following:
WHERE other = ANY(ARRAY(SELECT DISTINCT other FROM base WHERE id IN (4, 56,
102)))

I tried adding the following:
other IN (a.other)
Or:
other = a.other
But I get this error:
ERROR: invalid reference to FROM-clause entry for table "a"

LINE 1: ...ue), COUNT(DISTINCT value) FROM other WHERE other=a.other GR...

^

HINT: There is an entry for table "a", but it cannot be referenced from
this part of the query.

Is there a way to do something like that simpler query so the subquery can
get better performance by filtering only to what it needs instead of doing
the GROUP BY on the whole table?

Thanks,
Dave

In case it's helpful, here's the table definitions:
CREATE TABLE base (id INTEGER PRIMARY KEY, value TEXT, other INTEGER);
CREATE TABLE other (other INTEGER, value INTEGER);

And the explain results:
EXPLAIN ANALYZE SELECT a.*, b.* FROM base AS a LEFT OUTER JOIN (SELECT
other, COUNT(value), COUNT(DISTINCT value) FROM other WHERE other =
ANY(ARRAY(SELECT DISTINCT other FROM base WHERE id IN (4, 56, 102))) GROUP
BY other) AS b USING (other) WHERE id IN (4, 56, 102);

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------

Merge Right Join (cost=27619.21..27741.23 rows=3 width=33) (actual
time=105.045..115.539 rows=3 loops=1)

Merge Cond: (other.other = a.other)

-> GroupAggregate (cost=27602.28..27711.74 rows=1001 width=20) (actual
time=104.989..115.452 rows=3 loops=1)

Group Key: other.other

InitPlan 1 (returns $0)

-> Unique (cost=16.93..16.95 rows=3 width=4) (actual
time=0.083..0.127 rows=3 loops=1)

-> Sort (cost=16.93..16.94 rows=3 width=4) (actual
time=0.073..0.085 rows=3 loops=1)

Sort Key: base.other

Sort Method: quicksort Memory: 25kB

-> Index Scan using base_pkey on base
(cost=0.29..16.91 rows=3 width=4) (actual time=0.019..0.042 rows=3 loops=1)

Index Cond: (id = ANY
('{4,56,102}'::integer[]))

-> Sort (cost=27585.34..27610.20 rows=9945 width=8) (actual
time=99.401..107.199 rows=3035 loops=1)

Sort Key: other.other

Sort Method: quicksort Memory: 239kB

-> Seq Scan on other (cost=0.00..26925.00 rows=9945
width=8) (actual time=0.708..90.738 rows=3035 loops=1)

Filter: (other = ANY ($0))

Rows Removed by Filter: 996965

-> Sort (cost=16.93..16.94 rows=3 width=13) (actual time=0.044..0.051
rows=3 loops=1)

Sort Key: a.other

Sort Method: quicksort Memory: 25kB

-> Index Scan using base_pkey on base a (cost=0.29..16.91 rows=3
width=13) (actual time=0.016..0.027 rows=3 loops=1)

Index Cond: (id = ANY ('{4,56,102}'::integer[]))

Planning time: 4.163 ms

Execution time: 115.665 ms

EXPLAIN ANALYZE SELECT a.*, b.* FROM base AS a LEFT OUTER JOIN (SELECT
other, COUNT(value), COUNT(DISTINCT value) FROM other GROUP BY other) AS b
USING (other) WHERE id IN (4, 56, 102);

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------

Merge Right Join (cost=127786.02..137791.07 rows=3 width=60) (actual
time=7459.042..12060.805 rows=3 loops=1)

Merge Cond: (other.other = a.other)

-> GroupAggregate (cost=127763.19..137765.69 rows=200 width=20)
(actual time=7143.486..12057.835 rows=830 loops=1)

Group Key: other.other

-> Sort (cost=127763.19..130263.31 rows=1000050 width=8) (actual
time=7137.594..9624.119 rows=829088 loops=1)

Sort Key: other.other

Sort Method: external merge Disk: 17576kB

-> Seq Scan on other (cost=0.00..14425.50 rows=1000050
width=8) (actual time=0.555..2727.461 rows=1000000 loops=1)

-> Sort (cost=22.83..22.84 rows=3 width=40) (actual time=0.103..0.112
rows=3 loops=1)

Sort Key: a.other

Sort Method: quicksort Memory: 25kB

-> Bitmap Heap Scan on base a (cost=12.87..22.81 rows=3
width=40) (actual time=0.048..0.064 rows=3 loops=1)

Recheck Cond: (id = ANY ('{4,56,102}'::integer[]))

Heap Blocks: exact=1

-> Bitmap Index Scan on base_pkey (cost=0.00..12.87 rows=3
width=0) (actual time=0.029..0.029 rows=3 loops=1)

Index Cond: (id = ANY ('{4,56,102}'::integer[]))

Planning time: 2.179 ms

Execution time: 12080.172 ms

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message David G. Johnston 2017-12-19 00:10:34 Re: WHERE IN for JOIN subquery?
Previous Message Justin Pryzby 2017-12-17 02:37:01 Re: Bitmap scan is undercosted? - overestimated correlation and cost_index