Re: Improve OR conditions on joined columns (common star schema problem)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2017-02-14 19:18:54
Message-ID: 22002.1487099934@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

I wrote:
> The main remaining piece of work here is that, as you can see from the
> above, it fails to eliminate joins to tables that we don't actually need
> in a particular UNION arm. This is because the references to those
> tables' ctid columns prevent analyzejoins.c from removing the joins.
> I've thought about ways to deal with that but haven't come up with
> anything that wasn't pretty ugly and/or invasive.

I thought of a way that wasn't too awful, which was to inject the requests
for CTID columns only after we've done join removal. The attached v2
patch produces this for your test case:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=36.84..36.85 rows=1 width=8) (actual time=0.075..0.075 rows=1 loops=1)
-> Result (cost=36.77..36.83 rows=4 width=0) (actual time=0.069..0.073 rows=3 loops=1)
-> Unique (cost=36.77..36.79 rows=4 width=6) (actual time=0.069..0.072 rows=3 loops=1)
-> Sort (cost=36.77..36.78 rows=4 width=6) (actual time=0.068..0.069 rows=4 loops=1)
Sort Key: f.ctid
Sort Method: quicksort Memory: 25kB
-> Append (cost=4.57..36.73 rows=4 width=6) (actual time=0.018..0.033 rows=4 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=6) (actual time=0.018..0.020 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.010..0.012 rows=2 loops=1)
Recheck Cond: (f1 = d1.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.006..0.006 rows=2 loops=1)
Index Cond: (f1 = d1.s)
-> Nested Loop (cost=4.57..18.37 rows=2 width=6) (actual time=0.009..0.011 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.005..0.006 rows=2 loops=1)
Recheck Cond: (f2 = d2.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: (f2 = d2.s)
Planning time: 0.732 ms
Execution time: 0.156 ms
(25 rows)

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it. I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Potential future work includes finding join ORs in upper-level INNER JOIN
ON clauses, and improving matters so we can do the unique-ification with
hashing as well as sorting. I don't think either of those things has to
be in the first commit, though.

regards, tom lane

Attachment Content-Type Size
join-or-to-union-2.patch text/x-diff 29.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2017-02-14 19:53:02 Official adoption of PGXN (was: removing tsearch2)
Previous Message Tomas Vondra 2017-02-14 19:05:20 Re: PATCH: two slab-like memory allocators