Re: Improve OR conditions on joined columns (common star schema problem)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2017-02-12 00:30:35
Message-ID: 4343.1486859435@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> writes:
> On 2/8/17 5:54 PM, Tom Lane wrote:
>> Maybe it'd be better to imagine this as something closer to planagg.c,
>> that is it knows how to apply a specific high-level optimization that
>> might or might not be a win, so it builds a path describing that and sees
>> if it looks cheaper than a path done the normal way. The fact that
>> we could even build a path describing a union is something that wasn't
>> there before 9.6, but maybe there's enough infrastructure for that now.

> It's encouraging that there's at least a template to follow there... If
> there is missing infrastructure, is it likely to be major? My uninformed
> guess would be that the pathification was the major hurdle.

I wrote a POC patch for this on a long airplane ride. It's not complete,
and I'm sure there are bugs as well, but it makes your example case
better. What I did about the de-duplication issue is to de-dup using
the CTIDs of all baserels in the query. This means the optimization
is only applicable to cases where all the rels have CTIDs ... but other
methods such as inspecting unique keys ain't gonna work for non-table
rels either, so I think this is about the best we can hope for.
However, I did not understand your point about:

> BTW, there's an important caveat here: users generally do NOT want
> duplicate rows from the fact table if the dimension table results aren't
> unique.

so maybe we should be thinking about some other way entirely?

Anyway, what's attached below produces this on your example query:

Aggregate (cost=38.12..38.13 rows=1 width=8) (actual time=0.158..0.158 rows=1 loops=1)
-> Result (cost=38.03..38.11 rows=4 width=0) (actual time=0.151..0.155 rows=3 loops=1)
-> Unique (cost=38.03..38.07 rows=4 width=18) (actual time=0.150..0.154 rows=3 loops=1)
-> Sort (cost=38.03..38.04 rows=4 width=18) (actual time=0.150..0.151 rows=4 loops=1)
Sort Key: f.ctid, d1.ctid, d2.ctid
Sort Method: quicksort Memory: 25kB
-> Append (cost=4.85..37.99 rows=4 width=18) (actual time=0.070..0.106 rows=4 loops=1)
-> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.069..0.075 rows=2 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.035..0.038 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.023..0.025 rows=2 loops=1)
Recheck Cond: (f1 = d1.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.016..0.016 rows=2 loops=1)
Index Cond: (f1 = d1.s)
-> Index Scan using dim_pkey on dim d2 (cost=0.28..0.31 rows=1 width=10) (actual time=0.016..0.017 rows=0 loops=2)
Index Cond: (f.f2 = s)
-> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.025..0.029 rows=2 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.022..0.025 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.017..0.019 rows=2 loops=1)
Recheck Cond: (f2 = d2.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.014..0.014 rows=2 loops=1)
Index Cond: (f2 = d2.s)
-> Index Scan using dim_pkey on dim d1 (cost=0.28..0.31 rows=1 width=10) (actual time=0.001..0.001 rows=0 loops=2)
Index Cond: (f.f1 = s)

The main remaining piece of work here is that, as you can see from the
above, it fails to eliminate joins to tables that we don't actually need
in a particular UNION arm. This is because the references to those
tables' ctid columns prevent analyzejoins.c from removing the joins.
I've thought about ways to deal with that but haven't come up with
anything that wasn't pretty ugly and/or invasive.

regards, tom lane

Attachment Content-Type Size
join-or-to-union-1.patch text/x-diff 23.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-02-12 01:44:21 Re: Parallel Index Scans
Previous Message Greg Stark 2017-02-12 00:15:56 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)