Re: Improve OR conditions on joined columns (common star schema problem)

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2017-02-10 06:18:27
Message-ID: CAGTBQpZ0JYHqsLikPB=ZB+Qzvqu5pjRcpeY97Zmf+g8dMKEmug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 9, 2017 at 9:50 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
> WHERE t1 IN ('a','b') OR t2 IN ('c','d')
>
> into
>
> WHERE f1 IN (1,2) OR f2 IN (3,4)
>
> (assuming a,b,c,d maps to 1,2,3,4)
>
> BTW, there's an important caveat here: users generally do NOT want duplicate
> rows from the fact table if the dimension table results aren't unique. I
> thought my array solution was equivalent to what the JOINs would do in that
> case but that's actually wrong. The array solution does provide the behavior
> users generally want here though. JOIN is the easiest tool to pick up for
> this, so it's what people gravitate to, but I suspect most users would be
> happier with a construct that worked like the array trick does, but was
> easier to accomplish.
>
> I wonder if any other databases have come up with non-standard syntax to do
> this.

What I've been doing is do those transforms (tn -> fn) in application
code. While it's a chore, the improvement in plans is usually well
worth the trouble.

IF there's a FK between fact and dimension tables, you can be certain
the transform will yield equivalent results, becuase you'll be certain
the joins don't duplicate rows.

So the transform should be rather general and useful.

If you have a join of the form:

a join b on a.f1 = b.id

Where a.f1 has an FK referencing b.id, and a filter on b X of any
form, you can turn the plan into:

with b_ids as (select id from b where X)
...
a join b on a.f1 = b.id and a.f1 in (select id from b_ids)

In order to be useful, the expected row count from b_ids should be rather small.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-02-10 06:19:47 Partitioned tables and relfilenode
Previous Message Stephen Frost 2017-02-10 06:04:36 Re: Removal of deprecated views pg_user, pg_group, pg_shadow