Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Jim Nasby <jim(dot)nasby(at)openscg(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date: 2018-04-01 14:51:22
Message-ID: CAKJS1f-rdJ4XoQ1D32zCmxei3soKrK_hpuDRz4yFwnm0tmAwdg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3 February 2018 at 03:26, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> ISTM this patch got somewhat stuck as we're not quite sure the
>> transformation is correct in all cases. Is my impression correct?
>
> Yeah, that's the core issue.
>
>> If yes, how to we convince ourselves? Would some sort of automated testing
>> (generating data and queries) help? I'm willing to spend some cycles on
>> that, if considered helpful.
>
> I'm not sure if that would be enough to convince doubters. On the other
> hand, if it found problems, that would definitely be useful.

I've read over this thread and as far as I can see there is concern
that the UNION on the ctids might not re-uniquify the rows again. Tom
mentioned a problem with FULL JOINs, but the concern appears to have
been invalidated due to wrong thinking about how join removals work.

As of now, I'm not quite sure who exactly is concerned. Tom thought
there was an issue but quickly corrected himself.

As far as I see it, there are a few cases we'd need to disable the optimization:

1. Query contains target SRFs (the same row might get duplicated, we
don't want to UNION these duplicates out again, they're meant to be
there)
2. Query has setops (ditto)
3. Any base rels are not RELKIND_RELATION (we need the ctid to
uniquely identify rows)
4. Query has volatile functions (don't want to evaluate volatile
functions more times than requested)

As far as the DISTINCT clause doing the right thing for joins, I see
no issues, even with FULL JOINs. In both branches of the UNION the
join condition will be the same so each side of the join always has
the same candidate row to join to. I don't think the optimization is
possible if there are OR clauses in the join condition, but that's not
being proposed.

FULL JOINS appear to be fine as the row is never duplicated on a
non-match, so there will only be one version of t1.ctid, NULL::tid or
NULL::tid, t1.ctid and all ctids in the distinctClauses cannot all be
NULL at once.

I used the following SQL to help my brain think through this. There
are two versions of each query, one with DISTINCT and one without. If
the DISTINCT returns fewer rows than the one without then we need to
disable this optimization for that case. I've written queries for 3 of
the above 4 cases. I saw from reading the thread that case #4 is
already disabled:

drop table if exists t1,t2;
create table t1 (a int);
create table t2 (a int);

insert into t1 values(1),(1),(2),(4);
insert into t2 values(1),(1),(3),(3),(4),(4);

select t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a;

select distinct t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a;

-- case 1: must disable in face of tSRFs

select ctid from (select ctid,generate_Series(1,2) from t1) t;

select distinct ctid from (select ctid,generate_Series(1,2) from t1) t;

-- case 2: must disable optimization with setops.

select ctid from (select ctid from t1 union all select ctid from t1) t;

select distinct ctid from (select ctid from t1 union all select ctid from t1) t;

-- case 3: must disable if we join to anything other than a
RELKIND_RELATION (no ctid)

select ctid from (select t1.ctid from t1 inner join (values(1),(1))
x(x) on t1.a = x.x) t;

select distinct ctid from (select t1.ctid from t1 inner join
(values(1),(1)) x(x) on t1.a = x.x) t;

I've not read the patch yet, but I understand what it's trying to
achieve. My feelings about the patch are that it would be useful to
have. I think if someone needs this then they'll be very happythat
we've added it. I also think there should be a GUC to disable it, and
it should be enabled through the entire alpha/beta period, and we
should consider what the final value for it should be just before RC1.
It's a bit sad to exclude foreign tables, and I'm not too sure what
hurdles this leaves out for pluggable storage. No doubt we'll need to
disable the optimization for those too unless they can provide us with
some row identifier.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-04-01 14:54:34 Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Previous Message Stephen Frost 2018-04-01 13:39:02 Re: Add default role 'pg_access_server_files'