Re: Removing unneeded self joins

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Removing unneeded self joins
Date: 2019-03-20 22:54:49
Message-ID: CAKJS1f8p-KiEujr12k-oa52JNWWaQUjEjNg+o1MGZk4mHBn_Rg@mail.gmail.com
Lists: pgsql-hackers

On Thu, 21 Mar 2019 at 01:20, Alexander Kuzmenkov
<a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:
> Let's recap the conditions when we can remove a self-join. It is when
> for each outer row, 1) at most one inner row matches the join clauses,
> and 2) it is the same row as the outer one. I'm not sure what (2) means
> precisely in a general case, but for a plain table, we can identify
> these rows by ctid. So when both sides have the same unique index with
> the same clauses, we conclude that we are always dealing with the same
> row (as identified by ctid) on both sides, hence the join can be
> replaced with a scan.
>
> The code I wrote just checks for the above conditions. The data we need
> for these checks is a byproduct of checking the relations for
> uniqueness, which we do anyway, so we just cache it for a negligible cost.
>
> I didn't write it in a more generic way because I don't understand the
> conditions for the generic case. In your DISTINCT example, the join can be
> removed indeed. But if we select some columns from the inner side apart
> from the join ones, we can't remove the join anymore:
>
> select * from t1, (select distinct on (a) a, b from t1) tt where t1.a =
> tt.a;
>
> I think this might be a different kind of optimization, where we remove
> the self-join if the inner side is unique, and no inner columns are
> selected besides the join ones.
>
>
> Also, reading your letter I realized that I don't commute the index
> clauses correctly before comparing them in is_unique_self_join, so I
> fixed this in the new version of the patch.
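
For reference, here's a minimal sketch of the removable case described
above, assuming a plain table with a single unique index (the reduced plan
in the comments is just what one would expect to see, not actual output
from the patch):

drop table if exists t1;
create table t1 (a int, b int);
create unique index on t1 (a);

-- The join clause x.a = y.a matches the unique index on (a) on both sides,
-- so each outer row can only join to itself (same ctid), and the join
-- could be replaced by a single scan, something like:
--   Seq Scan on t1 y
--     Filter: (a IS NOT NULL)
explain select * from t1 x, t1 y where x.a = y.a;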

I really just don't think checking that the unique indexes match is going
to cut it. You should be looking for a unique index that the join clauses
match on either side of the join, rather than checking each side
independently and then verifying that the indexes are the same ones.

Here's an example of what can go wrong with your current code:

drop table abc;
create table abc(a int, b int, c int);
create unique index on abc(a);
create unique index on abc(b);
create unique index on abc(c);
explain select * from abc a1 inner join abc a2 on a1.a=a2.b and a1.c=a2.c;
                        QUERY PLAN
---------------------------------------------------------
 Seq Scan on abc a2  (cost=0.00..35.50 rows=10 width=24)
   Filter: ((c IS NOT NULL) AND (a = b))
(2 rows)

The above seems fine, but let's try again, this time changing the order in
which the indexes are defined.

drop table abc;
create table abc(a int, b int, c int);
create unique index on abc(a);
create unique index on abc(c);
create unique index on abc(b);
explain select * from abc a1 inner join abc a2 on a1.a=a2.b and a1.c=a2.c;
                               QUERY PLAN
-----------------------------------------------------------------------
 Hash Join  (cost=61.00..102.11 rows=1 width=24)
   Hash Cond: ((a1.a = a2.b) AND (a1.c = a2.c))
   ->  Seq Scan on abc a1  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=30.40..30.40 rows=2040 width=12)
         ->  Seq Scan on abc a2  (cost=0.00..30.40 rows=2040 width=12)
(5 rows)

Oops. I think behaviour like this, which depends on the order the indexes
were created in, is not going to cut it. You could probably restrict the
join qual list to just the quals that have the same expr on either side of
the join; that way you could still use innerrel_is_unique() to check that
the inner rel is unique. You'll likely need to pass force_cache as false
though, since you don't want to cache non-uniqueness proven with only a
subset of the join quals; doing that could cause unique joins not to work
when the join search is done via GEQO. I also think this way would give
you the subquery GROUP BY / DISTINCT self-join removal for just about
free. However, there might be more cleanup to be done in that case... I've
not thought about that too hard.
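
For what it's worth, the subquery case I mean is something like the one
below; the plan shape mentioned in the comments is just what I'd expect to
see, not output from the patch:

drop table if exists t1;
create table t1 (a int, b int);

-- The inner side is made unique on (a) by the DISTINCT, the join clause
-- has the same expression (a) on both sides, and no other subquery columns
-- are referenced, so the join could in principle collapse to a scan of t1
-- with an "a IS NOT NULL" filter.
explain select t1.*
from t1, (select distinct a from t1) tt
where t1.a = tt.a;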

I'm going to set this to Waiting on Author, as that's quite a decent-sized
change.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
