Re: Performance improvement for joins where outer side is unique

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2015-02-27 06:23:26
Message-ID: CAApHDvrgFwypvZmWKrgTZSy=f+sewvmfP_GL0h6Pod0cQYAKng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3 February 2015 at 22:23, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hi, I had a look on this patch. Although I haven't understood
> whole the stuff and all of the related things, I will comment as
> possible.
>
>
Great, thank you for taking the time to look and review the patch.

> Performance:
>
> I looked on the performance gain this patch gives. For several
> on-memory joins, I had gains about 3% for merge join, 5% for hash
> join, and 10% for nestloop (@CentOS6), for simple 1-level joins
> with aggregation similar to what you mentioned in previous
> mail like this.
>
> =# SELECT count(*) FROM t1 JOIN t2 USING (id);
>
> Of course, the gain would be trivial when returning many tuples,
> or scans go to disk.
>
>
That's true, but joins where rows are filtered after the join condition
will win here too:

For example select * from t1 inner join t2 on t1.id=t2.id where t1.value >
t2.value;

Also queries with a GROUP BY clause. (Since grouping is always performed
after the join :-( )

I haven't measured the loss by additional computation when the
> query is regarded as not a "single join".
>
>
I think this would be hard to measure, but likely if it is measurable then
you'd want to look at just planning time, rather than planning and
execution time.

>
> Explain representation:
>
> I don't like that the 'Unique Join:" occupies their own lines in
> the result of explain, moreover, it doesn't show the meaning
> clearly. What about the representation like the following? Or,
> It might not be necessary to appear there.
>
> Nested Loop ...
> Output: ....
> - Unique Jion: Yes
> -> Seq Scan on public.t2 (cost = ...
> - -> Seq Scan on public.t1 (cost = ....
> + -> Seq Scan on public.t1 (unique) (cost = ....
>
>
>
Yeah I'm not too big a fan of this either. I did at one evolution of the
patch I had "Unique Left Join" in the join node's line in the explain
output, but I hated that more and changed it just to be just in the VERBOSE
output, and after I did that I didn't want to change the join node's line
only when in verbose mode. I do quite like that it's a separate item for
the XML and JSON explain output. That's perhaps quite useful when the
explain output must be processed by software.

I'm totally open for better ideas on names, but I just don't have any at
the moment.

> Coding:
>
> The style looks OK. Could applied on master.
> It looks to work as you expected for some cases.
>
> Random comments follow.
>
> - Looking specialjoin_is_unique_join, you seem to assume that
> !delay_upper_joins is a sufficient condition for not being
> "unique join". The former simplly indicates that "don't
> commute with upper OJs" and the latter seems to indicate that
> "the RHS is unique", Could you tell me what kind of
> relationship is there between them?
>

The rationalisation around that are from the (now changed) version of
join_is_removable(), where the code read:

/*
* Must be a non-delaying left join to a single baserel, else we aren't
* going to be able to do anything with it.
*/
if (sjinfo->jointype != JOIN_LEFT ||
sjinfo->delay_upper_joins)
return false;

I have to admit that I didn't go and investigate why delayed upper joins
cannot be removed by left join removal code, I really just assumed that
we're just unable to prove that a join to such a relation won't match more
than one outer side's row. I kept this to maintain that behaviour as I
assumed it was there for a good reason.

>
> - The naming "unique join" seems a bit obscure for me, but I
> don't have no better idea:( However, the member name
> "has_unique_join" seems to be better to be "is_unique_join".
>
>
Yeah, I agree with this. I just can't find something short enough that
means "based on the join condition, the inner side of the join will never
produce more than 1 row for any single outer row". Unique join was the best
I came up with. I'm totally open for better ideas.

I agree that is_unique_join is better than has_unique_join. I must have
just copied the has_eclass_joins struct member without thinking too hard
about it. I've now changed this in my local copy of the patch.

- eclassjoin_is_unique_join involves seemingly semi-exhaustive
> scan on ec members for every element in joinlist. Even if not,
> it might be thought to be a bit too large for the gain. Could
> you do the equivelant things by more small code?
>
>
I'd imagine some very complex queries could have many equivalence classes
and members, though I hadn't really thought that this number would be so
great that processing here would suffer much. There's quite a few fast
paths out, like the test to ensure that both rels are mentioned
in ec_relids. Though for a query which is so complex to have a great number
of eclass' and members, likely there would be quite a few relations
involved and planning would be slower anyway. I'm not quite sure how else I
could write this and still find the unique join cases each time. We can't
really just give up half way through looking.

Thanks again for the review.

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2015-02-27 06:30:46 Re: Providing catalog view to pg_hba.conf file - Patch submission
Previous Message Andrew Gierth 2015-02-27 05:56:18 Re: Reduce pinning in btree indexes