Re: Performance improvement for joins where outer side is unique

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2017-03-14 05:37:24
Message-ID: CAKJS1f8iD5mDMBsCRvV9XnrwQy9xj4R2qyTQLFaCJGAnG_UOHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 14 March 2017 at 11:35, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
wrote:

> On 14 March 2017 at 07:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> [ getting back to this patch finally... ]
>>
>> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
>> > I've attached a patch which implements this, though only for
>> > MergeJoin, else I'd imagine we'd also need to ensure all proofs used
>> > for testing the uniqueness were also hash-able too. I added some XXX
>> > comments in analyzejoin.c around the mergeopfamilies == NIL tests to
>> > mention that Merge Join depends on all the unique proof quals having
>> > mergeopfamilies. This also assumes we'll never use some subset of
>> > mergejoin-able quals for a merge join, which could be an interesting
>> > area in the future, as we might have some btree index on a subset of
>> > those columns to provide pre-sorted input. In short, it's a great
>> > optimisation, but I'm a little scared we might break it one day.
>>
>> Umm ... it's broken already isn't it? See the comment for
>> generate_mergejoin_paths:
>>
>> Thanks for looking at this again.
>
> Yeah confirmed. It's broken. I guess I just need to remember in the Path
> if we got all the join quals, although I've not looked in detail what the
> best fix is. I imagine it'll require storing something else in the JoinPath.
>

OK, so I've spent some more time on this and I've come up with a solution.

Basically the solution is to not skip mark and restore when joinquals
contains any items. This is a requirement for SEMI joins, but overly
cautious for unique joins. However, I think it'll apply in most cases when
Merge Join will be a win, and that's when there's a btree index on both
sides of the join which covers all columns in the join condition. I
carefully commented this part of the code to explain what can be done to
have it apply in more cases.

This caused me to go and change the following code too:

@@ -2676,6 +2688,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath
*path,
* it off does not entitle us to deliver an invalid plan.
*/
else if (innersortkeys == NIL &&
+ !((extra->inner_unique || path->jpath.jointype == JOIN_SEMI) &&
+ list_length(path->jpath.joinrestrictinfo) ==
+ list_length(path->path_mergeclauses)) &&
!ExecSupportsMarkRestore(inner_path))
path->materialize_inner = true;

I've been staring at this for a while and I think it's correct. If it's
wrong then we'd get "ERROR: unrecognized node type: <n>" from ExecMarkPos().

Here we must make sure and never skip materializing the inner side, if
we're not going to skip mark and restore in ExecInitMergeJoin().

I've attached a patch which fixes the issue.

Also added a regression test to try to make sure it stays fixed. There's a
few other mostly cosmetic fixes in there too.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
unique_joins_2017-03-14a.patch application/octet-stream 95.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Beena Emerson 2017-03-14 05:44:09 Re: increasing the default WAL segment size
Previous Message Pavel Stehule 2017-03-14 05:16:25 ToDo: listagg is in ANSI/SQL:2016