Re: Using each rel as both outer and inner for JOIN_ANTI

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Emre Hasegeli <emre(at)hasegeli(dot)com>
Subject: Re: Using each rel as both outer and inner for JOIN_ANTI
Date: 2022-08-02 07:13:55
Message-ID: CAMbWs4_nLMSX75YtjECe77h6cTxDG2GFRX6cT1tg1=ugtEAWbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jul 31, 2022 at 12:07 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Richard Guo <guofenglinux(at)gmail(dot)com> writes:
> > [ v4-0001-Using-each-rel-as-both-outer-and-inner-for-anti-j.patch ]
>
> I took a quick look through this. The executor changes are indeed
> impressively short, but that's largely because you've paid zero
> attention to updating obsoleted comments. For example, in
> nodeHashjoin.c there are lots of references to right/full joins
> that likely now need to cover right-anti. I'm not sure that the
> empty-rel startup optimizations are correct for this case, either.

Thanks for the review! Yeah, you're right. I neglected to update the
related comments. Will do that in the new patch. For the empty-rel
startup optimizations, since the right-anti join also does null-fill on
inner relation (the HJ_FILL_INNER case), I think we cannot skip building
the hash table even when the outer rel is completely empty.

> I don't have a warm feeling about the planner changes being correct:
> it looks like what you mostly did was to treat JOIN_RIGHT_ANTI
> identically to JOIN_ANTI everywhere, which is surely wrong.
> As an example of this, optimizer/README mentions
>
> Similarly, parameterized paths do not normally get preference in add_path
> for having cheap startup cost; that's seldom of much value when on the
> inside of a nestloop, so it seems not worth keeping extra paths solely
> for
> that. An exception occurs for parameterized paths for the RHS relation
> of
> a SEMI or ANTI join: in those cases, we can stop the inner scan after the
> first match, so it's primarily startup not total cost that we care about.
>
> For RIGHT_ANTI it'd become startup of the outer scan that counts, but
> I don't think you've gotten that right here.

I think JOIN_RIGHT_ANTI behaves more like JOIN_RIGHT, except that
JOIN_RIGHT returns a matched tuple while JOIN_RIGHT_ANTI does not. For
each outer tuple, right-anti needs to scan the inner rel for every match
and mark its hashtable entry. So I think the right-anti join should not
belong to the case 'in those cases, we can stop the inner scan after the
first match, so it's primarily startup not total cost that we care
about.' Am I thinking it correctly?

> [ wanders away wondering if JOIN_RIGHT_SEMI should become a thing ... ]

Maybe this is something we can do. Currently for the query below:

# explain select * from foo where a in (select c from bar);
QUERY PLAN
-------------------------------------------------------------------------
Hash Semi Join (cost=154156.00..173691.29 rows=10 width=8)
Hash Cond: (foo.a = bar.c)
-> Seq Scan on foo (cost=0.00..1.10 rows=10 width=8)
-> Hash (cost=72124.00..72124.00 rows=5000000 width=4)
-> Seq Scan on bar (cost=0.00..72124.00 rows=5000000 width=4)
(5 rows)

I believe we can get a cheaper plan if we are able to swap the outer and
inner for SEMI JOIN and use the smaller 'foo' as inner rel.

Thanks
Richard

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2022-08-02 07:19:28 Re: Skip partition tuple routing with constant partition key
Previous Message Alvaro Herrera 2022-08-02 06:44:04 Re: [Commitfest 2022-07] is Done!