Re: Using each rel as both outer and inner for JOIN_ANTI

From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Emre Hasegeli <emre(at)hasegeli(dot)com>
Subject: Re: Using each rel as both outer and inner for JOIN_ANTI
Date: 2021-06-29 14:40:57
Message-ID: 3490300.zWRC6XL4ym@aivenronan
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Le mardi 29 juin 2021, 10:55:59 CEST Richard Guo a écrit :
> On Tue, Jun 29, 2021 at 3:55 PM Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:
> > > Thanks for the explanation. Attached is a demo code for the hash-join
> > > case, which is only for PoC to show how we can make it work. It's far
> > > from complete, at least we need to adjust the cost calculation for this
> > > 'right anti join'.
> >
> > I applied the patch and executed some queries. Hash Right Anti Joins
> > seem to be working correctly. Though, some of the tests are failing.
> > I guessed it's because the other join algorithms do not support right
> > anti join, but I couldn't reproduce it.
>
> Thanks for verifying this patch.

I also ran the tests on this patch, and can confirm the tests are failing.

The reason for that is that you request a new outer tuple whenever we have a
match, even when the outer tuple could match several tuples from the hash
table: we end up emitting the duplicates because we switched to another tuple
after the first match.

You can set up a simple test case like this:

create table t1 (id int);
create table t2 (id int);
insert into t1 select generate_series (1, 10000);
insert into t2 VALUES (1), (1), (-1), (-1);
analyze t1, t2;

set enable_hashjoin = off;
explain (analyze) select * from t2 where not exists (SELECT 1 FROM t1 where
t1.id = t2.id);
set enable_nestloop = off;
set enable_hashjoin = on;
explain (analyze) select * from t2 where not exists (SELECT 1 FROM t1 where
t1.id = t2.id);

Also, I'm not familiar enough with the hash join algorithm to know if the
approach of "scanning every outer tuple, skip emitting matching inner tuples"
would be correct, but this is the first problem I notice. Not getting into the
HJ_NEED_NEW_OUTER state when the join type is JOIN_RIGHT_ANTI seems to fix this
specific issue tough.

> I think we can basically use the same cost calculation as with anti
> joins, since they share the fact that the executor will stop after the
> first match. However, there are still some differences. Such as when we
> consider the number of tuples that will pass the basic join, I think we
> need to use unmatched inner rows, rather than unmatched outer rows.

Due to the fact we cannot just skip at the first match, I'm not sure this works
either.

>
> Thanks
> Richard

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message gkokolatos 2021-06-29 14:45:17 Teach pg_receivewal to use lz4 compression
Previous Message Alexey Kondratov 2021-06-29 14:34:49 Re: Supply restore_command to pg_rewind via CLI argument