Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2019-06-07 14:30:18
Message-ID: CA+TgmobqYECs_+bv+GuMZ+dqUPJpD3RpcZnXe6MyCrJ4MGy7QA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> I'm not sure I understand why you would need to compare the original
> tuples to the unmatched tuples file.

I think I was confused. Actually, I'm still not sure I understand this part:

> Then, you iterate again through the outer side a third time to join it
> to the unmatched tuples in the unmatched tuples file (from the first
> chunk) and write the following to a new unmatched tuples file:
> 5
> 9
> 11
> 11

and likewise here

> Then you iterate a fifth time through the outer side to join it to the
> unmatched tuples in the unmatched tuples file (from the second chunk)
> and write the following to a new unmatched tuples file:
> 11
> 11

So you refer to joining the outer side to the unmatched tuples file,
but how would that tell you which outer tuples had no matches on the
inner side? I think what you'd need to do is anti-join the unmatched
tuples file to the current inner batch. So the algorithm would be
something like:

for each inner batch:
for each outer tuple:
if tuple matches inner batch then emit match
if tuple does not match inner batch and this is the first inner batch:
write tuple to unmatched tuples file
if this is not the first inner batch:
for each tuple from the unmatched tuples file:
if tuple does not match inner batch:
write to new unmatched tuples file
discard previous unmatched tuples file and use the new one for the
next iteration

for each tuple in the final unmatched tuples file:
null-extend and emit

If that's not what you have in mind, maybe you could provide some
similar pseudocode? Or you can just ignore me. I'm not trying to
interfere with an otherwise-fruitful discussion by being the only one
in the room who is confused...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-06-07 14:47:50 Re: Avoiding hash join batch explosions with extreme skew and weird stats
Previous Message Tomas Vondra 2019-06-07 14:17:40 Re: Avoiding hash join batch explosions with extreme skew and weird stats