Re: [PERFORM] Hash Anti Join performance degradation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 03:47:42
Message-ID: 7913.1306900062@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> With respect to the root of the issue (why does the anti-join take so
> long?), my first thought was that perhaps the OP was very unlucky and
> had a lot of values that hashed to the same bucket. But that doesn't
> appear to be the case.

Well, yes it is. Notice what the subquery is doing: for each row in
"box", it's pulling all matching "box_id"s from message and running a
self-join across those rows. The hash join condition is a complete
no-op. And some of the box_ids have hundreds of thousands of rows.

I'd just write it off as being a particularly stupid way to find the
max(), except I'm not sure why deleting just a few thousand rows
improves things so much. It looks like it ought to be an O(N^2)
situation, so the improvement should be noticeable but not amazing.

And if you force it to not use a hashjoin, suddenly things are better.
Nestloop should also be O(N^2) in this situation, but seemingly it
avoids whatever weird corner case we are hitting here.

As Cedric says, the lack of any CHECK_FOR_INTERRUPTS in this loop is
also problematic. I'm not sure that right there is an ideal place
to put it, but we need one somewhere in the loop.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2011-06-01 03:50:38 Re: BUG #6046: select current_date crashes postgres
Previous Message Tom Lane 2011-06-01 03:34:32 Re: Any idea for serializing INSERTING SERIAL column?

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Smith 2011-06-01 06:14:11 Re: Delete performance
Previous Message Robert Haas 2011-06-01 02:58:16 Re: [PERFORM] Hash Anti Join performance degradation