Re: [PERFORM] Hash Anti Join performance degradation

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:43:29
Message-ID: BANLkTi=6pGhGOUddirQmEqavJ3_T-y4xFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

On Wed, Jun 1, 2011 at 4:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Jun 1, 2011 at 4:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Because of the way that a bitmap heap scan works, the rows are
>>> guaranteed to be loaded into the hash table in physical order, which
>>> means (in the fast case) that the row with the largest "id" value gets
>>> loaded last.  And because ExecHashTableInsert pushes each row onto the
>>> front of its hash chain, that row ends up at the front of the hash
>>> chain.  Net result: for all the outer rows that aren't the one with
>>> maximum id, we get a joinqual match at the very first entry in the hash
>>> chain.  Since it's an antijoin, we then reject that outer row and go
>>> on to the next.  The join thus ends up costing us only O(N) time.
>
>> Ah!  Make sense.  If I'm reading your explanation right, this means
>> that we could have hit a similar pathological case on a nestloop as
>> well, just with a data ordering that is the reverse of what we have
>> here?
>
> Yeah.  It's just chance that this particular data set, with this
> particular ordering, happens to work well with a nestloop version
> of the query.  On average I'd expect nestloop to suck even more,
> because of more per-inner-tuple overhead.

I guess the real issue here is that m1.id < m2.id has to be evaluated
as a filter condition rather than a join qual. That tends to perform
poorly in general, which is why rewriting this using min() or max() or
ORDER BY .. LIMIT 1 was elsewhere recommended. I've occasionally had
cause to join on something other than equality in cases not
susceptible to such rewriting, so it would be neat to improve this
case, but it's not likely to make it to the top of my list any time
soon.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-06-01 20:47:36 Re: [PERFORM] Hash Anti Join performance degradation
Previous Message Tom Lane 2011-06-01 20:35:16 Re: [PERFORM] Hash Anti Join performance degradation

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2011-06-01 20:47:36 Re: [PERFORM] Hash Anti Join performance degradation
Previous Message Tom Lane 2011-06-01 20:35:16 Re: [PERFORM] Hash Anti Join performance degradation