Skip site navigation (1) Skip section navigation (2)

Re: Hash Anti Join performance degradation

From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 18:13:22
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-hackerspgsql-performance
2011/5/26 panam <panam(at)gmx(dot)net>:
> Hi all,
> Cédric Villemain-3 wrote:
>> without explaining further why the antijoin has bad performance
>> without cluster, I wonder why you don't use this query :
>>                   max(
>> FROM box b, message m
>> WHERE m.box_id =
>> looks similar and fastest.
> I actually did use a similar strategy in the meantime (during my problem
> with the "left join" query we are talking about here all the time) for
> mitigation.
> It was
> SELECT MAX( FROM event_message e WHERE e.box_id = id
> and it performed worse in comparison to the "left join" query in the general
> case (i.e. before my problems began).
> At the end of this post is an explanation why I think I cannot use the
> solution you suggested above.
> Kevin Grittner wrote:
>>  Each connection can allocate work_mem, potentially several times.
>> On a machines without hundreds of GB of RAM, that pair of settings
>> could cause severe swapping.
> Indeed, thanks for the warning. These settings are not for production but to
> exclude a performance degradation because of small cache sizes.
> Kevin Grittner wrote:
>> I think you would need a left join to actually get identical
>> results:
>> SELECT, max(
>>   FROM box b
>>   LEFT JOIN message m ON m.box_id =
>>   GROUP BY;
>> But yeah, I would expect this approach to be much faster.  Rather
>> easier to understand and harder to get wrong, too.
> Correct, it is much faster, even with unclustered ids.
> However, I think I cannot use it because of the way that query is generated
> (by hibernate).
> The (simplyfied) base query is just
> SELECT from box
> the subquery
> (SELECT FROM message m1
>   LEFT JOIN message m2
>      ON (m1.box_id = m2.box_id  AND < )
>   WHERE IS NULL AND m1.box_id = as lastMessageId
> is due to a hibernate formula (containing more or less plain SQL) to
> determine the last message id for that box. It ought to return just one row,
> not multiple. So I am constrained to the subquery in all optimization
> attemps (I cannot combine them as you did), at least I do not see how. If
> you have an idea for a more performant subquery though, let me know, as this
> can easily be replaced.

In production, if you have a decent IO system, you can lower
random_page_cost and it may be faster using index (by default, with
the use case you provided it choose a seqscan). It can be a bit tricky
if you have to lower random_page_cost so much that it destroy others
query plan but increase the perf of the current one. if it happens,
post again :) (sometime need to change other cost parameters but it
needs to be handle with care)

I am not an hibernate expert, but I'll surprised if you can not drive
hibernate to do what you want.

> Thanks for your help and suggestions
> panam
> --
> View this message in context:
> Sent from the PostgreSQL - performance mailing list archive at
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:

Cédric Villemain               2ndQuadrant     PostgreSQL : Expertise, Formation et Support

In response to


pgsql-performance by date

Next:From: Tory M BlueDate: 2011-05-26 22:34:55
Subject: Performance block size.
Previous:From: panamDate: 2011-05-26 18:04:35
Subject: Re: Hash Anti Join performance degradation

pgsql-hackers by date

Next:From: Joshua D. DrakeDate: 2011-05-26 18:21:42
Subject: #PgWest 2011: CFP now open
Previous:From: Kevin GrittnerDate: 2011-05-26 18:05:55
Subject: Re: [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group