Re: Hash Anti Join performance degradation

From: panam <panam(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-26 16:08:07
Message-ID: 1306426087061-4429125.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Hi all,

Cédric Villemain-3 wrote:
>
> without explaining further why the antijoin has bad performance
> without cluster, I wonder why you don't use this query :
>
> SELECT b.id,
> max(m.id)
> FROM box b, message m
> WHERE m.box_id = b.id
> GROUP BY b.id;
>
> looks similar and fastest.
>
I actually did use a similar strategy in the meantime (during my problem
with the "left join" query we are talking about here all the time) for
mitigation.
It was
SELECT MAX(e.id) FROM event_message e WHERE e.box_id = id
and it performed worse in comparison to the "left join" query in the general
case (i.e. before my problems began).
At the end of this post is an explanation why I think I cannot use the
solution you suggested above.

Kevin Grittner wrote:
>
> Each connection can allocate work_mem, potentially several times.
> On a machines without hundreds of GB of RAM, that pair of settings
> could cause severe swapping.
>
Indeed, thanks for the warning. These settings are not for production but to
exclude a performance degradation because of small cache sizes.

Kevin Grittner wrote:
>
> I think you would need a left join to actually get identical
> results:
>
> SELECT b.id, max(m.id)
> FROM box b
> LEFT JOIN message m ON m.box_id = b.id
> GROUP BY b.id;
>
> But yeah, I would expect this approach to be much faster. Rather
> easier to understand and harder to get wrong, too.
>
>
Correct, it is much faster, even with unclustered ids.
However, I think I cannot use it because of the way that query is generated
(by hibernate).
The (simplyfied) base query is just

SELECT b.id from box

the subquery

(SELECT m1.id FROM message m1
LEFT JOIN message m2
ON (m1.box_id = m2.box_id AND m1.id < m2.id )
WHERE m2.id IS NULL AND m1.box_id = b.id) as lastMessageId

is due to a hibernate formula (containing more or less plain SQL) to
determine the last message id for that box. It ought to return just one row,
not multiple. So I am constrained to the subquery in all optimization
attemps (I cannot combine them as you did), at least I do not see how. If
you have an idea for a more performant subquery though, let me know, as this
can easily be replaced.

Thanks for your help and suggestions
panam

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Hash-Anti-Join-performance-degradation-tp4420974p4429125.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-05-26 16:14:38 Re: "errno" not set in case of "libm" functions (HPUX)
Previous Message Euler Taveira de Oliveira 2011-05-26 16:00:05 Re: BUG #6041: Unlogged table was created bad in slave node

Browse pgsql-performance by date

  From Date Subject
Next Message Dave Johansen 2011-05-26 16:21:45 Re: LIMIT and UNION ALL
Previous Message Merlin Moncure 2011-05-26 16:02:16 Re: The shared buffers challenge