Re: Hash Anti Join performance degradation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: panam <panam(at)gmx(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Anti Join performance degradation
Date: 2011-05-24 20:38:33
Message-ID: 15099.1306269513@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

panam <panam(at)gmx(dot)net> writes:
> In my DB, there is a large table containing messages and one that contains
> message boxes.
> Messages are assigned to boxes via a child parent link m->b.
> In order to obtain the last message for a specific box, I use the following
> SQL:

> SELECT m1.id FROM message m1 LEFT JOIN message m2 ON (m1.box_id = m2.box_id
> AND m1.id < m2.id) WHERE m2.id IS NULL AND m1.box_id = id;

BTW, this query doesn't actually match the EXPLAIN outputs...

> So from my limited experience, the only significant difference I see is that
> the Hash Anti Join takes a lot more time in plan 2, but I do not understand
> why.

Whatever's going on is below the level that EXPLAIN can show. I can
think of a couple of possibilities:

1. The "extra" rows in the slower case all manage to come out to the
same hash value, or some very small number of distinct hash values, such
that we spend a lot of time searching a single hash chain. But it's
hard to credit that adding 0.4% more rows could result in near 100x
slowdown, no matter how bad their distribution.

2. There's some inefficiency in the use of the temp files, though again
it's far from clear why your two cases would be noticeably different
there. Possibly enabling log_temp_files would tell you something useful.

One other thing I'm not following is how come it's using hash temp files
at all, when you claim in your later message that you've got work_mem
set to 1GB. It should certainly not take more than a couple meg to hold
20K rows at 16 payload bytes per row. You might want to check whether
that setting actually took effect.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-05-24 20:38:52 Re: about EDITOR_LINENUMBER_SWITCH
Previous Message Peter Eisentraut 2011-05-24 20:38:31 Re: about EDITOR_LINENUMBER_SWITCH

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Smith 2011-05-24 21:46:06 Re: Performance degradation of inserts when database size grows
Previous Message Tomas Vondra 2011-05-24 19:20:52 Re: Performance degradation of inserts when database size grows