Re: [PERFORM] Hash Anti Join performance degradation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, panam <panam(at)gmx(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] Hash Anti Join performance degradation
Date: 2011-06-01 20:25:56
Message-ID: 24247.1306959956@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, May 31, 2011 at 11:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I'd just write it off as being a particularly stupid way to find the
>> max(), except I'm not sure why deleting just a few thousand rows
>> improves things so much. It looks like it ought to be an O(N^2)
>> situation, so the improvement should be noticeable but not amazing.

> Yeah, this is what I was getting at, though perhaps I didn't say it
> well. If the last 78K rows were particularly pathological in some
> way, that might explain something, but as far as one can see they are
> not a whole heck of a lot different from the rest of the data.

Well, I wasted a lot more time on this than I should have, but the
answer is: it's a pathological data ordering.

The given data file loads the message rows approximately in "id" order,
and in fact once you lop off the ones with id > 2550000, it's
sufficiently in order that the highest id is also physically last in
the table, at least for all of the box_ids that have large numbers of
entries. Now, the tested query plan loads the hash table like this:

->  Hash  (cost=13685.86..13685.86 rows=28511 width=16) (actual time=176.286..176.286 rows=211210 loops=1)
      Buckets: 4096  Batches: 1  Memory Usage: 9901kB
      ->  Bitmap Heap Scan on message m2  (cost=537.47..13685.86 rows=28511 width=16) (actual time=23.204..124.624 rows=211210 loops=1)
            Recheck Cond: (box_id = $1)
            ->  Bitmap Index Scan on "message_box_Idx"  (cost=0.00..530.34 rows=28511 width=0) (actual time=21.974..21.974 rows=211210 loops=1)
                  Index Cond: (box_id = $1)

Because of the way that a bitmap heap scan works, the rows are
guaranteed to be loaded into the hash table in physical order, which
means (in the fast case) that the row with the largest "id" value gets
loaded last. And because ExecHashTableInsert pushes each row onto the
front of its hash chain, that row ends up at the front of the hash
chain. Net result: for all the outer rows that aren't the one with
maximum id, we get a joinqual match at the very first entry in the hash
chain. Since it's an antijoin, we then reject that outer row and go
on to the next. The join thus ends up costing us only O(N) time.

However, with the additional rows in place, there are a significant
number of outer rows that don't get a match at the first hash chain
entry, and we're spending more like O(N^2) time. I instrumented things
for the specific case of box_id = 69440, which is the most common
box_id, and got these results:

   2389 got match of join quals at probe 208077
      1 got match of join quals at probe 1
    175 got match of join quals at probe 208077
    273 got match of join quals at probe 1
     21 got match of join quals at probe 208077
      1 got match of join quals at probe 1
     24 got match of join quals at probe 208077
      6 got match of join quals at probe 1
    157 got match of join quals at probe 208077
      1 got match of join quals at probe 1
     67 got match of join quals at probe 208077
     18 got match of join quals at probe 1
      1 generate null-extended tuple after 211211 probes
 208075 got match of join quals at probe 1
      1 got match of join quals at probe 208077

(This is a "uniq -c" summary of a lot of printfs, so the first column
is the number of consecutive occurrences of the same printout.) Even
though a large majority of the outer rows still manage to find a match
at the first probe, there remain about 2800 that don't match there,
because they've got pretty big ids, and so they traipse through the hash
chain until they find the genuinely largest id, which is unfortunately
way down there -- the 208077th chain entry, in fact. That results in
about half a billion more ExecScanHashBucket and ExecQual calls than
occur in the easy case (and that's just for this one box_id).

So it's not that the full data set is pathologically bad, it's that the
reduced set is pathologically good. O(N^2) performance is what you
should expect for this query, and that's what you're actually getting
with the full data set.
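
If you want to see the effect in isolation, here's a toy model (not
PostgreSQL code: it assumes a single bucket, unique ids, and an anti-join
test of "is there a row with a larger id?") showing how load order alone
moves the same probe loop between O(N) and O(N^2):

    #include <stdio.h>

    #define N 20000

    static int chain[N];            /* chain[0] is the chain head */

    /*
     * Each "outer" row id 1..N walks the chain until it finds a larger
     * id (anti-join reject) or falls off the end (the row survives).
     */
    static long
    total_probes(void)
    {
        long probes = 0;

        for (int outer = 1; outer <= N; outer++)
        {
            for (int j = 0; j < N; j++)
            {
                probes++;
                if (chain[j] > outer)   /* bigger id found: reject */
                    break;
            }
        }
        return probes;
    }

    int
    main(void)
    {
        /* Ids inserted in ascending order (the reduced data set's lucky
         * ordering): prepending leaves the max id at the chain head, so
         * nearly every outer row is rejected at its first probe. */
        for (int i = 0; i < N; i++)
            chain[i] = N - i;
        printf("max id at chain head: %ld probes\n", total_probes());

        /* Ids inserted in descending order: the max id goes in first and
         * gets pushed ever deeper, ending at the chain tail, so each row
         * walks a chain prefix proportional to its own id. */
        for (int i = 0; i < N; i++)
            chain[i] = i + 1;
        printf("max id at chain tail: %ld probes\n", total_probes());

        return 0;
    }

The first arrangement needs about 2N probes in total; the second needs
about N^2/2.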

Also, I noted earlier that performance seemed a good deal better with a
NestLoop plan. The reason for that is that NestLoop doesn't have the
reversal of inner row ordering that's caused by prepending entries to
the hash chain, so the very largest row id isn't 208077 entries into the
list for it, but only 211211-208077 = 3134 entries in. But it still
manages to eliminate most outer rows at the first probe, because there's
a fairly large value at that end of the dataset too.

I don't see anything much that we could or should do about this. It
just depends on the order in which things appear in the hash chain,
and even if we fooled around with that ordering, we'd only be moving
the improbably-lucky behavior from one case to some other case.

We do need to look into putting a CHECK_FOR_INTERRUPTS call in here
somewhere, though. I'm inclined to think that right before the
ExecScanHashBucket call is the best place. The reason that nest and merge
joins don't show a comparable non-responsiveness to cancels is that they
always call a child plan node at the equivalent place, and ExecProcNode
has got a CHECK_FOR_INTERRUPTS. So we ought to check for interrupts
at the point of "fetching a tuple from the inner child plan", and
ExecScanHashBucket is the equivalent thing in this logic. Cedric's
suggestion of putting it before the switch would get the job done, but
it would result in wasting cycles during unimportant transitions from
one state machine state to another.
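
In code terms, the idea would be roughly this (a fragment of
ExecHashJoin's state machine in nodeHashjoin.c, not a compilable unit,
and not necessarily what an eventual patch would look like):

    case HJ_SCAN_BUCKET:

        /*
         * Probing the hash chain is this plan's equivalent of fetching
         * a tuple from the inner child plan; nestloop and merge joins
         * do that through ExecProcNode, which has a
         * CHECK_FOR_INTERRUPTS.  Checking here keeps a long hash chain
         * cancellable without adding overhead to the cheaper state
         * transitions, as a check before the switch would.
         */
        CHECK_FOR_INTERRUPTS();

        if (!ExecScanHashBucket(node, econtext))
        {
            /* no more chain entries match this outer tuple */
            node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
            continue;
        }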

regards, tom lane
