Optimizer refuses to hash join

From: Stan Bielski <bielski(at)ece(dot)cmu(dot)edu>
To: pgsql-performance(at)postgresql(dot)org
Subject: Optimizer refuses to hash join
Date: 2004-07-27 19:38:05
Message-ID: Pine.LNX.3.96L.1040727152629.26493I-100000@elysium.pdl.cmu.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I having a great deal of difficulty getting postgres to do a hash join.
Even if I disable nestloop and mergejoin in postgres.conf, the optimizer
still refuses to select hash join. This behavior is killing my
performance.

Postgres version is 7.3.2 and relevant tables are vacuum analyzed.

Here's an overview of what I'm doing:

I have one table of network logs ordered by time values. The other table
is a set of hosts (approximately 60) that are infected by a worm. I want
to do this query on the dataset:

standb=# explain SELECT count (allflow_tv_sobig.tv_s) FROM
allflow_tv_sobig, blaster_set WHERE allflow_tv_sobig.src =
blaster_set.label AND allflow_tv_sobig.tv_s >= 1060101118::bigint and
allflow_tv_sobig.tv_s < 1060187518::bigint;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=185785.06..185785.06 rows=1 width=32)
-> Merge Join (cost=174939.71..184986.38 rows=319472 width=32)
Merge Cond: ("outer".label = "inner".src)
-> Index Scan using blaster_set_x on blaster_set
(cost=0.00..3.67 rows=66 width=12)
-> Sort (cost=174939.71..178073.92 rows=1253684 width=20)
Sort Key: allflow_tv_sobig.src
-> Index Scan using allflow_tv_sobig_x on allflow_tv_sobig
(cost=0.00..47955.63 rows=1253684 width=20)
Index Cond: ((tv_s >= 1060101118::bigint) AND (tv_s <
1060187518::bigint))
(8 rows)

Basically I just want to use the smaller table as a filtering mechanism so
that I only get resulted for hosts in that table. Rather than do the
sensible thing, which is scan the list of infected hosts, then scan the
traffic table and ignore entries that aren't in the first list, the
optimizer insists on SORTING the table of network traffic according to
source address. Considering that this table is very large, these queries
are taking forever.

Doing it in a nested loop, while it doesn't require sorting, still takes a
very long time as well.

Is there anyway that I can force the optimizer to do this the right way,
aside from adding each IP manually to a disgustingly bloated 'where'
clause?

Thanks,
-S

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Stephan Szabo 2004-07-28 03:45:34 Re: best way to fetch next/prev record based on index
Previous Message Merlin Moncure 2004-07-27 19:20:49 Re: best way to fetch next/prev record based on index