BUG #2930: Hash join abyssmal with many null fields.

From: "Maciej Babinski" <maciej+postgres(at)apathy(dot)killer-robot(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #2930: Hash join abyssmal with many null fields.
Date: 2007-01-24 19:14:50
Message-ID: 200701241914.l0OJEou1005770@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 2930
Logged by: Maciej Babinski
Email address: maciej+postgres(at)killer-robot(dot)net
PostgreSQL version: 8.2
Operating system: CentOS 4.4 (linux 2.6.9)
Description: Hash join abyssmal with many null fields.
Details:

Hash join of columns with many null fields is very slow unless the null
fields are commented out. Steps to reproduce:

CREATE TABLE a (id integer, join_id integer);
CREATE TABLE b (id integer, join_id integer);

INSERT INTO a (id) SELECT generate_series(1,10000);
INSERT INTO b (id) SELECT generate_series(1,10000);

ANALYZE a;
ANALYZE b;

EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN b ON a.join_id = b.join_id;
EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN b ON a.join_id = b.join_id AND
b.join_id IS NOT NULL;

Here are the results of the two explains:

test1=# EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN b ON a.join_id =
b.join_id;
EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN b ON a.join_id = b.join_id AND
b.join_id IS NOT NULL;
QUERY PLAN
----------------------------------------------------------------------------
--------------------------------------
Hash Left Join (cost=170.00..1590.01 rows=10000 width=16) (actual
time=23.100..9451.778 rows=10000 loops=1)
Hash Cond: (a.join_id = b.join_id)
-> Seq Scan on a (cost=0.00..145.00 rows=10000 width=8) (actual
time=0.014..11.071 rows=10000 loops=1)
-> Hash (cost=145.00..145.00 rows=10000 width=8) (actual
time=21.999..21.999 rows=10000 loops=1)
-> Seq Scan on b (cost=0.00..145.00 rows=10000 width=8) (actual
time=0.011..10.679 rows=10000 loops=1)
Total runtime: 9460.944 ms
(6 rows)

test1=# EXPLAIN ANALYZE SELECT * FROM a LEFT JOIN b ON a.join_id = b.join_id
AND b.join_id IS NOT NULL;
QUERY PLAN
----------------------------------------------------------------------------
--------------------------------
Hash Left Join (cost=145.00..340.01 rows=10000 width=16) (actual
time=1.970..32.452 rows=10000 loops=1)
Hash Cond: (a.join_id = b.join_id)
-> Seq Scan on a (cost=0.00..145.00 rows=10000 width=8) (actual
time=0.033..10.326 rows=10000 loops=1)
-> Hash (cost=145.00..145.00 rows=1 width=8) (actual time=1.924..1.924
rows=0 loops=1)
-> Seq Scan on b (cost=0.00..145.00 rows=1 width=8) (actual
time=1.922..1.922 rows=0 loops=1)
Filter: (join_id IS NOT NULL)
Total runtime: 41.495 ms
(7 rows)

Note that the two ON clauses are logically identical, but the second query
is much faster than the first. The first query will slow down exponentially
with the number of rows: 200,000 rows yielded a runtime of over ten hours.
Thanks!

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Feng Chen 2007-01-24 19:25:12 Postgres Stored Procedure Call Function Return Type OID Caching Problem
Previous Message Tom Lane 2007-01-24 19:04:23 Re: Function returns wrong data after datatype change