
Re: BUG #2930: Hash join abyssmal with many null fields.

From: Maciej Babinski <maciej(at)killer-robot(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Maciej Babinski <maciej(at)apathy(dot)killer-robot(dot)net>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2930: Hash join abyssmal with many null fields.
Date: 2007-01-26 17:47:45
Message-ID: 45BA3EC1.1050307@killer-robot.net
Lists: pgsql-bugs

Tom Lane wrote:
> Maciej Babinski <maciej(at)apathy(dot)killer-robot(dot)net> writes:
>> Tom Lane wrote:
>>> I see no bug here.  AFAICT your "much faster" query gets that way by
>>> having eliminated all the candidate join rows on the B side.
> 
>> The additional clause eliminates no rows beyond what the existing
>> clause would.  Any row eliminated by "b.join_id IS NOT NULL" could not
>> possibly have satisfied "a.join_id = b.join_id".
> 
> Hmm.  You assume that the = operator is strict, which is probably true,
> but the hash join code isn't assuming that.
> 

Yes, both queries return identical results.
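
For reference, a minimal sketch of the two query shapes being compared
(same tables as in my test case; the added clause is redundant because
integer = is strict, but only the second query runs quickly):

  -- Slow: every NULL join key hashes to bucket 0, so the probe walks
  -- one long chain for each NULL outer row.
  SELECT * FROM a JOIN b ON a.join_id = b.join_id;

  -- Fast: the extra clause filters the NULL rows out of b before the
  -- hash table is built.
  SELECT * FROM a JOIN b ON a.join_id = b.join_id
                        AND b.join_id IS NOT NULL;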

> It might be worth checking for the case, though.  What's happening,
> since we go ahead and put the null rows into the hash table, is that
> they all end up in the same hash chain because they all get hash code 0.
> And then that very long chain gets searched for each null outer row.
> If we knew the join operator is strict we could discard null rows
> immediately on both sides.
> 
>> Please note that if the join columns are not null, but still produce
>> no matches for the join, the results are fast without the need for an
>> extra clause in the join:
> 
> Yeah, because the rows get reasonably well distributed into different
> hash buckets.  The optimizer will avoid a hash if it sees the data is
> not well-distributed, but IIRC it's not considering nulls when it
> makes that decision.

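As an aside, whether the join operator is strict is already recorded in
the system catalogs, so that check should be cheap. A sketch for plain
integer equality (assuming the stock catalog layout):

  -- The operator's underlying function carries the strictness flag.
  SELECT p.proisstrict
    FROM pg_operator o
    JOIN pg_proc p ON p.oid = o.oprcode
   WHERE o.oprname = '='
     AND o.oprleft = 'int4'::regtype
     AND o.oprright = 'int4'::regtype;
  -- proisstrict is true for int4eq, so rows with NULL join keys can
  -- never match and could be discarded from both sides up front.
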

You're right. Filling one join column with '1' and the other with '2'
yields a plan without a hash join. It may be worth noting that while my
initial problem case (200,000 rows in more complicated tables) and the
sample test case I put together (10,000 rows in simple tables) both
produce the slow hash join plan, changing my test case to use 20,000
rows produces a plan that avoids the hash join as well:

maciej_test=# EXPLAIN ANALYZE SELECT * FROM a JOIN b ON a.join_id = b.join_id;
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
  Merge Join  (cost=3435.54..3635.55 rows=1 width=16) (actual time=100.822..100.822 rows=0 loops=1)
    Merge Cond: ("outer".join_id = "inner".join_id)
    ->  Sort  (cost=1717.77..1767.77 rows=20000 width=8) (actual time=65.401..86.333 rows=20000 loops=1)
          Sort Key: a.join_id
          ->  Seq Scan on a  (cost=0.00..289.00 rows=20000 width=8) (actual time=0.005..21.119 rows=20000 loops=1)
    ->  Sort  (cost=1717.77..1767.77 rows=20000 width=8) (never executed)
          Sort Key: b.join_id
          ->  Seq Scan on b  (cost=0.00..289.00 rows=20000 width=8) (never executed)
  Total runtime: 101.891 ms
(9 rows)

Time: 103.051 ms
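
In case it's useful, here is a sketch of the test setup (the exact DDL
wasn't included in this message, so the schema below is an
approximation):

  -- Assumed minimal schema for the two tables.
  CREATE TABLE a (id serial, join_id integer);
  CREATE TABLE b (id serial, join_id integer);

  -- All-NULL join keys: at 10,000 rows per table this produced the
  -- slow hash join; at 20,000 rows the planner switches to the merge
  -- join shown above.
  INSERT INTO a (join_id) SELECT NULL FROM generate_series(1, 20000);
  INSERT INTO b (join_id) SELECT NULL FROM generate_series(1, 20000);
  ANALYZE a;
  ANALYZE b;

  EXPLAIN ANALYZE SELECT * FROM a JOIN b ON a.join_id = b.join_id;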

Thanks for your time,
Maciej Babinski
