Skip site navigation (1) Skip section navigation (2)

Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?
Date: 2010-10-27 19:25:42
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
On Oct 26, 2010, at 8:48 PM, Tom Lane wrote:

> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I'm also a bit suspicious of the fact that the hash condition has a
>> cast to text on both sides, which implies, to me anyway, that the
>> underlying data types are not text.  That might mean that the query
>> planner doesn't have very good statistics, which might mean that the
>> join selectivity estimates are wackadoo, which can apparently cause
>> this problem:
> Um ... you're guilty of the same thing as the OP, ie not showing how
> you got this example.  But I'm guessing that it was something like
> create table little as select * from generate_series(1,10) a;
> create table big as select * from generate_series(1,100000) a;
> ... wait for auto-analyze of big ...
> explain select * from little, big where little.a = big.a;
> Here, big is large enough to prod autovacuum into analyzing it,
> whereas little isn't.  So when the planner runs, it sees
> (1) big is known to have 100000 rows, and big.a is known unique;
> (2) little is estimated to have many fewer rows, but nothing is
>    known about the distribution of little.a.
> In this situation, it's going to prefer to hash big, because hash join
> behaves pretty nicely when the inner rel is uniformly distributed and
> the outer not, but not nicely at all when it's the other way round.
> It'll change its mind as soon as you analyze little, but it doesn't
> like taking a chance on an unknown distribution.  See cost_hashjoin
> and particularly estimate_hash_bucketsize.

The type of hash table will make a difference in how it behaves with skew.  Open addressing versus linking, etc. 

> I'm not convinced this explains Scott's results though --- the numbers
> he's showing don't seem to add up even if you assume a pretty bad
> distribution for the smaller rel.

Answering both messages partially:

The join is on varchar columns.  So no they are cast ::text because its from two slightly different varchar declarations to ::text.

The small relation in this case is unique on the key.  But I think postgres doesn't know that because there is a unique index on:

(id, name) and the query filters for id = 12 in the example, leaving name unique.  But postgres does not identify this as a unique condition for the key.  However, there are only about 150 distinct values of id, so even if 'name' collides sometimes across ids, there can be no more than 150 values that map to one key.

I gave a partial plan, the parent joins are sometimes anti-joins.  The general query form is two from the temp table:
An update to a main table where the unique index keys match the temp (inner join)
An insert into the main table where the unique index keys do not exist (NOT EXISTS query, results in anti-join).  The large relation is the one being inserted/updated into the main table.  
Both have the same subplan that takes all the time in the middle -- a hash that hashes the large table and probes from the small side.  I am still completely confused on how the hashjoin is calculating costs.  In one of my examples it seems to be way off.  

Why does hashjoin behave poorly when the inner relation is not uniformly distributed and the outer is?  In particular for anti-join this should be the best case scenario.

My assumption is this:  Imagine the worst possible skew on the small table.  Every value is in the same key and there are 20,000 entries in one list under one key.  The large table is in the outer relation, and probes for every element and has uniform distribution, 20 million -- one thousand rows for each of 20,000 keys.  
Inner Join: 
  The values that match the key join agains the list.  There is no faster way to join once the list is identified.  It is essentially nested loops per matching key.  1000 matches each output 20,000 rows.  If the hashjoin was reversed, then the inner relation would be the large table and the outer relation would be the small table.  There would be 20,000 matches that each output 1000 rows.
  The same thing happens, but additionally rows that match nothing are emitted.  If the relation that is always kept is the large one, it makes more sense to hash the small.
  Two relations, I'll call one "select" the other is "not exists".  The "select" relation is the one we are keeping, but only if its key does not match any key in the "not exists" relation.
Case 1:  The large uniform table is "select" and the small one "not exists"
  It is always optimal to have the small one as the inner hashed relation, no matter what the skew.  If the inner relation is the "not exists" table, only the key needs to be kept in the inner relation hash, the values or number of matches do not matter, so skew is irrelevant and actually makes it faster than even distribution of keys.
Case 2:  The small relation is the "select"
  Using the large "not exists" relation as the inner relation works well, as only the existence of a key needs to be kept.  Hashing the small "select" relation works if it is small enough.  Whenever a key matches from the outer relation, remove it from the hash, then at the end take what remains in the hash as the result.  This is also generally immune to skew.  

Am I missing something?  Is the Hash that is built storing each tuple in an open-addressed entry, and thus sensitive to key-value skew? or something like that?  If the hash only has one entry per key, with a linked list of values that match the key, I don't see how skew is a factor for hashjoin.  I am probably missing something.

> 			regards, tom lane

In response to


pgsql-performance by date

Next:From: Jon NelsonDate: 2010-10-27 19:29:08
Subject: Re: temporary tables, indexes, and query plans
Previous:From: Greg SmithDate: 2010-10-27 19:24:25
Subject: Re: AIX slow buffer reads

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group