Re: Horribly slow hash join

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, Bruno Wolff III <bruno(at)wolff(dot)to>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Horribly slow hash join
Date: 2004-05-04 22:15:25
Message-ID: 87llk7yh36.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Modding by a *non* power of 2 (esp. a prime) mixes the bits quite well,
> and is likely faster than any multiple-instruction way to do the same.
>
> The quoted article seems to be by someone who has spent a lot of time
> counting assembly cycles and none at all reading the last thirty years
> worth of CS literature. Knuth's treatment of hashing has some actual
> math to it...

[incidentally, I just found that the quoted article was independently found by
Bruce Momjian who found it convincing enough to convert the hash_any table
over to it two years ago]

I just reviewed Knuth as well as C.L.R. and several papers from CACM and
SIGMOD.

It seems we have three options:

mod():

Pro: empirical research shows it the best algorithm for avoiding collisions

Con: requires the hash table be a prime size and far from a power of two.
This is inconvenient to arrange for dynamic tables as used in postgres.

multiplication method (floor(tablesize * remainder(x * golden ratio)))

Pro: works with any table size

Con: I'm not clear if it can be done without floating point arithmetic.
It seems like it ought to be possible though.

Universal hashing:

Pro: won't create gotcha situations where the distribution of data suddenly
and unexpectedly creates unexpected performance problems.

Con: more complex.

It would be trivial to switch the implementation from mod() to the
multiplicative method which is more suited to postgres's needs. However
universal hashing has some appeal. I hate the idea that a hash join could
perform perfectly well one day and suddenly become pessimal when new data is
loaded.

In a sense universal hashing is less predictable. For a DSS system that could
be bad. A query that runs fine every day might fail one day in an
unpredictable way even though the data is unchanged.

But in another sense it would be more predictable in that if you run the query
a thousand times the average performance would be very close to the expected
regardless of what the data is. Whereas more traditional algorithms have some
patterns of data that will consistently perform badly.

It's probably not worth it but postgres could maybe even be tricky and pretest
the parameters against the common values in the statistics table, generating
new ones if they fail to generate a nice distribution. That doesn't really
guarantee anything though, except that those common values would at least be
well distributed to start with.

--
greg

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Cyrille Bonnet 2004-05-04 23:34:41 very high CPU usage in "top", but not in "mpstat"
Previous Message James Thornton 2004-05-04 19:38:47 Re: Recommended File System Configuration