From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)postgreSQL(dot)org |
Cc: | Joseph Shraibman <jks(at)selectacast(dot)net> |
Subject: | Hash joins vs small-integer join values |
Date: | 2007-05-31 19:38:05 |
Message-ID: | 11487.1180640285@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I was idly thinking about Joseph Shraibman's problem here:
http://archives.postgresql.org/pgsql-general/2007-05/msg01011.php
in which a large hash join seemed to be blowing out memory.
By chance I tried the following test case:
js=# create table ml (jid int);
CREATE TABLE
js=# insert into ml select random()*1000 from generate_series(1,185391404);
INSERT 0 185391404
js=# create table tempjr (id int);
CREATE TABLE
js=# insert into tempjr select random()*1000 from generate_series(1,60000);
INSERT 0 60000
js=# analyze ml;
ANALYZE
js=# select count(*) from tempjr join ml on (jid=id) group by jid;
Since I hadn't remembered to increase work_mem beyond the default, this
set up a hash join with 4111 buckets in each of 8192 batches, which
didn't seem too awfully unreasonable, so I let it go. Imagine my horror
as I watched it stuff all 185 million ml rows into batch 4365.
Naturally, when it got to trying to process that batch, the in-memory
hashtable blew out real good. I'm not certain this is what happened to
Joseph, since I don't know the stats of his jid column, but in any case
it's got to be fixed. Hash join is a probabilistic algorithm, so there
will always be some input distributions for which it sucks, but I don't
think we can tolerate "uniformly distributed on the integers 0-N" as
being one of them.
The problem comes from the rather simplistic assignment of bucket and
batch numbers in ExecHashGetBucketAndBatch():
* Note: on-the-fly increases of nbatch must not change the bucket number
* for a given hash code (since we don't move tuples to different hash
* chains), and must only cause the batch number to remain the same or
* increase. Our algorithm is
* bucketno = hashvalue MOD nbuckets
* batchno = (hashvalue DIV nbuckets) MOD nbatch
* where nbuckets should preferably be prime so that all bits of the
* hash value can affect both bucketno and batchno.
* nbuckets doesn't change over the course of the join.
This would be fine if the hashvalues were reasonably randomly
distributed over all uint32 values, but take a look at hashint4 ---
it's just a one's-complement:
Datum
hashint4(PG_FUNCTION_ARGS)
{
PG_RETURN_UINT32(~PG_GETARG_UINT32(0));
}
Two inputs that differ by 1 will have hash values also differing by 1.
Therefore, in my test case with 4111 buckets, consecutive ranges of 4111
input values map to the same batch --- different buckets in the batch,
but the same batch. My example with inputs 0..999 would have mapped to
either 1 or 2 batches depending on luck. With a more realistic
work_mem, nbuckets would have been larger, making this problem worse not
better.
8.1 and up are broken this way; in 8.0 and before we were calculating
the batch number in a different way that doesn't seem vulnerable to
this particular failure mode.
Arguably, the problem here is a chintzy hash function, and we should fix
it by making the integer hash functions use hash_any(). I'm inclined to
do that for 8.3. The problem is that this is not a back-patchable
answer, because changing the hash functions would corrupt existing hash
indexes. The best idea I can come up with for the back branches is
to make ExecHashGetBucketAndBatch do hash_any internally, say
if (nbatch > 1)
{
*bucketno = hashvalue % nbuckets;
/* since nbatch is a power of 2, can do MOD by masking */
- *batchno = (hashvalue / nbuckets) & (nbatch - 1);
+ *batchno = hash_any(&hashvalue, sizeof(int32)) & (nbatch - 1);
}
else
{
*bucketno = hashvalue % nbuckets;
*batchno = 0;
}
Comments, better ideas?
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Joshua D. Drake | 2007-05-31 21:03:04 | EXPLAIN feature request |
Previous Message | Martijn van Oosterhout | 2007-05-31 18:59:53 | Re: SQLF Optimization question |