Skip site navigation (1) Skip section navigation (2)

Building Hash Index by Presorting Tuples

From: twraney(at)comcast(dot)net
To: pgsql-hackers(at)postgresql(dot)org
Subject: Building Hash Index by Presorting Tuples
Date: 2007-07-30 22:36:12
Message-ID: 073020072236.19529.46AE67DC0004F62700004C492206998499970A020E9D999B@comcast.net (view raw or flat)
Thread:
Lists: pgsql-hackers
Hi,

We are trying to sort the index tuples before inserting them into hash buckets, to improve build speed.  

Here is our plan:

1. Build a spool that contains all the index tuples to be inserted into the buckets. - this is done.

2. sort the index tuples in the spool according to the bucket number to which they should belong. This results in accessing a bucket once and only once.  

3. For (2) to work, we need an estimate of the number of buckets. This is done.

4. After sorting the index tuples, insert them into hash in bucket order. 

Our challenge: we need to determine the final bucket number for the itup (index tuple). 

1. to do the above, we need to apply a mask to the hash value of the index tuple. first, we calculate the hash value of the index tuple. then, we calculate the mask using:

(1 << (ceiling(log 2 (Estimate of buckets needed))))-1

So, if we need 6 buckets, the mask would be 7 or binary 111.  If we needed 100 buckets, the mask would be 127 or binary 1111111.   If we AND this mask to the hash of the key, we only recognize the least   sig. bits needed to do the compare.

A 32 bit hash value may look like:  10110101001010101000010101010101

Let's say we just need 6 buckets, apply the mask 111 and we get:

10110101001010101000010101010101 (the hash value of the key)  
00000000000000000000000000000111 (the mask &)
 --------------------------------  
00000000000000000000000000000101 (the resulting bucket number = 5)

If we needed 100 buckets, the calculation would look like:

10110101001010101000010101010101 (the hash value of the key)   
00000000000000000000000001111111 (the mask &)
--------------------------------    
00000000000000000000000001010101 (the resulting bucket number = 85) 

2. however, in practice when we apply a mask of value say, 1111(binary) our resulting bucket number is not evenly distrubuted.

3. do we look for a better hash function? or can we modify the existing hash?  Comments are welcome.

-Tom


pgsql-hackers by date

Next:From: Decibel!Date: 2007-07-31 01:09:46
Subject: Re: ascii() for utf8
Previous:From: Alvaro HerreraDate: 2007-07-30 20:47:58
Subject: Re: Quick idea for reducing VACUUM contention

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group