Building Hash Index by Presorting Tuples

From: twraney(at)comcast(dot)net
To: pgsql-hackers(at)postgresql(dot)org
Subject: Building Hash Index by Presorting Tuples
Date: 2007-07-30 22:36:12
Message-ID: 073020072236.19529.46AE67DC0004F62700004C492206998499970A020E9D999B@comcast.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

We are trying to sort the index tuples before inserting them into hash buckets, to improve build speed.

Here is our plan:

1. Build a spool that contains all the index tuples to be inserted into the buckets. - this is done.

2. sort the index tuples in the spool according to the bucket number to which they should belong. This results in accessing a bucket once and only once.

3. For (2) to work, we need an estimate of the number of buckets. This is done.

4. After sorting the index tuples, insert them into hash in bucket order.

Our challenge: we need to determine the final bucket number for the itup (index tuple).

1. to do the above, we need to apply a mask to the hash value of the index tuple. first, we calculate the hash value of the index tuple. then, we calculate the mask using:

(1 << (ceiling(log 2 (Estimate of buckets needed))))-1

So, if we need 6 buckets, the mask would be 7 or binary 111. If we needed 100 buckets, the mask would be 127 or binary 1111111. If we AND this mask to the hash of the key, we only recognize the least sig. bits needed to do the compare.

A 32 bit hash value may look like: 10110101001010101000010101010101

Let's say we just need 6 buckets, apply the mask 111 and we get:

10110101001010101000010101010101 (the hash value of the key)
00000000000000000000000000000111 (the mask &)
--------------------------------
00000000000000000000000000000101 (the resulting bucket number = 5)

If we needed 100 buckets, the calculation would look like:

10110101001010101000010101010101 (the hash value of the key)
00000000000000000000000001111111 (the mask &)
--------------------------------
00000000000000000000000001010101 (the resulting bucket number = 85)

2. however, in practice when we apply a mask of value say, 1111(binary) our resulting bucket number is not evenly distrubuted.

3. do we look for a better hash function? or can we modify the existing hash? Comments are welcome.

-Tom

Browse pgsql-hackers by date

  From Date Subject
Next Message Decibel! 2007-07-31 01:09:46 Re: ascii() for utf8
Previous Message Alvaro Herrera 2007-07-30 20:47:58 Re: Quick idea for reducing VACUUM contention