Quick Links

Re: Hash index todo list item

From:	Shreya Bhargava <shreya_bhargav(at)yahoo(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: Hash index todo list item
Date:	2007-11-06 06:47:17
Message-ID:	148682.2744.qm@web53408.mail.re2.yahoo.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

"Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances."

We performed some probe tests on our patch on
hash index and the original btree index to compare the
performance between the two. We used a 80 million tuple table.
The table contained single integer attribute and the data
range was 0 - 100000000. (generated by a random generator).
We did the following:

1. Populate the table with 80 million tuples.
2. Create HASH index on the table.
3. clear both linux cache & psql buffers.
(exiting psql and restarting it cleared the psql buffers;
to clear linux cache, we used drop_cache command)
4. start psql
5. select on an integer in the range of values in the table.
(all test numbers were big ones, like 98934599)
6. record the time.
7. exit psql.
8. drop caches.(as described above)
9. repeat 4-8 for different numbers.
10. Drop Hash index.
11. Create Btree index and repeat 3-9.

The results are as shown below. All trials are in ms.
Count is the number of such tuples in the table.

HASH INDEX:
Number Count Trial1 Trial2 Trial3
21367898 2 152.0 129.5 131.1
95678699 2 140.6 145.6 137.5
99899799 2 148.0 147.4 152.6
97677899 2 142.0 153.5 112.0
94311123 2 137.6 146.3 141.3
67544455 2 141.6 144.6 152.7
88877677 2 122.1 123.1 130.8
88788988 2 150.2 150.0 171.7
44333344 1 119.9 116.3 117.6
34267878 1 97.5 99.9 114.8
55489781 2 115.7 117.2 145.3
97676767 1 169.5 168.5 181.7
99767564 1 142.7 133.6 132.7
21367989 4 198.0 193.2 186.7

BTREE INDEX

Number Trial1 Trial2 Trial3
21367898 145.5 145.6 150.6
95678699 177.5 170.0 176.4
99899799 175.4 181.2 185.4
97677899 136.9 149.0 130.8
94311123 179.0 175.3 166.3
67544455 161.7 162.2 170.4
88877677 138.7 135.2 149.9
88788988 165.7 179.3 186.3
44333344 166.0 172.8 184.3
34267878 159.1 168.8 147.8
55489781 183.2 195.4 185.1
97676767 147.2 143.6 132.6
99767564 167.8 178.3 162.1
21367989 232.3 238.1 216.9
>From the results obtained, the average of all the hash probes is 141.8ms, the average for btree is 168.5, a difference of about 27.The standard deviations are about 23, so this is a statistically significant difference.

Our prediction that the hash index would take on the average one
probe for about 10ms and the btree would take three probes for about 30 ms
or a difference of about 20ms was pretty well shown by the difference we
got of about 27. Hope these data points will help with some questions
about the performance differences between Hash and Btree index.

Regards,
Shreya

Gregory Stark <stark(at)enterprisedb(dot)com> wrote: "Kenneth Marshall" writes:

> On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
>
>> Using our implementation, build times and index sizes are
>> comparable with btree index build times and index sizes.
>...
> That is super! (and timely)

It is pretty super. I have a few comments to raise but don't take it to be too
negative, it sounds like this is a big step forward towards making hash
indexes valuable.

Firstly in the graphs it seems the index size graph has an exponential x-axis
but a linear y-axis. This makes it hard to see what I would expect to be
pretty much linear growth. The curves look exponential which would mean linear
growth but of course it's hard to tell.

Also, the growth in the time chart looks pretty much linear. That seems weird
since I would expect there would be a visible extra component since sort times
are n-log(n). Perhaps you need to test still larger tables to see that though.

In any case it's clear from the results you have there that the change is a
positive one and fixes a fundamental problem with the hash index build code.

Something else you should perhaps test is indexing something which is
substantially larger than hash function output. A hash function is going to
make the most sense when indexing something like strings for which you want to
avoid the long comparison costs. Especially consider testing this on a UTF8
locale with expensive comparisons (like a CJK locale for example).

Note that the bottom line for the problems with hash indexes is that the
current implementation doesn't offer any advantages over btree indexes. Hash
indexes need to be not only as good of a choice as btree indexes but
significantly better a significantly better choice at least for some specific
circumstances.

Also, if you're going to submit a patch you should check out a copy of the CVS
HEAD and work from that. I don't think there are huge differences in the area
of hash indexes though. But in most other areas you would be spending quite a
bit of time dealing details which have changed since.

Finally note that we're in the final throes of the 8.3 feature freeze.
Normally any patch submitted now would be held until 8.3 is released and
development on 8.4 is started. I could imagine an exception being made for
hash indexes since they're so moribund currently but probably not. The flip
side of that argument is that there's not much point in making an exception
for something which will only be really useful once further work is done in
the same area.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

In response to

Re: Hash index todo list item at 2007-09-25 14:35:47 from Gregory Stark

Responses

Re: Hash index todo list item at 2007-11-06 08:27:42 from Gokulakannan Somasundaram
Re: Hash index todo list item at 2007-11-06 10:59:54 from Heikki Linnakangas
Re: Hash index todo list item at 2007-11-06 11:07:04 from Jens-Wolfhard Schicke
Re: Hash index todo list item at 2007-11-06 14:34:42 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Gregory Stark	2007-11-06 06:50:48	Re: Visibility map thoughts
Previous Message	Jeremy Drake	2007-11-06 05:57:49	Re: Open items for 8.3