Re: Hash index todo list item

From: Tom Raney <twraney(at)comcast(dot)net>
To: Kenneth Marshall <ktm(at)rice(dot)edu>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Hash index todo list item
Date: 2007-09-26 05:43:18
Message-ID: 46F9F176.1060705@comcast.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Kenneth Marshall wrote:
> On Tue, Sep 25, 2007 at 03:35:47PM +0100, Gregory Stark wrote:
>
>> "Kenneth Marshall" <ktm(at)rice(dot)edu> writes:
>>
>>
>>> On Thu, Sep 20, 2007 at 05:12:45PM -0700, Tom Raney wrote:
>>>
>>>
>>>> Using our implementation, build times and index sizes are
>>>> comparable with btree index build times and index sizes.
>>>>
>>> ...
>>> That is super! (and timely)
>>>
>> It is pretty super. I have a few comments to raise but don't take it to be too
>> negative, it sounds like this is a big step forward towards making hash
>> indexes valuable.
>>
>> Firstly in the graphs it seems the index size graph has an exponential x-axis
>> but a linear y-axis. This makes it hard to see what I would expect to be
>> pretty much linear growth. The curves look exponential which would mean linear
>> growth but of course it's hard to tell.
>>
>> Also, the growth in the time chart looks pretty much linear. That seems weird
>> since I would expect there would be a visible extra component since sort times
>> are n-log(n). Perhaps you need to test still larger tables to see that though.
>>
>> In any case it's clear from the results you have there that the change is a
>> positive one and fixes a fundamental problem with the hash index build code.
>>
>> Something else you should perhaps test is indexing something which is
>> substantially larger than hash function output. A hash function is going to
>> make the most sense when indexing something like strings for which you want to
>> avoid the long comparison costs. Especially consider testing this on a UTF8
>> locale with expensive comparisons (like a CJK locale for example).
>>
>> Note that the bottom line for the problems with hash indexes is that the
>> current implementation doesn't offer any advantages over btree indexes. Hash
>> indexes need to be not only as good of a choice as btree indexes but
>> significantly better a significantly better choice at least for some specific
>> circumstances.
>>
>> Also, if you're going to submit a patch you should check out a copy of the CVS
>> HEAD and work from that. I don't think there are huge differences in the area
>> of hash indexes though. But in most other areas you would be spending quite a
>> bit of time dealing details which have changed since.
>>
>> Finally note that we're in the final throes of the 8.3 feature freeze.
>> Normally any patch submitted now would be held until 8.3 is released and
>> development on 8.4 is started. I could imagine an exception being made for
>> hash indexes since they're so moribund currently but probably not. The flip
>> side of that argument is that there's not much point in making an exception
>> for something which will only be really useful once further work is done in
>> the same area.
>>
>>
>
> Although I am very excited about this patch, I do not see any real value
> in including it in 8.3. As you mention, we need to to have a hash index
> implementation that outperforms btree in some problem regime and that is
> currently not the case. I have just recently started the process of
> gathering ideas and having discussions on various approaches to making
> hash indexes more performant and we have a number of ideas on which to
> start our investigation. I do think that this patch will make the testing
> and evaluation, that will be needed to truly improve the hash index, much
> much easier.
>
> Regards,
> Ken
>
>
We're glad to contribute and be a part of Postgres. The patch has been
posted to pgsql-patches(at)postgresql(dot)org(dot)

Speeding up the creation time of hash indexes on non-trivial relations
was our goal. This will allow some interesting performance tests of the
hash index on very large relations. It may be that the near constant
lookup time of the hash index outperforms the Btree index for some large
data sets and for certain types of data and distributions.

Sincerely,
Tom Raney

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2007-09-26 06:03:33 Re: plpgsql TABLE patch
Previous Message Luke Lonergan 2007-09-26 05:40:07 Re: top for postgresql (ptop?)