Re: Hash index todo list item

From: "Ben Tilly" <btilly(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Kenneth Marshall" <ktm(at)rice(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash index todo list item
Date: 2007-09-04 00:20:34
Message-ID: acc274b30709031720k5d16b8fh3e7188e461129967@mail.gmail.com
Lists: pgsql-hackers

On 9/3/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>
> "Kenneth Marshall" <ktm(at)rice(dot)edu> writes:
>
> > On Sun, Sep 02, 2007 at 10:41:22PM -0400, Tom Lane wrote:
> >> Kenneth Marshall <ktm(at)rice(dot)edu> writes:
> >> > ... This is the rough plan. Does anyone see anything critical that
> >> > is missing at this point?
> >>
> >> Sounds pretty good. Let me brain-dump one item on you: one thing that
> >> hash currently has over btree is the ability to handle index items up
> >> to a full page. Now, if you go with a scheme that only stores hash
> >> codes and not the underlying data, you can not only handle that but
> >> improve on it;
>
> I think that would be a big selling point for hash indexes. It would let you
> index even toasted data which are larger than a page. I'm not sure whether you
> can make it work for unique indexes though. But for non-unique indexes I think
> it would be a solid win and mean you cover a set of use cases quite distinct
> from btree indexes.
>
> > - Hash lookup is O(1) while btree is O(logN).
>
> That's not really true. There's a tradeoff between insertion time and lookup
> time. In order to get O(1) lookups you need to work pretty hard to maintain
> the hash table including spending a lot of time reorganizing it when you grow
> it. If you don't want to spend the time on inserts then you end up with
> buckets and the hash table is basically just a linear speedup to whatever
> algorithm you use to scan the buckets.

These facts notwithstanding, average insert performance is still O(1)
if you grow the hash exponentially every time it needs to grow.
Suppose, for example, that you double the number of buckets at each
split. Then in the worst case, right after a split, every key has been
inserted once, every key has been moved at least once, half have been
moved twice, a quarter three times, and so on. The ratio of moves to
keys is therefore 1 + 1/2 + 1/4 + ..., a well-known geometric series
converging on 2.
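To make the amortized argument concrete, here is a small sketch (my own illustration, not PostgreSQL's implementation) that counts how many times keys get moved when a table doubles each time it fills; the ratio of moves to keys stays below the series limit of 2:

```python
# Sketch: count element moves in a table that doubles on every fill,
# illustrating the 1 + 1/2 + 1/4 + ... amortized-move argument.

def moves_per_key(n_keys, initial_capacity=1):
    """Insert n_keys one at a time; double capacity whenever full,
    re-moving every existing key. Return total moves / total keys."""
    capacity = initial_capacity
    size = 0
    moves = 0
    for _ in range(n_keys):
        if size == capacity:
            moves += size      # every existing key is rehashed into the new table
            capacity *= 2
        size += 1
    return moves / n_keys

# The ratio stays bounded (below 2) no matter how large the table grows:
for n in (10, 1_000, 100_000):
    print(n, moves_per_key(n))
```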

True, a lot of work needs to be done when you cross the threshold.
Life would be simpler if you could just take a lock while you split
the hash, but you can't do that in a busy transactional database.
If you want to be clever about it, though, you build into your hash
implementation the ability to search one or two hash locations. While
both are present, all inserts go into the new one, and all deletes and
updates are applied to both. That lets a background job reorganize the
hash while the database continues to use it.
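The two-location scheme above can be sketched as follows (a toy illustration with hypothetical names, using in-memory dicts to stand in for on-disk hash buckets; real index code would migrate bucket pages, not individual keys):

```python
# Sketch of incremental reorganization with two hash locations:
# inserts go only to the new table; lookups, updates, and deletes
# consult both, so a background job can drain the old table safely.

class IncrementalHash:
    def __init__(self):
        self.old = {}   # table being drained by the background job
        self.new = {}   # table receiving all new inserts

    def insert(self, key, value):
        self.new[key] = value

    def lookup(self, key):
        if key in self.new:              # newest data wins
            return self.new[key]
        return self.old.get(key)

    def delete(self, key):
        self.old.pop(key, None)          # deletes hit both locations
        self.new.pop(key, None)

    def migrate_some(self, batch=2):
        """Background job step: move a few entries from old to new."""
        for key in list(self.old)[:batch]:
            self.new[key] = self.old.pop(key)

h = IncrementalHash()
h.old = {"a": 1, "b": 2}    # pretend a split is already in progress
h.insert("c", 3)
h.migrate_some()
print(h.lookup("a"), h.lookup("c"))   # all keys stay visible mid-migration
```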

> > - What about multi-column indexes? The current implementation
> > only supports 1 column.
>
> That seems kind of weird. It seems obvious that you mix the three hashes
> together which reduces it to the solved problem.
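The mixing idea quoted above can be sketched like this (an illustrative combiner of my own choosing; PostgreSQL's actual mixing function would differ):

```python
# Sketch: reduce a multi-column key to the single-column case by
# folding the per-column hashes into one combined 32-bit hash.

def combine_hashes(*hashes):
    """Mix several 32-bit column hashes into one, order-sensitively."""
    h = 0
    for x in hashes:
        h = (h * 31 + x) & 0xFFFFFFFF   # classic multiply-and-add mix
    return h

# Column order matters, so (a, b) and (b, a) hash differently:
print(combine_hashes(0x12345678, 0x9ABCDEF0))
print(combine_hashes(0x9ABCDEF0, 0x12345678))
```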

That raises a very random thought. One of the nicer features of
Oracle is the ability to have function-based indexes. So you could
index, say, trim(lower(person.name)). There are a *lot* of practical
situations where that comes in handy. The best workaround that I can
think of for not having that is to have a column defined to hold the
result of the function, maintain that column with a trigger, then
index that column. Which works, but is inelegant. (It also requires
storing completely redundant data.)

Is there any prospect of postgres acquiring that functionality?

Ben
