hash index work

From: Neil Conway <neilc(at)samurai(dot)com>
To: pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: hash index work
Date: 2005-05-13 05:13:48
Message-ID: 4284378C.2080205@samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

This patch implements two changes to hash indexes:

- rather than storing the values of the index keys, we just store their
hash value instead. The hash opclasses have been marked "lossy" so that
the executor will recheck the scan qualifier to catch hash collisions.

- within a page of a hash bucket, the entries are stored sorted by hash
value. When doing a hash index scan, we can then binary search within a
page rather than using a linear search.

Unfortunately, I haven't been able to measure any performance
improvement from either of these changes :-\

I've run tests on two types of tables: int4 keys and relatively short
textual keys (I don't think there's much point testing longer keys: the
patches should obviously be a win there, so I'm concerned about speeding
up the common case). In both cases, the difference in index scan
performance isn't significant: it's about the same with and without the
patches. The indexes have been on tables with 1 million rows (of int4)
and 400,000 rows (of text). I would test with larger tables, but
creating hash indexes is so slow that even these sizes are painful
enough. Since indexes of this size should be cached in memory the
reduction in I/O for the text case isn't being measured (the index is
about 30% smaller than it is when we store the full text value), but
even so I would have expected the binary search to produce noticeable
gains...

Perhaps I've made a coding error that has pessimized the performance
somehow (although nothing obvious shows up in profiles), or perhaps the
linear scan of the pages of the hash bucket isn't a performance problem
in the first place, at least for types with a cheap equality operator.
Some oprofile data seems to support this -- for example, in the int4
case, we spend less than 0.5% of the time in _hash_next and children,
and only 1.8% of the runtime in hashgetmulti and children (with the
vanilla CVS HEAD code).

-Neil

Attachment Content-Type Size
hash_index_opt-10.patch text/x-patch 35.9 KB

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Neil Conway 2005-05-13 06:40:39 Re: CSV consecutive newline bug
Previous Message Alvaro Herrera 2005-05-13 03:06:59 Refactoring in lock.c