Re: Hash index todo list item

From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Hash index todo list item
Date: 2007-09-03 17:36:42
Message-ID: 20070903173641.GH16568@it.is.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 03, 2007 at 10:33:54AM +0100, Simon Riggs wrote:
> >
> > This is the rough plan. Does anyone see anything critical that
> > is missing at this point? Please send me any suggestions for test
> > data and various performance test ideas, since I will be working
> > on that first.
>
> Sounds good.
>
> I'd be particularly interested in large indexes, say ~ 0.5 - 2GB. There
> are likely to be various effects apparent as the indexes grow. It would
> be too easy to do all the tests with smaller indexes and miss things.
>
> Other factors are:
> - volatility
> - concurrency
>
> My general experience is that hash-based indexes are better when the
> range of inputs is relatively well-known, allowing a fast lookup. If
> that is the only benefit of hash indexes, a flexible hashing scheme may
> simply weaken the benefit-case for using them. If that's true, should
> the index build process examine the key values in the data to determine
> the best parameters to use? Kind of ANALYZE before build.
>
> My current feeling is that they ought to be very good at handling
> read-mostly situations such as privilege checking or UPDATE-intensive
> situations such as Customer-Current-Credit tracking, when the number of
> customers is large.
>
> It might also be worth looking at lossy hash indexes, i.e. the index
> stores only the block numbers. That would need to be part of the
> discussion around how lossy we will allow indexes to be.
>
> We currently have two kinds of full text index with different
> concurrency use cases, so it should be acceptable to have hash indexes
> have a clear benefit in one use case but a clear loss in another.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com
>

Simon,

Thank you for your input. I would like to include some tests with large
indexes too. Do you have any ideas for a test corpus or should we try
and generate the test data programatically? Many people in the literature
of text retrieval use the TREC* data for at least some of their runs. I
am going to check at work to see if the campus has access to the data,
otherwise I will do some web crawling to generate some sample data. I
have just posted a reply to Tom Lane with some further ideas for consideration
in the new hash index support. Like you, I suspect that volatile data that
results in many index changes may not work well with hash indexes, in general.
PostgreSQL has the additional burden of needing to access both the index and
the data heap. Obviously, the less I/O that is needed the better the
performance is likely to be. The new HOT functionality plus clustering the
table data on the hash index would effectively organize the table into the
"hash buckets" which could help with reducing both the churn in the index
as well as in the tables.

Regards,
Ken

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2007-09-03 18:01:08 Re: tsearch filenames unlikes special symbols and numbers
Previous Message Florian G. Pflug 2007-09-03 17:33:22 Re: Per-function GUC settings: trickier than it looked