Hash index build patch has *worse* performance at small table sizes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom Raney <twraney(at)comcast(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Hash index build patch has *worse* performance at small table sizes
Date: 2008-03-16 19:50:13
Message-ID: 2292.1205697013@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been reviewing the hash index build patch submitted here:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00154.php

Although it definitely helps on large indexes, it's actually
counterproductive on not-so-large ones. The test case I'm using
is random integers generated like this:
create table foo as select (random() * N)::int as f1
from generate_series(1,N);
select count(*) from foo; -- force hint bit updates
checkpoint;
then timing
create index fooi on foo using hash(f1);

Using all-default configuration settings on some not-very-new hardware,
at N = 1E6 I see

8.3.1: 30 sec
With pre-expansion of index (CVS HEAD): 24 sec
With sorting: 72 sec
To build a btree index on same data: 34 sec

Now this isn't amazingly surprising, because the original argument for
doing sorting was to improve locality of access to the index during
the build, and that only matters if you've got an index significantly
bigger than memory. If the index fits in RAM then the sort is pure
overhead.

The obvious response to this is to use the sorting approach only when
the estimated index size exceeds some threshold. One possible choice of
threshold would be shared_buffers (or temp_buffers for a temp index)
but I think that is probably too conservative, because in most scenarios
the kernel's disk cache is available too. Plus you can't tweak that
setting without a postmaster restart. I'm tempted to use
effective_cache_size, which attempts to measure an appropriate number
and can be set locally within the session doing the CREATE INDEX if
necessary. Or we could invent a new GUC parameter, but that is probably
overkill.

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2008-03-16 21:57:09 Rewriting Free Space Map
Previous Message Tom Lane 2008-03-16 18:20:01 Re: Single table forcing sequential scans on query plans