Quick Links

Re: hash index concurrency

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: hash index concurrency
Date:	2012-05-30 03:21:03
Message-ID:	CAMkU=1xUy52xnn3qt+Briqt_7M259QEqLQqRNeBkPjGv0a3-pQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, May 29, 2012 at 5:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64
> concurrent clients and got roughly 305,000 tps. Then, I created a
> hash index on pgbench_accounts (aid), dropped the primary key, and
> reran the test. I got roughly 104,000 tps. 'perf -g -e cs' suggested
> lock contention in _hash_first(), so I whacked out the calls to
> _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0,
> HASH_SHARE). With that change, I got roughly 270,000 TPS. Taking a
> little further, I then changed the definition of USELOCKING() to 0,
> effectively removing all the heavyweight page locks. With that
> change, I got 324,000 tps. Also, with this change, the CPU is fully
> saturated - about 77% user time, 23% system time.

Did the data fit in shared_buffers, or only in RAM?

>
> I don't immediately have a proposal to deal with this, but it seems
> that if we want good hash index performance under high concurrency, we
> need to work a bit harder.

This was a hobby horse of mine a couple of years ago, but I never got
much traction. The main question I have is, what do we even want hash
indexes to be? NBTree is very good, has been extensively optimized,
and extensively tested. If there is a niche left for hash indexes,
what is it? Is it just very large keys which don't do well in BTrees,
or something else?

If we really want to embrace a niche other than very large keys, I
would say rip out the dynamic bucket splitting altogether. Specify
the number of buckets when the index is built, and if you later change
your mind then use concurrent index to make a new larger one and then
drop the old one. Not needing to deal with dynamic buffer splitting
would make both highly concurrent access and Write-Ahead Logging far
simpler to implement. (It seems to me that solving either concurrency
or WAL is not going to get very far, they both need to be solved
together--or at least the design of the first to be implemented needs
to take into account the future implementation of the other).

Barring that:

1) on each main bucket page, record what the level of split for that
bucket is. That way you don't need to interlock the heavy locks
between the meta page and the main bucket page. Rather, when you
leave the meta page you remember not just what bucket you are heading
for but also how far it has been split. Then when you get there, if
the split level you find doesn't match what you remember, you have to
go back to the meta page and try again. Indeed, this way you not only
don't need to heavy-lock the meta page, you don't even need to visit
it at all. Just store the look-up table in the relcache, and then you
only need to reload it from the metapage when you realize it is out of
date. But, you still need to heavy lock the bucket itself. But
implementing this does require an on-disk change to the hash index
format.

2) Only support bitmap scans and not ordinary tid scans (the way gin
indexes already do). Now you never need to pass execution back out of
the hash am while you are still in a bucket, so you don't need heavy
weight locks at all. You would still need to interlock LWLocks as you
move down the chain of overflow pages, but that should be much more
concurrent.

Cheers,

Jeff

In response to

hash index concurrency at 2012-05-30 00:19:31 from Robert Haas

Responses

Re: hash index concurrency at 2012-05-30 03:54:13 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2012-05-30 03:54:13	Re: hash index concurrency
Previous Message	Peter Geoghegan	2012-05-30 01:29:19	Re: Uh, I change my mind about commit_delay + commit_siblings (sort of)