Re: hash index concurrency

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: hash index concurrency
Date: 2012-05-30 03:54:13
Message-ID: CA+TgmoZ6QGOd7n4rHP98QXXP-PfgajmnR1++p+M=6TKOoinj8Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 29, 2012 at 11:21 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, May 29, 2012 at 5:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I ran a SELECT-only pgbench test today on the IBM POWER7 box with 64
>> concurrent clients and got roughly 305,000 tps.  Then, I created a
>> hash index on pgbench_accounts (aid), dropped the primary key, and
>> reran the test.  I got roughly 104,000 tps. 'perf -g -e cs' suggested
>> lock contention in _hash_first(), so I whacked out the calls to
>> _hash_getlock(rel, 0, HASH_SHARE) and _hash_droplock(rel, 0,
>> HASH_SHARE).  With that change, I got roughly 270,000 tps.  Taking it a
>> little further, I then changed the definition of USELOCKING() to 0,
>> effectively removing all the heavyweight page locks.  With that
>> change, I got 324,000 tps.  Also, with this change, the CPU is fully
>> saturated - about 77% user time, 23% system time.
>
> Did the data fit in shared_buffers, or only in RAM?

shared_buffers.

>> I don't immediately have a proposal to deal with this, but it seems
>> that if we want good hash index performance under high concurrency, we
>> need to work a bit harder.
>
> This was a hobby horse of mine a couple of years ago, but I never got
> much traction.  The main question I have is, what do we even want hash
> indexes to be?  NBTree is very good, has been extensively optimized,
> and extensively tested.  If there is a niche left for hash indexes,
> what is it?  Is it just very large keys which don't do well in BTrees,
> or something else?

Well, TBH, I was hoping they'd be faster than btree.

> If we really want to embrace a niche other than very large keys, I
> would say rip out the dynamic bucket splitting altogether.

Yeah, maybe, although that's got some disadvantages.

> Barring that:
>
> 1) on each main bucket page, record what the level of split for that
> bucket is.  That way you don't need to interlock the heavy locks
> between the meta page and the main bucket page.  Rather,  when you
> leave the meta page you remember not just what bucket you are heading
> for but also how far it has been split.  Then when you get there, if
> the split level you find doesn't match what you remember, you have to
> go back to the meta page and try again.  Indeed, this way you not only
> don't need to heavy-lock the meta page, you don't even need to visit
> it at all.  Just store the look-up table in the relcache, and then you
> only need to reload it from the metapage when you realize it is out of
> date.  But, you still need to heavy lock the bucket itself.  But
> implementing this does require an on-disk change to the hash index
> format.

This occurred to me almost at once when I looked at the code, and I
think it might be a good idea, though I also think it will not solve
the main problem here.
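
If I'm following, the reader-side check would boil down to something
roughly like this (hasho_split_count and expected_split_count are
invented names, and I'm hand-waving about how the expected value gets
into the relcache entry):

    Buffer          buf;
    HashPageOpaque  pageopaque;

    /* bucket/blkno computed from the relcache-cached metapage copy */
    buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
    pageopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

    /* hasho_split_count is the hypothetical new on-disk field */
    if (pageopaque->hasho_split_count != expected_split_count)
    {
        /*
         * Our cached picture of the index is stale: this bucket has been
         * split since we cached the metapage.  Release the page, reread
         * the real metapage to refresh the cache, and start over from
         * the bucket computation.
         */
        _hash_relbuf(rel, buf);
        goto recompute;
    }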

> 2) Only support bitmap scans and not ordinary tid scans (the way gin
> indexes already do).  Now you never need to pass execution back out of
> the hash am while you are still in a bucket, so you don't need heavy
> weight locks at all.  You would still need to interlock LWLocks as you
> move down the chain of overflow pages, but that should be much more
> concurrent.

-1 on losing amgettuple. I regret that we lost that for GIN and I
shall regret it more if we lose it anywhere else.

But... even without doing either of the above, how about trying to
eliminate the heavyweight locking around the metapage? I think the
only reason we're using it there is for deadlock detection - we can't
unlock the metapage until we've got the bucket lock, and we can't hold
an lwlock on the metapage while acquiring a heavyweight lock for fear
of deadlock (though I'm slightly unclear on what the deadlock scenario
is). Instead, suppose we take a shared buffer content lock on the
metapage, compute the target bucket, and release the buffer content
lock on the metapage. Next, we acquire the heavyweight lock on the
bucket. Then, we reacquire the buffer content lock on the metapage
and double-check that no split has occurred. If it has, we call a
do-over; else, we release the lwlock and proceed.
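
In _hash_first() terms, I'm imagining something like the sketch below
(hand-waving and untested; recomputing the bucket from the current
masks is one way of doing the "no split has occurred" check, and I'm
ignoring error paths and your relcache-caching idea):

    Buffer          metabuf;
    HashMetaPage    metap;
    Bucket          bucket;
    BlockNumber     blkno;

    /* hashkey already computed via _hash_datum2hashkey(), as today */
    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    for (;;)
    {
        /* compute the target bucket under the shared content lock */
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* drop the content lock (keep the pin) before the heavy lock */
        LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);

        /* heavyweight share lock on the target bucket, as now */
        _hash_getlock(rel, blkno, HASH_SHARE);

        /* reacquire the content lock and double-check for a split */
        LockBuffer(metabuf, BUFFER_LOCK_SHARE);
        if (_hash_hashkey2bucket(hashkey,
                                 metap->hashm_maxbucket,
                                 metap->hashm_highmask,
                                 metap->hashm_lowmask) == bucket)
            break;              /* still the right bucket; proceed */

        /* do-over: a concurrent split moved our key */
        _hash_droplock(rel, blkno, HASH_SHARE);
    }

    /* release the lwlock and proceed, still holding the bucket lock */
    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
    _hash_dropbuf(rel, metabuf);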

This would actually mean fewer lwlock acquisitions than we have now:
dropping the heavyweight acquire-and-release saves two lwlock
acquire-and-release cycles, both in exclusive mode, while the
double-check adds back only one, and that in shared mode.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
