Re: Further reduction of bufmgr lock contention

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-07-20 19:41:28
Message-ID: 9019.1153424488@sss.pgh.pa.us
Lists: pgsql-hackers

A couple months ago we were discussing partitioning the buffer mapping
table so as to reduce contention for the BufMappingLock. The discussion
stalled after I complained about the uncertainty of shared memory usage
arising from having a separate hashtable for each partition (since
different numbers of buffers might fall into each partition at different
times). I've been thinking about it more after seeing an example from
Tatsuo that seems to be exactly the same problem as Gavin saw.

We could fix the uncertain-usage objection if we stick with a single
hashtable and ensure that concurrent access to different partitions of
it is safe. I believe this is easily doable, if we make hashtables
used in this mode allocate the maximum allowed number of buckets
immediately during hashtable initialization. (Since shared hashtables
already have a fixed maximum directory size, they already have an upper
bound on the number of buckets, so this loses no flexibility.) Then
there will be no on-the-fly bucket splits, and that means that accesses
to different hash buckets operate completely independently. Therefore,
we can regard the hashtable as logically partitioned on the basis of any
classification of entries that will uniquely assign hash buckets to
classes --- taking the low order bits of entry hash codes will do fine.
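For concreteness, the bucket-to-partition mapping could be as simple as
this (a sketch only; the names here are invented for illustration, not
existing dynahash symbols):

/*
 * With a power-of-two partition count, the low-order bits of the entry
 * hash code select the partition.  Assuming the (fixed) bucket count is
 * also a power of two and at least NUM_PARTITIONS, a bucket index is
 * itself just low-order hash bits, so every entry in a given bucket
 * lands in the same partition.
 */
#define NUM_PARTITIONS 16       /* must be a power of 2 */

static inline int
hash_to_partition(uint32 hashcode)
{
    return hashcode & (NUM_PARTITIONS - 1);
}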

The only changeable state that is shared across all buckets is the entry
freelist and the "nentries" counter. We could protect these with a
spinlock (one spinlock is sufficient since changes to nentries go along
with addition or removal of freelist entries).
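In code form, that shared state and its lock might look roughly like
this (again a sketch; the struct and routine are invented, not the real
dynahash internals):

#include "storage/spin.h"

typedef struct HashElement
{
    struct HashElement *next;   /* link in freelist or bucket chain */
    /* ... key and payload follow ... */
} HashElement;

typedef struct HashFreeListState
{
    slock_t      mutex;         /* protects freeList and nentries */
    HashElement *freeList;      /* chain of unused entries */
    long         nentries;      /* number of entries currently in use */
} HashFreeListState;

/* Pop one entry off the shared freelist, or return NULL if empty. */
static HashElement *
alloc_entry(HashFreeListState *state)
{
    HashElement *elem;

    SpinLockAcquire(&state->mutex);
    elem = state->freeList;
    if (elem != NULL)
    {
        state->freeList = elem->next;
        state->nentries++;      /* additions always come off the freelist */
    }
    SpinLockRelease(&state->mutex);
    return elem;
}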

Usage of a partitioned hash table would then look like this:

compute hashtable lookup key;
entryhashcode = calc_hash(lookup key);
partitionnumber = entryhashcode % NumPartitions;
LWLockAcquire(PartitionLock[partitionnumber], ...);
manipulate hashtable;
LWLockRelease(PartitionLock[partitionnumber]);

We could do this without changing the API of hash_search, but then we'd
be computing the entry hashcode twice, so I'm inclined to provide an
additional entry point that takes a precalculated hashcode.
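Purely as a sketch, the extra entry point (plus a helper to expose the
hash step) might look like this; these signatures are hypothetical, not
anything that exists in dynahash today:

/* compute the hash of a key once, so the caller can reuse it */
extern uint32 get_hash_value(HTAB *hashp, const void *keyPtr);

/* like hash_search(), but takes the precomputed hash value */
extern void *hash_search_with_hash_value(HTAB *hashp,
                                         const void *keyPtr,
                                         uint32 hashvalue,
                                         HASHACTION action,
                                         bool *foundPtr);

The caller would compute the hash once, derive the partition number from
it, take the partition lock, and then pass the same value into the
lookup.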

Potential downsides of applying this idea to the buffer mapping table:

1. Reassigning a buffer to a new page will (usually) require two cycles
of LWLockAcquire/Release for the two different partitions involved.
Since this path also requires at least a read() kernel call (maybe a
write() too), I don't think there'll be any meaningful slowdown.

2. The current logic for reassigning a buffer attempts to make a
hashtable entry for its new page number (to detect collisions) before
releasing the old hashtable entry. This would only be safe if we held
both partition LWLocks concurrently; which seems bad for concurrency,
plus avoiding deadlock would complicate the code significantly. I'm
inclined to release the old entry and then try to insert the new one,
holding only one lock at a time. If the insertion fails (because
someone was concurrently loading the same page), we'd have to throw the
buffer onto the freelist rather than allowing it to retain its previous
valid data. Which is slightly annoying, but in practice the case
probably doesn't happen often enough to be worth worrying about.
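In outline, the reassignment path would then look like this (just a
sketch: BufTableDelete, BufTableInsert, and StrategyFreeBuffer are
stand-ins with invented signatures, and BufTableInsert is assumed to
return false on collision):

static void
reassign_buffer(int buf, BufferTag oldTag, BufferTag newTag)
{
    int     oldPartition = calc_hash(&oldTag) % NumPartitions;
    int     newPartition = calc_hash(&newTag) % NumPartitions;

    LWLockAcquire(PartitionLock[oldPartition], LW_EXCLUSIVE);
    BufTableDelete(&oldTag);    /* drop the old page -> buffer mapping */
    LWLockRelease(PartitionLock[oldPartition]);

    /*
     * Window here: another backend may insert newTag before we do.
     */
    LWLockAcquire(PartitionLock[newPartition], LW_EXCLUSIVE);
    if (!BufTableInsert(&newTag, buf))
    {
        /*
         * Collision: someone concurrently loaded the same page.  Give
         * the buffer up to the freelist rather than letting it keep
         * its previous valid contents.
         */
        LWLockRelease(PartitionLock[newPartition]);
        StrategyFreeBuffer(buf);
        return;
    }
    LWLockRelease(PartitionLock[newPartition]);
}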

3. Taking the freelist spinlock is new computation that wasn't there
before. But, again, it's only needed in code paths that will also be
doing a kernel call.

If we do this we should probably also handle the lmgr lock tables the
same way (partially reverting my partition-the-LockMgrLock patch of a
couple months ago). However, downside #3 might be a stronger objection
for lmgr, since it can create or destroy lock objects without necessarily
doing any I/O.

Thoughts, objections?

regards, tom lane
