At the just-past OSDN database conference, Bruce and I were annoyed by
some benchmark results showing that Postgres performed poorly on an
8-way SMP machine. Based on past discussion, it seems likely that the
culprit is the known inefficiency in our spinlock implementation.
After chewing on it for awhile, we came up with an idea for a solution.
The following proposal should improve performance substantially when
there is contention for a lock, but it creates no portability risks
because it uses the same system facilities (TAS and SysV semaphores)
that we have always relied on. Also, I think it'd be fairly easy to
implement --- I could probably get it done in a day.
regards, tom lane
Replace most uses of spinlocks with "lightweight locks" (LW locks)
implemented by a new lock manager. The principal remaining use of true
spinlocks (TAS locks) will be to provide mutual exclusion of access to
LW lock structures. Therefore, we can assume that spinlocks are never
held for more than a few dozen instructions --- and never across a kernel
It's pretty easy to rejigger the spinlock code to work well when the lock
is never held for long. We just need to change the spinlock retry code
so that it does a tight spin (continuous retry) for a few dozen cycles ---
ideally, the total delay should be some small multiple of the max expected
lock hold time. If lock still not acquired, yield the CPU via a select()
call (10 msec minimum delay) and repeat. Although this looks inefficient,
it doesn't matter on a uniprocessor because we expect that backends will
only rarely be interrupted while holding the lock, so in practice a held
lock will seldom be encountered. On SMP machines the tight spin will win
since the lock will normally become available before we give up and yield
Desired properties of the LW lock manager include:
* very fast fall-through when no contention for lock
* waiting proc does not spin
* support both exclusive and shared (read-only) lock modes
* grant lock to waiters in arrival order (no starvation)
* small lock structure to allow many LW locks to exist.
Proposed contents of LW lock structure:
spinlock mutex (protects LW lock state and PROC queue links)
count of exclusive holders (always 0 or 1)
count of shared holders (0 .. MaxBackends)
queue head pointer (NULL or ptr to PROC object)
queue tail pointer (could do without this to save space)
If a backend sees it must wait to acquire the lock, it adds its PROC
struct to the end of the queue, releases the spinlock mutex, and then
sleeps by P'ing its per-backend wait semaphore. A backend releasing the
lock will check to see if any waiter should be granted the lock. If so,
it will update the lock state, release the spinlock mutex, and finally V
the wait semaphores of any backends that it decided should be released
(which it removed from the lock's queue while holding the sema). Notice
that no kernel calls need be done while holding the spinlock. Since the
wait semaphore will remember a V occurring before P, there's no problem
if the releaser is fast enough to release the waiter before the waiter
reaches its P operation.
We will need to add a few fields to PROC structures:
* Flag to show whether PROC is waiting for an LW lock, and if so
whether it waits for read or write access
* Additional PROC queue link field.
We can't reuse the existing queue link field because it is possible for a
PROC to be waiting for both a heavyweight lock and a lightweight one ---
this will occur when HandleDeadLock or LockWaitCancel tries to acquire
the LockMgr module's lightweight lock (formerly spinlock).
It might seem that we also need to create a second wait semaphore per
backend, one to wait on HW locks and one to wait on LW locks. But I
believe we can get away with just one, by recognizing that a wait for an
LW lock can never be interrupted by a wait for a HW lock, only vice versa.
After being awoken (V'd), the LW lock manager must check to see if it was
actually granted the lock (easiest way: look at own PROC struct to see if
LW lock wait flag has been cleared). If not, the V must have been to
grant us a HW lock --- but we still have to sleep to get the LW lock. So
remember this happened, then loop back and P again. When we finally get
the LW lock, if there was an extra P operation then V the semaphore once
before returning. This will allow ProcSleep to exit the wait for the HW
lock when we return to it.
While waiting for an LW lock, we need to show in our PROC struct whether
we are waiting for read or write access. But we don't need to remember
this after getting the lock; if we know we have the lock, it's easy to
see by inspecting the lock whether we hold read or write access.
ProcStructLock cannot be replaced by an LW lock, since a backend cannot
use an LW lock until it has obtained a PROC struct and a semaphore,
both of which are protected by this lock. It seems okay to use a plain
spinlock for this purpose. NOTE: it's okay for SInvalLock to be an LW
lock, as long as the LW mgr does not depend on accessing the SI array
of PROC objects, but only chains through the PROCs themselves.
Another tricky point is that some of the setup code executed by the
postmaster may try to to grab/release LW locks. Here, we can probably
allow a special case for MyProc=NULL. It's likely that we should never
see a block under these circumstances anyway, so finding MyProc=NULL when
we need to block may just be a fatal error condition.
A nastier case is checkpoint processes; these expect to grab BufMgr and
WAL locks. Perhaps okay for them to do plain sleeps in between attempts
to grab the locks? This says that the MyProc=NULL case should release
the spinlock mutex, sleep 10 msec, try again, rather than any sort of error
or expectation of no conflict. Are there any cases where this represents
a horrid performance loss? Checkpoint itself seems noncritical.
Alternative is for checkpoint to be allowed to create a PROC struct (but
not to enter it in SI list) so's it can participate normally in LW lock
operations. That seems a good idea anyway, actually, so that the PROC
struct's facility for releasing held LW locks at elog time will work
inside the checkpointer. (But that means we need an extra sema too?
Okay, but don't want an extra would-be backend to obtain the extra sema
and perhaps cause a checkpoint proc to fail. So must allocate the PROC
and sema for checkpoint process separately from those reserved for
pgsql-hackers by date
|Next:||From: Mike Rogers||Date: 2001-09-26 16:13:17|
|Subject: PostgreSQL / PHP Overrun Error|
|Previous:||From: Tom Lane||Date: 2001-09-26 15:19:00|
|Subject: Re: optimizer question |