Re: spinlock contention

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlock contention
Date: 2011-06-27 15:54:28
Message-ID: BANLkTinmiDxt4-yFL6edWBhts+uqJ6G9Ew@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> ProcArrayLock looks like a tougher nut to crack - there's simply no
>> way, with the system we have right now, that you can take a snapshot
>> without locking the list of running processes.  I'm not sure what to
>> do about that, but we're probably going to have to come up with
>> something, because it seems clear that once we eliminate the lock
>> manager LWLock contention, this is a major bottleneck.
>
> Well as Tom observed earlier the kernel of a snapshot is actually an
> LSN. A snapshot contains a set of xids which all committed before some
> LSN and none which committed after it.
>
> So if we had a record of what log sequence number the commit record
> for any given transaction is we could build the snapshot at our
> leisure without any exclusive lock. In fact we could even build it
> lazily as a kind of cache only when we actually are interested in a
> given xid.

Yeah, I've been thinking about that. I think what we might do is set
up a new SLRU that works like CLOG, but each entry is an LSN rather
than just two bits. When a transaction commits, we save the commit
LSN under the entry for that XID. We truncate away SLRU pages that
contain no still-running XIDs. When we need to check whether an XID
is visible to our snapshot, we just look up the commit LSN and compare
it with our snapshot LSN. If it's before and non-zero, we can see it.
If it's after or all-zeroes, we can't.
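As a rough sketch of that visibility test (not PostgreSQL code: the type and function names here are hypothetical, and a flat array stands in for the SLRU pages the real thing would use):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;      /* an LSN; 0 means "no commit record" */
typedef uint32_t TransactionId;

/* Per-XID commit LSNs, in place of CLOG's two status bits.  A static
 * array is zero-initialized, so every XID starts out "still running". */
#define MAX_XID 1024
static XLogRecPtr commit_lsn[MAX_XID];

static void
record_commit(TransactionId xid, XLogRecPtr lsn)
{
    commit_lsn[xid] = lsn;
}

/* An XID is visible to a snapshot iff it has a commit LSN (non-zero)
 * and that LSN precedes the snapshot's LSN. */
static int
xid_visible(TransactionId xid, XLogRecPtr snapshot_lsn)
{
    XLogRecPtr lsn = commit_lsn[xid];

    return lsn != 0 && lsn < snapshot_lsn;
}
```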

But I'm not sure how much this would really help. It might (subject
to working out the details) make the actual process of taking a
snapshot faster. But it's not clear that taking snapshots more
quickly will actually help anything, because the problem is not the
amount of time spent taking the snapshot. The problem is rather
that doing so requires acquiring and releasing an LWLock, and each of
those operations requires taking and releasing a spinlock. And it is
the spinlock contention that is killing us. That makes me think we
need a way to take a snapshot without taking a spinlock. Or if we
must take spinlocks, we at least have to avoid every backend that
needs a snapshot lusting after the *same* spinlock.

What I've been thinking about this weekend is whether it might be
possible to create a sort of lock-free queue infrastructure. When a
backend starts up, it would add an entry to the queue saying "I'm
running". When it commits, it would add an entry to the queue saying
"I'm committed". All entries would be added at the *end* of the
queue, so a backend scanning the queue to build up a snapshot wouldn't
ever be able to see commits out of order. We would need some memory
barrier operations on weak-memory-ordering machines to make sure that
the queue writes became visible before the end-of-queue pointer bump.
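A minimal single-writer sketch of that append path, using C11 atomics for the release barrier (assumed names, and note the append itself still needs the insertion lock discussed below, since this isn't safe for concurrent writers):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef enum { ENTRY_RUNNING, ENTRY_COMMITTED } EntryKind;

typedef struct {
    uint32_t  xid;
    EntryKind kind;
} QueueEntry;

#define QUEUE_SIZE 256
static QueueEntry  queue[QUEUE_SIZE];
static atomic_uint queue_end;      /* index one past the last valid entry */

static void
queue_append(uint32_t xid, EntryKind kind)
{
    unsigned end = atomic_load_explicit(&queue_end, memory_order_relaxed);

    queue[end].xid  = xid;
    queue[end].kind = kind;
    /* Release store: the entry's contents must become visible before
     * the end-of-queue pointer bump, even on weak-memory machines. */
    atomic_store_explicit(&queue_end, end + 1, memory_order_release);
}

/* A scan reads the end pointer once (acquire) and never looks past it,
 * so it can never observe commits out of order. */
static unsigned
queue_snapshot_end(void)
{
    return atomic_load_explicit(&queue_end, memory_order_acquire);
}
```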

The trick is figuring out how to clean up the queue. Since "commit"
entries exist only to guard against "running" entries earlier in the
queue, the start-of-queue pointer can be advanced whenever it points
to a "commit" entry. Also, if it points to a "running" entry for
which there is a later "commit" entry, then the start-of-queue pointer
can be advanced over that as well. However, just because we can
advance the point at which backends start reading doesn't mean that we
can actually recycle space, because while we know that new scans
needn't worry about those entries, we *don't* know that there isn't
already a scan in flight that still needs them. Furthermore, if a
transaction runs for a long time, we can never advance the
start-of-queue pointer past the "running" entry for its XID, which is
obviously problematic since the queue would get very long.
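The start-of-queue advancement rule might look like this (a sketch under the same hypothetical queue layout; the quadratic lookahead is for clarity, not efficiency):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ENTRY_RUNNING, ENTRY_COMMITTED } EntryKind;

typedef struct {
    uint32_t  xid;
    EntryKind kind;
} QueueEntry;

/* Example queue: XID 7 committed, XIDs 8 and 9 started, then 8 committed. */
static QueueEntry queue[] = {
    {7, ENTRY_COMMITTED},
    {8, ENTRY_RUNNING},
    {9, ENTRY_RUNNING},
    {8, ENTRY_COMMITTED},
};
static unsigned queue_start;       /* where new scans begin reading */

/* Does a later "committed" entry cancel out this "running" entry? */
static int
has_later_commit(unsigned pos, unsigned end)
{
    for (unsigned i = pos + 1; i < end; i++)
        if (queue[i].xid == queue[pos].xid &&
            queue[i].kind == ENTRY_COMMITTED)
            return 1;
    return 0;
}

static void
advance_queue_start(unsigned end)
{
    while (queue_start < end)
    {
        if (queue[queue_start].kind == ENTRY_COMMITTED)
            queue_start++;                      /* pure guard entry */
        else if (has_later_commit(queue_start, end))
            queue_start++;                      /* running, but now done */
        else
            break;      /* a still-running XID pins the start pointer */
    }
}
```

With the example contents above, the pointer advances past XID 7's commit and XID 8's (now matched) running entry, then stops at XID 9, which is still running.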

To work around that problem, I think we could use Florian's idea
upthread of an RCU system. We keep two copies of the queue around, an
A copy and a B copy. When the currently active copy fills up, we
rewrite it into the other queue, omitting all "committed" entries and
any "running" entries that have matching "committed" entries, and then
tell everyone to start looking at that copy instead. We would need
some kind of gymnastics to make sure that we don't flip from the A
copy to the B copy and back to the A copy while some laggardly backend
is still hoping to scan the old A copy. A simple algorithm (there's
probably a smarter one) would be to have each backend take a spinlock
while it's scanning either copy, and to have the backend that is doing
the rewrite take and release all of those spinlocks one at a time
before beginning the rewrite, thus guaranteeing that any scans still
in progress when the rewrite is requested have completed before it's
actually performed. Any new scans started in the meanwhile will
certainly be looking at the current copy rather than the old copy
we're about to overwrite.
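The lock-drain handshake for the copy flip could be sketched like so (assumed names; spinlocks approximated with atomic exchanges, and the actual rewrite of surviving entries elided):

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_BACKENDS 8

/* One scan lock per backend: held (1) while that backend is scanning
 * whichever copy is active, free (0) otherwise. */
static atomic_int scan_lock[NUM_BACKENDS];
static atomic_int active_copy;     /* 0 = the A copy, 1 = the B copy */

static void
scan_lock_acquire(int backend)
{
    while (atomic_exchange(&scan_lock[backend], 1))
        ;   /* spin until free */
}

static void
scan_lock_release(int backend)
{
    atomic_store(&scan_lock[backend], 0);
}

/* Called by the rewriting backend when the active copy fills up.
 * Taking and releasing every backend's scan lock once guarantees that
 * any scan in flight when the rewrite was requested has finished, so
 * the inactive copy is safe to overwrite. */
static int
flip_copies(void)
{
    int old = atomic_load(&active_copy);

    for (int i = 0; i < NUM_BACKENDS; i++)
    {
        scan_lock_acquire(i);
        scan_lock_release(i);
    }
    /* ... rewrite surviving "running" entries into copy (1 - old) ... */
    atomic_store(&active_copy, 1 - old);
    return 1 - old;
}
```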

We would still need a lock around the operation of adding new items to
the queue; if two backends try to do that at the same time, chaos will
ensue. But if that lock gets busy, it would be possible to have
backends use some sort of system of elimination buffers to combine
their requests. If ten backends each want to advertise a new XID, it
will be far faster to have one backend acquire the lock, write in all
ten entries, and wake everyone up than to have each backend insert its
entry and wake up the next one. So what we might do is do a
condition-acquire on the lock to insert data into the buffer and, if
we can't get it immediately, go find another backend with the same
problem and elect a leader to do all the insertions. If we're the
leader, sleep on the queue-insertion lock. If not, sleep until the
leader wakes us up.
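In skeleton form, the conditional acquire and group flush might look like this (hypothetical names; the sleep/wakeup between follower and leader is elided, with followers simply leaving their entry in a staging area for the leader to flush):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_MAX 256
#define STAGE_MAX 64

static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t        queue[QUEUE_MAX];
static unsigned        queue_len;
static uint32_t        staged[STAGE_MAX];   /* elimination buffer */
static atomic_uint     staged_count;

/* Returns 1 if this backend became the leader and flushed the staged
 * entries itself, 0 if it merely staged its XID for a leader to insert. */
static int
advertise_xid(uint32_t xid)
{
    if (pthread_mutex_trylock(&insert_lock) != 0)
    {
        /* Condition-acquire failed: stage our entry.  In the real
         * design we would then sleep until the leader wakes us. */
        unsigned slot = atomic_fetch_add(&staged_count, 1);
        staged[slot] = xid;
        return 0;
    }

    /* We are the leader: one lock acquisition covers our own entry
     * plus everything other backends staged in the meantime. */
    queue[queue_len++] = xid;
    unsigned n = atomic_exchange(&staged_count, 0);
    for (unsigned i = 0; i < n; i++)
        queue[queue_len++] = staged[i];

    pthread_mutex_unlock(&insert_lock);
    return 1;
}
```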

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
