From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: cheaper snapshots redux
Date: 2011-08-22 21:25:28
Message-ID: CA+TgmoYD6EhYy1Rb+SEuns5smreY1_3rAMeL=76rX8deijy56Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 27, 2011 at 10:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I wonder whether we could do something involving WAL properties --- the
>> current tuple visibility logic was designed before WAL existed, so it's
>> not exploiting that resource at all.  I'm imagining that the kernel of a
>> snapshot is just a WAL position, ie the end of WAL as of the time you
>> take the snapshot (easy to get in O(1) time).  Visibility tests then
>> reduce to "did this transaction commit with a WAL record located before
>> the specified position?".  You'd need some index datastructure that made
>> it reasonably cheap to find out the commit locations of recently
>> committed transactions, where "recent" means "back to recentGlobalXmin".
>> That seems possibly do-able, though I don't have a concrete design in
>> mind.
>
> [discussion of why I don't think an LSN will work]
>
> But having said that an LSN can't work, I don't see why we can't just
> use a 64-bit counter.  In fact, the predicate locking code already
> does something much like this, using an SLRU, for serializable
> transactions only.  In more detail, what I'm imagining is an array
> with 4 billion entries, one per XID, probably broken up into files of
> say 16MB each with 2 million entries per file.  Each entry is a 64-bit
> value.  It is 0 if the XID has not yet started, is still running, or
> has aborted.  Otherwise, it is the commit sequence number of the
> transaction.
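
(For concreteness, the visibility test under that scheme would have
been something like the following -- just a sketch, all names
invented:)

    typedef uint64 CommitSeqNo;

    extern CommitSeqNo *csn_array;  /* the big memory-mapped array */

    static bool
    XidVisibleInSnapshot(TransactionId xid, CommitSeqNo snapshot_csn)
    {
        CommitSeqNo csn = csn_array[xid];   /* 0 = no commit recorded */

        /* visible iff it committed before our snapshot was taken */
        return csn != 0 && csn < snapshot_csn;
    }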

I've been giving this quite a bit more thought, and have decided to
abandon the scheme described above, at least for now. It has the
advantage of avoiding virtually all locking, but it's extremely
inefficient in its use of memory in the presence of long-running
transactions. For example, if there's an open transaction that's been
sitting around for 10 million transactions or so and has an XID
assigned, any new snapshot is going to need to probe into the big
array for any XID in that range. At 8 bytes per entry, that means
we're randomly accessing about 80MB of memory-mapped data. That
seems problematic both in terms of blowing out the cache and (on small
machines) possibly even blowing out RAM. Nor is that the worst case
scenario: a transaction could sit open for 100 million transactions.

Heikki has made the suggestion a few times (and a few other people
have since made somewhat similar suggestions in different words) of
keeping an up-to-date snapshot in shared memory such that transactions
that need a snapshot can simply copy it. I've since noted that in Hot
Standby mode, that's more or less what the KnownAssignedXids stuff
already does. I objected that, first, the overhead of updating the
snapshot for every commit would be too great, and second, it didn't
seem to do a whole lot to reduce the size of the critical section, and
therefore probably wouldn't improve performance that much. But I'm
coming around to the view that these might be solvable problems rather
than reasons to give up on the idea altogether.

With respect to the first problem, what I'm imagining is that we not
do a complete rewrite of the snapshot in shared memory on every
commit. Instead, when a transaction ends, we'll decide whether to (a)
write a new snapshot or (b) just record the XIDs that ended. If we do
(b), then any backend that wants a snapshot will need to copy from
shared memory both the most recently written snapshot and the XIDs
that have subsequently ended. From there, it can figure out which
XIDs are still running. Of course, if the list of recently-ended XIDs
gets too long, then taking a snapshot will start to get expensive, so
we'll need to periodically do (a) instead. There are other ways that
this could be done as well; for example, the KnownAssignedXids stuff
just flags XIDs that should be ignored and then periodically compacts
away the ignored entries.
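
In very hand-wavy pseudo-C, the reader side of (b) might look like
this (all names invented, locking omitted):

    typedef struct
    {
        SnapshotData    base;       /* snapshot as of last full rewrite */
        int             num_ended;  /* XIDs ended since 'base' was written */
        TransactionId   ended_xids[MAX_ENDED_XIDS];
    } SharedSnapshotData;

    static Snapshot
    GetIncrementalSnapshot(SharedSnapshotData *shared)
    {
        Snapshot    snap = CopySnapshot(&shared->base);
        int         i;

        /* strip out anything that ended after 'base' was written */
        for (i = 0; i < shared->num_ended; i++)
            RemoveXidFromSnapshot(snap, shared->ended_xids[i]);

        return snap;
    }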

I think the real trick is figuring out a design that can improve
concurrency. If you keep a snapshot in shared memory and periodically
overwrite it in place, I don't think you're going to gain much.
Everyone who wants a snapshot still needs a share-lock and everyone
who wants to commit still needs an exclusive-lock, and while you might
be able to make the critical section a bit shorter, I think it's still
going to be hard to make big gains that way. What I'm thinking about
instead is using a ring buffer with three pointers: a start pointer, a
stop pointer, and a write pointer. When a transaction ends, we
advance the write pointer, write the XIDs or a whole new snapshot into
the buffer, and then advance the stop pointer. If we wrote a whole
new snapshot, we advance the start pointer to the beginning of the
data we just wrote.
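
In sketch form (invented names again, and pretend the necessary
memory barriers are there):

    #define RING_SIZE 65536         /* a power of 2, say */

    typedef struct
    {
        uint64          start;      /* oldest entry a reader may use */
        uint64          stop;       /* end of published data */
        uint64          write;      /* end of claimed, maybe unwritten, data */
        TransactionId   buffer[RING_SIZE];
    } SnapshotRing;

    static void
    PublishTransactionEnd(SnapshotRing *ring, TransactionId *xids,
                          int nxids, bool whole_snapshot)
    {
        uint64      pos = ring->write;
        int         i;

        ring->write += nxids;       /* claim the space first */
        for (i = 0; i < nxids; i++)
            ring->buffer[(pos + i) % RING_SIZE] = xids[i];
        ring->stop = ring->write;   /* now readers may see it */
        if (whole_snapshot)
            ring->start = pos;      /* older entries are now dead weight */
    }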

Someone who wants to take a snapshot must read the data between the
start and stop pointers, and must then check that the write pointer
hasn't advanced so far in the meantime that the data they read might
have been overwritten before they finished reading it. Obviously,
that's a little risky, since we'll have to do the whole thing over if
a wraparound occurs, but if the ring buffer is large enough it
shouldn't happen very often. And a typical snapshot is pretty small
unless massive numbers of subxids are in use, so it seems like it
might not be too bad. Of course, it's pretty hard to know for sure
without coding it up and testing it.
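
The reader side would be something like this (same caveats as the
sketch above):

    static int
    ReadSnapshotData(SnapshotRing *ring, TransactionId *dest)
    {
        uint64      start;
        uint64      stop;
        uint64      pos;

    retry:
        start = ring->start;
        stop = ring->stop;

        for (pos = start; pos < stop; pos++)
            dest[pos - start] = ring->buffer[pos % RING_SIZE];

        /*
         * If a writer has lapped us, what we copied may be garbage;
         * throw it away and start over.
         */
        if (ring->write - start > RING_SIZE)
            goto retry;

        return (int) (stop - start);
    }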

There are a couple of reasons to think that this design might be
better than what we're doing now. Writers (people trying to commit)
can overlap with readers (people trying to take a snapshot), so long
as any reader is guaranteed to see the most recently completed write
(and no portion of a write that is still in progress). Also, this
approach admits some optimizations that are hard to manage with our
current scheme. For example, you can cache the data that you most
recently removed from the array. If you go to take a new snapshot and
find that the stop pointer hasn't moved (I'm imagining it as a 64-bit
counter that never wraps), then you don't need to copy the data out of
the array again; you can just reuse what you got last time. Maybe
that doesn't matter, since contents of the ProcArray change pretty
quickly, but then again I've seen quite a few profiles where
GetSnapshotData() was very high up in the list, so it seems at least
possible that cache line contention is making access to shared memory
slow enough that there is value in trying to optimize some of it away.
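
That optimization might look something like this (sketch; note that
because the stop counter is recorded before the data is copied, a
reuse can never return stale data, only an occasional unnecessary
re-read):

    static uint64           cached_stop = 0;
    static int              cached_count = 0;
    static TransactionId    cached_xids[RING_SIZE];

    static int
    GetCachedSnapshotData(SnapshotRing *ring, TransactionId *dest)
    {
        if (cached_count > 0 && ring->stop == cached_stop)
        {
            /* stop counter never wraps, so equality means nothing new
             * has been published since our last copy: just reuse it */
            memcpy(dest, cached_xids, cached_count * sizeof(TransactionId));
            return cached_count;
        }

        cached_stop = ring->stop;   /* record before copying, to be safe */
        cached_count = ReadSnapshotData(ring, cached_xids);
        memcpy(dest, cached_xids, cached_count * sizeof(TransactionId));
        return cached_count;
    }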

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
