Re: Proposal for CSN based snapshots

From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal for CSN based snapshots
Date: 2013-06-07 11:59:49
Message-ID: 51B1CB35.4050406@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Ants,

On 06/07/2013 12:42 AM, Ants Aasma wrote:
> Given the recent ideas being thrown about changing how freezing and
> clog is handled and MVCC catalog access I thought I would write out
> the ideas that I have had about speeding up snapshots in case there is
> an interesting tie in with the current discussions.

Thanks for this write-up.

> To refresh your memory the basic idea is to change visibility
> determination to be based on a commit sequence number (CSN for short)
> - a 8 byte number incremented on every commit representing the total
> ordering of commits. To take a snapshot in this scheme you only need
> to know the value of last assigned CSN, all transactions with XID less

You mean with a *CSN* less than or equal to that number? There's
certainly no point in comparing a CSN with an XID.

> than or equal to that number were commited at the time of the
> snapshots, everything above wasn't committed. Besides speeding up
> snapshot taking, this scheme can also be a building block for
> consistent snapshots in a multi-master environment with minimal
> communication.

Agreed. Postgres-R uses a CommitOrderId, which is very similar in
concept, for example.

> The main tricky part about this scheme is finding the CSN that was
> assigned to each XIDs in face of arbitrarily long transactions and
> snapshots using only a bounded amount of shared memory. The secondary
> tricky part is doing this in a way that doesn't need locks for
> visibility determination as that would kill any hope of a performance
> gain.

Agreed. Para-phrasing, you can also say that CSNs can only ever identify
completed transactions, where XIDs can be used to identify transactions
in progress as well.

[ We cannot possibly get rid of XIDs entirely for that reason. And the
mapping between CSNs and XIDs has some additional cost. ]

> We need to keep around CSN slots for all currently running
> transactions and CSN slots of transactions that are concurrent with
> any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
> this I propose the following datastructures to do the XID-to-CSN
> mapping. For most recently assigned XIDs there is a ringbuffer of
> slots that contain the CSN values of the XIDs or special CSN values
> for transactions that haven't completed yet, aborted transactions or
> subtransactions. I call this the dense CSN map. Looking up a CSN of a
> XID from the ringbuffer is just a trivial direct indexing into the
> ring buffer.
>
> For long running transactions the ringbuffer may do a full circle
> before a transaction commits. Such CSN slots along with slots that are
> needed for older snapshots are evicted from the dense buffer into a
> sorted array of XID-CSN pairs, or the sparse mapping. For locking
> purposes there are two sparse buffers, one of them being active the
> other inactive, more on that later. Looking up the CSN value of a XID
> that has been evicted into the sparse mapping is a matter of
> performing a binary search to find the slot and reading the CSN value.

I like this idea of dense and sparse "areas". Seems like a simple
enough, yet reasonably compact representation that might work well in
practice.

> Because snapshots can remain active for an unbounded amount of time
> and there can be unbounded amount of active snapshots, even the sparse
> mapping can fill up.

I don't think the number of active snapshots matters - after all, they
could all refer the same CSN. So that number shouldn't have any
influence on the size of the sparse mapping.

What does matter is the number of transactions referenced by such a
sparse map. You are of course correct in that this number is equally
unbounded.

> To handle that case, each backend advertises its
> lowest snapshot number csn_min. When values need to be evicted from
> the sparse mapping, they are evicted in CSN order and written into the
> CSN log - a series of CSN-XID pairs. Backends that may still be
> concerned about those values are then notified that values that they
> might need to use have been evicted. Backends with invalidated
> snapshots can then convert their snapshots to regular list of
> concurrent XIDs snapshots at their leisure.
>
> To convert a CSN based snapshot to XID based, a backend would first
> scan the shared memory structures for xids up to snapshot.xmax for
> CSNs that are concurrent to the snapshot and insert the XIDs into the
> snapshot, then read in the CSN log starting from snapshots CSN,
> looking for xid's less than the snapshots xmax. After this the
> snapshot can be handled like current snapshots are handled.

Hm, I dislike the requirement to maintain two different snapshot formats.

Also mind that snapshot conversions - however unlikely you choose to
make them - may well result in bursts as multiple processes may need to
do such a conversion, all starting at the same point in time.

> Rolling back a transaction
> --------------------------
>
> Rolling back a transaction is basically the same as committing, but
> the CSN slots need to be stamped with a AbortedCSN.

Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?

> Sizing the CSN maps
> -------------------
>
> CSN maps need to sized to accomodate the number of backends.
>
> Dense array size should be picked so that most xids commit before
> being evicted from the dense map and sparse array will contain slots
> necessary for either long running transactions or for long snapshots
> not yet converted to XID based snapshots. I did a few quick
> simulations to measure the dynamics. If we assume a power law
> distribution of transaction lengths and snapshots for the full length
> of transactions with no think time, then 16 slots per backend is
> enough to make the probability of eviction before commit less than
> 1e-6 and being needed at eviction due to a snapshot about 1:10000. In
> the real world very long transactions are more likely than predicted
> model, but at least the normal variation is mostly buffered by this
> size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
> the default value of 100 backends, or 125KB for 1000 backends.

Sounds reasonable to me.

> Sparse buffer needs to be at least big enough to fit CSN slots for the
> xids of all active transactions and non-overflowed subtransactions. At
> the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
> at 16 bytes * (64 + 1) slots * 100 = backends = 101.6KB per buffer,
> or 203KB total in the default configuration.

A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.

> Performance discussion
> ----------------------
>
> Taking a snapshot is extremely cheap in this scheme, I think the cost
> will be mostly for publishing the csnmin and rechecking validity after
> it. Taking snapshots in the shadow of another snapshot (often the case
> for the proposed MVCC catalog access) will be even cheaper as we don't
> have to advertise the new snapshot. The delayed XID based snapshot
> construction should be unlikely, but even if it isn't the costs are
> similar to GetSnapshotData, but we don't need to hold a lock.
>
> Visibility checking will also be simpler as for the cases where the
> snapshot is covered by the dense array it only requires two memory
> lookups and comparisons.

Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Where as currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
contention.)

> At this point I don't see any major issues with this approach. If the
> ensuing discussion doesn't find any major showstoppers then I will
> start to work on a patch bit-by-bit.

Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format. :-)

> we have a small one in the family.

Congratulations on that one.

Regards

Markus Wanner

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bernd Helmle 2013-06-07 12:18:46 Re: Hard limit on WAL space used (because PANIC sucks)
Previous Message Amit Kapila 2013-06-07 11:37:27 Re: Performance Improvement by reducing WAL for Update Operation