I have been thinking about how to handle long running transactions with
Robert’s commit sequence number (CSN) idea.
I just started to go through transaction management code, so I would
appreciate if anyone could point out any gaping holes in my ideas.
There are two reasons why XID-to-CSN values need to be kept around. One
is that long running transactions need a place to publish their CSN for
CSN based snapshots. The other is that CSN based snapshots need to know
which transactions between xmin and xmax are visible and which are
My idea is to use hybrid storage for the XID-to-CSN storage. For recent
transactions shared memory contains a dense XID indexed ring buffer of
CSN values. Head of the buffer moves with nextXid. The tail is also
moved on xid assignment, but done in larger chunks to reduce lock
contention. When moving the tail up, any XID that might still be
invisible to a snapshot is moved to a sparse buffer. To keep the size of
the sparse buffer within limits, too old snapshots are converted into
list-of-running-xids snapshots. This allows us to move up the visible to
all csn-based snapshots horizon (csnmin) and release xids below that
horizon from the sparse buffer. CSNs themselves can be uint32s, this
means that no transaction can be older than 2 billion writing
transactions, which is already enforced by xids being 32 bit values.
Taking a snapshot using this idea would consist of a Xmin, Xmax and CSN.
Snapshot Xmin guarantees that no xid below Xmin can have a CSN bigger
than the snapshots. One way to get Xmin would be to maintain a global
value when doing the tail update of the dense array. Snapshot Xmax
guarantees that no xid above Xmax can have a CSN less than the
snapshots. Xmax can be easily obtained from nextXid. CSN is simply the
current value of the CSN counter. After the snapshot is taken, if
necessary the csnmin of the proc needs to be updated and snapshot added
to the list of acquired snapshots.
All of the guarantees of taking a snapshot can be maintained without
global locking if we have atomic reads/writes of CSN and XID values. We
need to obtain them in the order Xmin -> CSN -> Xmax, inserting read
barriers between the loads. Write side ordering is guaranteed by the CSN
lock. Updating the proc entry can be done under the per proc lock.
To check visibility of a xid between snapshots xmin and xmax, we first
check if its above the dense xid array tail. If it is in dense, just
read the CSN value and compare it to the snapshot value. If xid falls
under the ringbuffer tail, go through the sparse buffer. If the xid is
found in the sparse buffer, just read the CSN value and compare. If it
is not found, then it is guaranteed to be completed before the snapshot,
just read the clog status for the transaction. This is all done without
locking. When looking something up from the dense array, we have to
issue a read barrier and recheck the tail location to ensure that it
hasn’t moved out from under us. If it has, retry.
To commit a transaction we take a lock on the CSN, stamp all our XIDs
with the value, update clog, a write barrier to ensure that lock free
readers see the changes, increment the csn and relase the lock. For each
of our XIDs we need to check if it is above or below the dense
ringbuffer tail and update the respective value.
Moving of the dense ringbuffer tail should be done when assigning XIDs.
This allows us to enforce a hard limit on the size of the ringbuffer. To
reduce contention on the CSN lock, this should be done in large
increments. 1/8 of the ringbuffer at a time seems like a good compromise
between frequency and buffer space efficiency.
To move the tail, we first find the global minimum value of snapshot
CSNs (csnmin). If every proc maintains its own csnmin under per proc
lock, we can just get the current csn and go through all the procs, take
the per proc lock and fetch the csnmin to get the global value. Then we
go through XIDs between old tail and new tail and find all that are
still running or not visible to csnmin. We insert the XIDs into the
sparse buffer, issue a write barrier so they are visible everyone and
then update the tail location. Because we are trawling through the
procarray here anyway, its also a good time to update global xmin.
Because we want to avoid unbounded growth of the sparse buffer, we need
to get rid of old CSN based snapshots. Tying this into XID assignment is
a good way to get some provable limits, if subtransactions can be
overflowed somehow. For instance, we can keep track of the CSN during
last few tail updates, and convert any too old snapshots - this puts the
cap on the sparse buffer at O(num_backends*non-overflowed-sub-txs +
n_steps*tail-update-step) entries. When go try to find the new csnmin
and discover that a backend has a csnmin that is too old, we go through
the snapshots of that backend and convert every snapshot under the
desired csnmin to a traditional snapshot. To convert a snapshot to
traditional, we must find list of xids that were running when the CSN
was assigned. For this, we go through the sparse buffer and find all
xids between xmin and xmax that have CSN above that of the snapshots or
are still running. While doing that, we might as well update xmin and
xmax to tighter bounds, and maybe if xmax - xmin is small enough, store
it as a bitmap. When done, assign the array to the snapshot, issue a
write barrier and update a flag on the snapshot that tells it to use the
embedded running xids array when checking visibility. As with updating
the tail of the dense array, the status of this flag needs to be
rechecked after any CSN based visibility check to ensure correctness.
After we are sure that no snapshot exists above our new csnmin, we can
go through the sparse buffer and evict xids that are visible to all.
Moving of the tail needn’t be done under xidGenLock, a separate
denseHorizonUpdate lock would do. Commits would still block though,
unless we are willing to either make snapshots block or hand out stale
The sparse buffer needs to be a single updater bounded size lock free
datastructure. When inserting to it, we need to guarantee visibility
before dense array tail update. When removing from it, correct traversal
needs to be guaranteed and that no one can read it after the entry is
The dense buffer needs to be sized so that most writing transactions
complete before falling out of the ringbuffer and most snapshots are
released before converting. Something like num_backends*16 might be a
good start. This would use 4KB of memory for every 64 connections.
Under this sheme, taking a snapshot would be O(1), commiting would be
O(subtransactions), getting a new xid would be O(1 + num backends/dense
buffer step), if the buffer step is scaled with num backends, O(1)
amortized. Worst case seems to be num_backends - 1 with long read/write
transactions and one backend with very high write transaction rate.
I haven’t yet thought through how subtransactions and replication would
work with this scheme, but I feel that they don’t change any of the
pgsql-hackers by date
|Next:||From: Kevin Grittner||Date: 2011-09-08 13:47:35|
|Subject: Re: postgresql.conf archive_command example|
|Previous:||From: Aidan Van Dyk||Date: 2011-09-08 13:09:37|
|Subject: Re: postgresql.conf archive_command example|