Re: Proposal for CSN based snapshots

From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Ants Aasma <ants(at)cybertec(dot)at>, Bruce Momjian <bruce(at)momjian(dot)us>, obartunov <obartunov(at)postgrespro(dot)ru>, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: Proposal for CSN based snapshots
Date: 2016-08-10 10:03:54
Message-ID: CA+CSw_vbt=CwLuOgR7gXdpnho_Y4Cz7X97+o_bH-RFo7keNO8Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> (Reviving an old thread)
>
> I spent some time dusting off this old patch, to implement CSN snapshots.
> Attached is a new patch, rebased over current master, and with tons of
> comments etc. cleaned up. There's more work to be done here, I'm posting
> this to let people know I'm working on this again. And to have a backup on
> the 'net :-).

Great to hear. I hope to find some time to review this. In the
meantime, here is a brain dump of my thoughts on the subject.

> I switched to using a separate counter for CSNs. CSN is no longer the same
> as the commit WAL record's LSN. While I liked the conceptual simplicity of
> CSN == LSN a lot, and the fact that the standby would see the same commit
> order as master, I couldn't figure out how to make async commits to work.

I had an idea for how that could work. In short, lastCommitSeqNo
would become a vector clock. Each tier of synchronization requiring a
different amount of waiting between xlog insert and visibility would
get its own position in the vector. Commits then need to be ordered
only with respect to other commits within the same tier. The vector
position also needs to be stored in the CSN log, probably by
appropriating a couple of bits from the CSN value.
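
A rough sketch of the shape I have in mind (all names, widths and
tier assignments here are invented for illustration):

/* Sketch only: one commit sequence counter per durability tier, so
 * a commit only has to be ordered against other commits in the same
 * tier.  The tier tag rides along in the low bits of the value
 * stored in the CSN log. */
#define CSN_TIER_BITS   2
#define CSN_TIER_SYNC   0       /* synchronous_commit = on */
#define CSN_TIER_ASYNC  1       /* synchronous_commit = off */

typedef struct CommitSeqVector
{
    pg_atomic_uint64 lastCommitSeqNo[1 << CSN_TIER_BITS];
} CommitSeqVector;

static inline uint64
csn_encode(uint64 csn, int tier)
{
    return (csn << CSN_TIER_BITS) | (uint64) tier;
}

static inline uint64
csn_decode(uint64 value, int *tier)
{
    *tier = (int) (value & ((1 << CSN_TIER_BITS) - 1));
    return value >> CSN_TIER_BITS;
}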

I'm not sure doing it this way would be worth the complexity, but as
far as I can see it would work. However, I think keeping CSN and LSN
separate is a good idea regardless, as it keeps the door open to
using an external source for the CSN value, enabling things like
cheap multi-master globally consistent snapshots.

> Next steps:
>
> * Hot standby feedback is broken, now that CSN != LSN again. Will have to
> switch this back to using an "oldest XID", rather than a CSN.
>
> * I plan to replace pg_subtrans with a special range of CSNs in the csnlog.
> Something like, start the CSN counter at 2^32 + 1, and use CSNs < 2^32 to
> mean "this is a subtransaction, parent is XXX". One less SLRU to maintain.

That's a great idea. I had a similar idea to enable lock-free writing
into data structures: have 2^32 COMMITSEQNO_INPROGRESS values that
embed the xid owning the slot. This way a writer can just blindly CAS
into data structures without worrying about overwriting useful state
due to race conditions.
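
Roughly like this (the slot representation and helper are made up
for illustration):

/* Sketch only: CSN values below 2^32 mean "in progress, slot owned
 * by this xid".  A writer can CAS blindly: the compare fails if
 * anyone else has touched the slot since, so a lost race is
 * detected instead of silently overwriting. */
#define COMMITSEQNO_INPROGRESS(xid) ((uint64) (xid))

static bool
csn_slot_set_committed(pg_atomic_uint64 *slot, TransactionId xid,
                       uint64 csn)
{
    uint64 expected = COMMITSEQNO_INPROGRESS(xid);

    /* Only succeeds if the slot still carries our own xid. */
    return pg_atomic_compare_exchange_u64(slot, &expected, csn);
}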

> * Put per-proc xmin back into procarray. I removed it, because it's not
> necessary for snapshots or GetOldestSnapshot() (which replaces
> GetOldestXmin()) anymore. But on second thoughts, we still need it for
> deciding when it's safe to truncate the csnlog.
>
> * In this patch, HeapTupleSatisfiesVacuum() is rewritten to use an "oldest
> CSN", instead of "oldest xmin", but that's not strictly necessary. To limit
> the size of the patch, I might revert those changes for now.
>
> * Rewrite the way RecentGlobalXmin is updated. As Alvaro pointed out in his
> review comments two years ago, that was quite complicated. And I'm worried
> that the lazy scheme I had might not allow pruning fast enough. I plan to
> make it more aggressive, so that whenever the currently oldest transaction
> finishes, it's responsible for advancing the "global xmin" in shared memory.
> And the way it does that, is by scanning the csnlog, starting from the
> current "global xmin", until the next still in-progress XID. That could be a
> lot, if you have a very long-running transaction that ends, but we'll see
> how it performs.

I had a similar idea. This would be greatly improved by a hybrid
snapshot scheme that could make the tail end of the CSN log a lot
sparser. More on that below.

> * Performance testing. Clearly this should have a performance benefit, at
> least under some workloads, to be worthwhile. And not regress.

I guess this is the main question. GetSnapshotData is probably going
to be faster, but the second important benefit that I saw was the
possibility of reducing locking around common activities. This of
course needs to be guided by profiling - the evils of premature
optimization and all that. But that said, my gut tells me the most
important spots are:

* Looking up CSNs for XidVisibleInSnapshot. The current version takes
a shared lock on CSNLogControlLock. That might get nasty, for example
when two concurrent sequential scans running on different cores or
sockets hit a patch of transactions within their xmin-xmax interval.

* GetSnapshotData. With tps on read-only queries potentially pushing
into the millions, it would be awfully nice to avoid contending on a
single cache line. Currently it acquires ProcArrayLock and
CommitSeqNoLock to atomically update MyPgXact->snapshotcsn. The
pattern I had in mind to make this lock-free: on the snapshot side,
read a CSN, publish the value, check it against the latest
OldestGlobalSnapshot value and loop if necessary; on the other side,
calculate an optimistic oldest snapshot, publish it, then recheck
and republish if necessary. See the sketch after this list.

* While commits are not as performance critical, it may still be
beneficial to defer clog updates and perform them in larger batches
to reduce clog lock traffic.
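
For the GetSnapshotData side, the publish-and-recheck loop would
look roughly like this (the shared variable names are hypothetical,
and snapshotcsn is assumed to be a pg_atomic_uint64):

/* Sketch only.  Snapshot side: publish first, then verify that the
 * pruning side has not already advanced past us. */
uint64
GetSnapshotCSN(void)
{
    uint64 csn;

    for (;;)
    {
        csn = pg_atomic_read_u64(&CsnShared->nextCommitSeqNo);
        pg_atomic_write_u64(&MyPgXact->snapshotcsn, csn);
        pg_memory_barrier();
        if (csn >= pg_atomic_read_u64(&CsnShared->oldestGlobalSnapshot))
            break;              /* publication is safely visible */
        /* the pruner advanced past us in the meantime, retry */
    }
    return csn;
}

/* Pruning side (pseudocode): compute an optimistic oldest snapshot
 * from the published per-backend values, publish it, rescan the
 * published values, and republish if a backend slipped underneath
 * in the meantime. */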

I think I have achieved some clarity on my idea of snapshots that
migrate between being CSN based and XID based. The essential problem
with old CSN based snapshots is that the CSN log for their xmin-xmax
interval needs to be kept around in a quick-to-access data structure.
And the problem with long-running transactions is twofold: first,
they need a well-defined location where to publish the visibility
info for their xids, and second, they enlarge the xmin-xmax interval
of all concurrent snapshots, requiring a potentially huge amount of
CSN log to be kept around. One way to deal with that is to just
accept it and use an SLRU, as in this patch.

My new insight is that a snapshot doesn't need to be either CSN based
or XIP (xids in progress) array based; it can also be a hybrid. There
would be a sorted array of in-progress xids and a non-overlapping CSN
xmin-xmax interval where the CSN log needs to be consulted. As
snapshots age they scan the CSN log and incrementally build their XIP
array, basically lazily constructing the same data structure used in
snapshots today. Old in-progress xids need a slot of their own to be
kept around, but all new snapshots being taken would immediately
classify them as old, store them in the XIP array and exclude them
from their CSN xmin-xmax interval.

Under this scheme the amount of CSN log strictly needed is a
reasonably sized ring buffer for recent xids (probably sized based on
max_connections), a sparse map for long-running transactions (also
bounded by max_connections) and some slack for snapshots still being
converted. An SLRU or similar is still needed because there is no way
to ensure timely conversion of all snapshots, but by properly timing
the notification the probability of actually needing it should be
extremely low. Data structure maintenance operations occur naturally
on xid assignment and are easily batched. Long-running transactions
still affect all snapshots, but the effect is small and not dependent
on transaction age; old snapshots only affect themselves.
Subtransactions are still a pain, and would need something similar to
the current suboverflowed scheme. On the plus side, it seems like the
number of subxids allowed before overflowing could be global, not per
backend.
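
To make that concrete, a hybrid snapshot might look something like
this (field names and helper functions are invented for the sketch):

/* Sketch only: xids below csn_xmin are covered by the lazily built
 * sorted xip[] array, xids in [csn_xmin, xmax) consult the CSN log,
 * and conversion moves csn_xmin forward as the snapshot ages. */
typedef struct HybridSnapshot
{
    uint64          snapshotcsn; /* commits with CSN < this are visible */
    TransactionId   csn_xmin;    /* below this, xip[] is authoritative */
    TransactionId   xmax;        /* at or above this, always running */
    TransactionId  *xip;         /* sorted in-progress xids < csn_xmin */
    int             xcnt;
} HybridSnapshot;

/* Returns true if xid should be treated as still in progress. */
static bool
XidInHybridSnapshot(TransactionId xid, HybridSnapshot *snap)
{
    if (TransactionIdFollowsOrEquals(xid, snap->xmax))
        return true;
    if (TransactionIdPrecedes(xid, snap->csn_xmin))
        return xip_lookup(snap->xip, snap->xcnt, xid); /* binary search */
    /* Hypothetical CSN log lookup; assume it maps in-progress xids
     * to a value above any snapshot CSN.  Committed at or after the
     * snapshot also counts as running. */
    return CSNLogGetCSN(xid) >= snap->snapshotcsn;
}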

In short:

* A directly mapped ring buffer for recent xids could be accessed
lock-free and would take care of most of the traffic in most of the
cases. It looks like a good complexity/performance trade-off; a
sketch follows below.

* To keep workloads with wildly varying transaction lengths in a
bounded amount of memory, a significantly more complex hybrid
snapshot scheme is needed. It remains to be seen whether those
workloads are actually a significant issue, but if they are, the
approach described above might provide a way out.
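
The ring buffer from the first point could be as simple as this
(sizes and names invented), with the slot for an xid being a pure
function of the xid so readers can index it without any lock:

/* Sketch only.  A recycled slot is detected by the xid tag that the
 * in-progress marker embeds (see the CAS idea above); a reader that
 * finds a foreign tag falls back to the SLRU. */
#define CSN_RING_SLOTS  65536   /* power of two, from max_connections */

static pg_atomic_uint64 csnRing[CSN_RING_SLOTS];

static inline pg_atomic_uint64 *
csn_ring_slot(TransactionId xid)
{
    return &csnRing[xid & (CSN_RING_SLOTS - 1)];
}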

Regards,
Ants Aasma
