Re: Proposal for CSN based snapshots

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Ants Aasma <ants(at)cybertec(dot)at>, Bruce Momjian <bruce(at)momjian(dot)us>, obartunov <obartunov(at)postgrespro(dot)ru>, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: Proposal for CSN based snapshots
Date: 2016-08-10 11:10:11
Message-ID: 7adf200e-b9ed-1de7-10df-0478ddd38114@iki.fi
Lists: pgsql-hackers

On 08/10/2016 01:03 PM, Ants Aasma wrote:
> On Tue, Aug 9, 2016 at 3:16 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> (Reviving an old thread)
>>
>> I spent some time dusting off this old patch, to implement CSN snapshots.
>> Attached is a new patch, rebased over current master, and with tons of
>> comments etc. cleaned up. There's more work to be done here, I'm posting
>> this to let people know I'm working on this again. And to have a backup on
>> the 'net :-).
>
> Great to hear. I hope to find some time to review this. In the
> meanwhile here is a brain dump of my thoughts on the subject.

Thanks!

>> * Performance testing. Clearly this should have a performance benefit, at
>> least under some workloads, to be worthwhile. And not regress.
>
> I guess this is the main question. GetSnapshotData is probably going
> to be faster, but the second important benefit that I saw was the
> possibility of reducing locking around common activities. This of
> course needs to guided by profiling - evils of premature optimization
> and all that.

Yep.

For the record, one big reason I'm excited about this is that a snapshot
can be represented as a small fixed-size value. And the reason that's
exciting is that it makes it more feasible for each backend to publish
its snapshot, which in turn makes it possible to vacuum old tuples
more aggressively. In particular, if you have one very long-running
transaction (think pg_dump), and a stream of transactions, you could
vacuum away tuples that were inserted after the long-running transaction
started and deleted before any of the shorter transactions started. You
could do that without CSN snapshots, but it's a lot simpler with them.
But that's a separate story.
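To illustrate the idea, here's a toy sketch of that vacuum rule, assuming each backend publishes its snapshot as a single CSN. All names here (the slot array, the helper) are mine for illustration, not from the patch: a dead tuple is reclaimable if no published snapshot falls between its insertion and its deletion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

/* Hypothetical: each backend's published snapshot, one CSN per slot. */
#define MAX_BACKENDS 4
static const CommitSeqNo published_snapshots[MAX_BACKENDS] = {100, 500, 510, 520};

/*
 * A dead tuple, inserted at insert_csn and deleted at delete_csn, can be
 * vacuumed if no published snapshot lies in [insert_csn, delete_csn):
 * no snapshot ever saw it as live.
 */
static bool
tuple_vacuumable(CommitSeqNo insert_csn, CommitSeqNo delete_csn)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
    {
        CommitSeqNo snap = published_snapshots[i];

        if (snap >= insert_csn && snap < delete_csn)
            return false;       /* this snapshot can still see the tuple */
    }
    return true;
}
```

With one long-running snapshot at CSN 100 (think pg_dump) and newer ones at 500 and up, a tuple inserted at 200 and deleted at 400 is vacuumable even while the old transaction is still running.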

> But that said, my gut tells me that the most important spots are:
>
> * Looking up CSNs for XidVisibleInSnapshot. Current version gets a
> shared lock on CSNLogControlLock. That might get nasty, for example
> when two concurrent sequential scans running on different cores or
> sockets hit a patch of transactions within their xmin-xmax interval.

I'm not too worried about that particular case. Shared LWLocks scale
pretty well these days, thanks to Andres Freund's work in this area. A
mix of updates and reads on the csnlog might become a scalability
problem though. We'll find out once we start testing.

> * GetSnapshotData. With tps on read only queries potentially pushing
> into the millions, it would be awfully nice if contending on a single
> cache line could be avoided. Currently it acquires ProcArrayLock and
> CommitSeqNoLock to atomically update MyPgXact->snapshotcsn. The
> pattern I had in mind to make this lock free would be to read a CSN,
> publish value, check against latest OldestGlobalSnapshot value and
> loop if necessary, and on the other side, calculate optimistic oldest
> snapshot, publish and then recheck, if necessary, republish.

Yeah, I haven't spent any serious effort on optimizing this yet. For
the final version, we should definitely do something like that.
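For what it's worth, the pattern Ants describes might look roughly like this C11-atomics sketch. The variable names are illustrative only (the real patch serializes this with ProcArrayLock and CommitSeqNoLock instead):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

/* Illustrative shared state; names do not match the patch. */
static _Atomic CommitSeqNo NextCommitSeqNo = 1000;
static _Atomic CommitSeqNo OldestGlobalSnapshot = 990;
static _Atomic CommitSeqNo MySnapshotCsn = 0;   /* this backend's slot */

/*
 * Reader side: read the current CSN, publish it, then recheck against
 * OldestGlobalSnapshot. If the horizon has already moved past our value
 * (a concurrent committer advanced it), loop and try again.
 */
static CommitSeqNo
take_snapshot(void)
{
    for (;;)
    {
        CommitSeqNo csn = atomic_load(&NextCommitSeqNo);

        atomic_store(&MySnapshotCsn, csn);      /* publish */

        if (atomic_load(&OldestGlobalSnapshot) <= csn)
            return csn;                         /* still valid */
        /* otherwise retry with a fresher CSN */
    }
}

/*
 * Writer side: compute an optimistic oldest snapshot, publish it, and
 * (in real code) recheck the slots, republishing if some backend
 * published an older value in the meantime.
 */
static void
advance_oldest_snapshot(void)
{
    CommitSeqNo oldest = atomic_load(&NextCommitSeqNo);
    CommitSeqNo published = atomic_load(&MySnapshotCsn);

    if (published != 0 && published < oldest)
        oldest = published;
    atomic_store(&OldestGlobalSnapshot, oldest);
}
```

The publish-then-recheck dance on both sides is what lets each side avoid taking a lock while still never letting the horizon pass a snapshot that is about to be used.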

> * While commits are not as performance critical it may still be
> beneficial to defer clog updates and perform them in larger batches to
> reduce clog lock traffic.

Hmm. I doubt batching them is going to help much. But it should be
straightforward to use atomic operations here too, to reduce locking.

> I think I have achieved some clarity on my idea of snapshots that
> migrate between being CSN and XID based. The essential problem with
> old CSN based snapshots is that the CSN log for their xmin-xmax
> interval needs to be kept around in a quick-to-access data structure.
> And the problem with long running transactions is twofold, first is
> that they need a well defined location where to publish the visibility
> info for their xids and secondly, they enlarge the xmin-xmax interval
> of all concurrent snapshots, needing a potentially huge amount of CSN
> log to be kept around. One way to deal with it is to just accept it
> and use a SLRU as in this patch.
>
> My new insight is that a snapshot doesn't need to be either-or CSN or
> XIP (xid in progress) array based, but it can also be a hybrid. There
> would be a sorted array of in progress xids and a non-overlapping CSN
> xmin-xmax interval where CSN log needs to be consulted. As snapshots
> age they scan the CSN log and incrementally build their XIP array,
> basically lazily constructing the same data structure used in snapshots
> today. Old in progress xids need a slot for themselves to be kept
> around, but all new snapshots being taken would immediately classify
> them as old, store them in XIP and not include them in their CSN xmin.
> Under this scheme the amount of CSN log strictly needed is a
> reasonably sized ring buffer for recent xids (probably sized based on
> max conn), a sparse map for long transactions (bounded by max conn)
> and some slack for snapshots still being converted. A SLRU or similar
> is still needed because there is no way to ensure timely conversion of
> all snapshots, but by properly timing the notification the probability
> of actually using it should be extremely low. Data structure
> maintenance operations occur naturally on xid assignment and are
> easily batched. Long running transactions still affect all snapshots,
> but the effect is small and not dependent on transaction age, old
> snapshots only affect themselves. Subtransactions are still a pain,
> and would need something similar to the current suboverflowed scheme.
> On the plus side, it seems like the number of subxids to overflow
> could be global, not per backend.

Yeah, if the csnlog access turns out to be too expensive, we could do
something like this. In theory, you can always convert a CSN snapshot
into an old-style list of XIDs, by scanning the csnlog between the xmin
and xmax. That could be expensive if the distance between xmin and xmax
is large, of course. But as you said, you can have various hybrid forms,
where you use a list of XIDs of some range as a cache, for example.
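As a sketch of that conversion, scanning a toy in-memory csnlog between xmin and xmax and collecting every XID that is invisible to the snapshot (still in progress, or committed at or after the snapshot's CSN) yields the same set an old-style XIP array would hold. All names below are illustrative, not the patch's:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

#define InvalidCommitSeqNo 0            /* still in progress */

/* Toy in-memory csnlog: commit CSN per xid, 0 if not yet committed. */
static const CommitSeqNo csn_for[16] = {0, 10, 20, 0, 40};

/*
 * Convert a CSN snapshot into an old-style XIP array: every xid in
 * [xmin, xmax) that is in progress, or committed at or after our
 * snapshot CSN, is invisible and goes into the array.
 */
static int
snapshot_to_xip(TransactionId xmin, TransactionId xmax,
                CommitSeqNo snapshotcsn, TransactionId *xip)
{
    int n = 0;

    for (TransactionId xid = xmin; xid < xmax; xid++)
    {
        CommitSeqNo csn = csn_for[xid];

        if (csn == InvalidCommitSeqNo || csn >= snapshotcsn)
            xip[n++] = xid;
    }
    return n;
}
```

With xmin = 1, xmax = 5 and a snapshot CSN of 30, xids 1 and 2 (committed at 10 and 20) are visible and dropped, while xid 3 (in progress) and xid 4 (committed at 40) land in the XIP array.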

I'm hopeful that we can simply make the csnlog access fast enough,
though. Looking up an XID in a sorted array is O(log n), while looking
up an XID in the csnlog is O(1). That ignores all locking and different
constant factors, of course, but it's not a given that accessing the
csnlog has to be slower than a binary search of an XID array.
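The difference is easy to see in miniature: the old-style check is a binary search of the snapshot's sorted XID array, while the CSN check is a single array index. Again a toy in-memory csnlog with illustrative names, ignoring locking and paging:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

static int
cmp_xid(const void *a, const void *b)
{
    TransactionId x = *(const TransactionId *) a;
    TransactionId y = *(const TransactionId *) b;

    return (x > y) - (x < y);
}

/* Old style: O(log n) binary search of the snapshot's sorted XID array. */
static int
xid_in_xip(TransactionId xid, const TransactionId *xip, size_t n)
{
    return bsearch(&xid, xip, n, sizeof(TransactionId), cmp_xid) != NULL;
}

/* CSN style: O(1) direct index into a toy in-memory csnlog page. */
static const CommitSeqNo csnlog[8] = {0, 10, 20, 0, 40};

static CommitSeqNo
csn_lookup(TransactionId xid)
{
    return csnlog[xid];         /* one array access, no search */
}
```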

- Heikki
