Re: The plan for FDW-based sharding

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: The plan for FDW-based sharding
Date: 2016-03-07 12:13:12
Message-ID: CAMsr+YFCPh4TWAPZts7Jdysgt6VOuRH+hyBC6g8bMLD_q0CQvQ@mail.gmail.com
Lists: pgsql-hackers

On 5 March 2016 at 23:41, Kevin Grittner <kgrittn(at)gmail(dot)com> wrote:

>
> > I'd be really interested in some ideas on how that information might be
> > usefully accessed. If we could write info on when to apply commits to the
> > xlog in serializable mode that'd be very handy, especially when looking to
> > the future with logical decoding of in-progress transactions, parallel
> > apply, etc.
>
> Are you suggesting the possibility of holding off on writing the
> commit record for a SERIALIZABLE transaction to WAL until it is
> known that no other SERIALIZABLE transaction comes ahead of it in
> the apparent order of execution? If so, that's an interesting idea
> that I hadn't given much thought to yet -- I had been assuming
> current WAL writes, with adjustments to the timing of application
> of the records.
>

I wasn't, I simply wrote less than clearly. I intended to say "from the
xlog" where I wrote "to the xlog". Nonetheless, that'd be a completely
unrelated but interesting thing to explore...

> > For parallel apply I anticipated that we'd probably have workers applying
> > xacts in parallel and committing them in upstream commit order. They'd
> > sometimes deadlock with each other; when this happened all workers whose
> > xacts committed after the first aborted xact would have to abort and start
> > again. Not ideal, but safe.
> >
> > Being able to avoid that by using SSI information was in the back of my
> > mind, but with no idea how to even begin to tackle it. What you've mentioned
> > here is helpful and I'd be interested if you could share a bit more of your
> > experience in the area.
>
> My thinking so far has been that reordering the application of
> transaction commits on a replica would best be done as the minimal
> rearrangement possible from commit order which allows the work of
> transactions to become visible in an order consistent with some
> one-at-a-time run of those transactions. Partly that is because
> the commit order is something that is fairly obvious to see and is
> what most people intuitively look at, even when it is wrong.
> Deviating from this intuitive order seems likely to introduce
> confusion, even when the results are 100% correct.
>

> The only place you *need* to vary from commit order for correctness
> is when there are overlapping SERIALIZABLE transactions, one
> modifies data and commits, and another reads the old version of the
> data but commits later.

Ah, right. So here, even though X1 commits before X2 running concurrently
under SSI, the logical order in which the xacts could've occurred serially
is the one where X2 runs and commits before X1, since X2 doesn't
depend on X1. X2 read the old row version before X1 modified it,
and logically occurs before X1 in the serial rearrangement.
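
To make that reordering concrete, here is a rough Python sketch. It is not
anything PostgreSQL does today; the function name and the (reader, writer)
representation of rw-conflicts are made up for illustration, and it assumes
every xact named in a conflict also appears in the commit order. It keeps
commit order wherever it can and only moves an xact when an rw-conflict
requires it.

```python
from typing import Dict, List, Set, Tuple


def apparent_order(commit_order: List[str],
                   rw_conflicts: Set[Tuple[str, str]]) -> List[str]:
    """Reorder commit_order so each reader precedes the writer it conflicts with.

    rw_conflicts holds hypothetical (reader, writer) pairs: the reader saw the
    old version of a row that the writer modified.
    """
    # For each xact, the set of xacts that must appear before it.
    must_precede: Dict[str, Set[str]] = {x: set() for x in commit_order}
    for reader, writer in rw_conflicts:
        must_precede[writer].add(reader)

    placed: List[str] = []
    remaining = list(commit_order)
    while remaining:
        # Take the earliest-committed xact whose prerequisites are all placed.
        for xact in remaining:
            if must_precede[xact] <= set(placed):
                placed.append(xact)
                remaining.remove(xact)
                break
        else:
            raise ValueError("rw-conflicts form a cycle; no serial order exists")
    return placed


# X1 commits first, but X2 read the row version that X1 replaced.
print(apparent_order(["X1", "X2"], {("X2", "X1")}))  # ['X2', 'X1']
```

With no rw-conflicts the result is simply the commit order.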

I don't fully grasp how that can lead to a situation where xacts can commit
in an order that's valid upstream but not valid as a downstream apply
order. I presume we're looking at read-only logical replicas here (rather
than multimaster), and it's only a concern for SERIALIZABLE xacts since a
READ COMMITTED xact on the master and replica would both be able to see the
state where X1 is committed but X2 isn't yet. But I don't see how a
read-only xact in SERIALIZABLE on the replica can get different results to
what it'd get with SSI on the master. It's entirely possible for a read
xact on the master to get a snapshot after X1 commits and before X2 commits,
same as READ COMMITTED. SSI shouldn't AFAIK come into play with no writes
to create a pivot. Is that wrong?

If we applied this sequence to the downstream in commit order we'd still
get correct results on the heap after applying both. We'd have an
intermediate state where X1 is committed but X2 isn't, but we can have the
same on the master. SSI doesn't AFAIK mask X1 from becoming visible in a
snapshot until X2 commits or anything, right?

> Due to the action of SSI on the source
> machine, you know that there could not be any SERIALIZABLE
> transaction which saw the inconsistent state between the two
> commits, but on replicas we don't yet manage that.

OK, maybe that's what I'm missing. How exactly does SSI ensure that? (An
RTFM link / hint is fine, but I didn't find it in the SSI section of TFM, at
least not in a way I recognised).

> The key is that
> there is a read-write dependency (a/k/a rw-conflict) between the
> two transactions which tells you that the second to commit has to
> come before the first in any graph of apparent order of execution.
>

Yeah, I get that part. How does that stop a 3rd SERIALIZABLE xact from
getting a snapshot between the two commits and reading from there?

> The tricky part is that when there are two overlapping SERIALIZABLE
> transactions and one of them has modified data and committed, and
> there is an overlapping SERIALIZABLE transaction which is not READ
> ONLY which has not yet reached completion (COMMIT or ROLLBACK) the
> correct ordering remains in doubt -- there is no way to know which
> might need to commit first, or whether it even matters. I am
> skeptical about whether in logical replication (including MMR), it
> is going to be possible to manage this by finding "safe snapshots".
> The only alternative I can see, though, is to suspend replication
> while correct transaction ordering remains in doubt. A big READ
> ONLY transaction would not cause a replication stall, but a big
> READ WRITE transaction could cause an indefinite stall. Simon
> seemed to be saying that this is unacceptable, but I tend to think
> it is a viable approach for some workloads, especially if the READ
> ONLY transaction property is used when possible.
>

We already have huge replication stalls when big write xacts occur. We
don't start sending any data for the xact to a peer until it commits, and
once we start we don't send any other xact data until that xact is received
(and probably applied) by the peer.

I'd like to address that by introducing xact streaming / interleaved xacts,
where we stream big xacts on the wire as they occur and buffer them on the
peer, possibly speculatively applying them too. This requires that
individual row changes be tagged with subxact IDs and that
subxact-to-top-level-xact mapping info be sent, so the peer can accumulate
the right xacts into the right buffers. Basically offloading reorder
buffering to the peer.
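
A minimal Python sketch of the peer-side buffering, under stated assumptions:
the class and method names are hypothetical, not an existing PostgreSQL or
logical decoding API, and it assumes the subxact-to-top-level mapping for a
subxact arrives before any of that subxact's row changes. Subxact rollback is
left out to keep it short.

```python
from collections import defaultdict
from typing import Dict, List


class PeerReorderBuffer:
    """Accumulates interleaved streamed changes per top-level xact."""

    def __init__(self) -> None:
        # subxact ID -> top-level xact ID (hypothetical mapping messages)
        self.top_level_of: Dict[int, int] = {}
        # top-level xact ID -> changes in arrival order
        self.changes: Dict[int, List[str]] = defaultdict(list)

    def assign_subxact(self, subxid: int, top_xid: int) -> None:
        self.top_level_of[subxid] = top_xid

    def add_change(self, xid: int, change: str) -> None:
        # An ID with no mapping is treated as a top-level xact.
        top_xid = self.top_level_of.get(xid, xid)
        self.changes[top_xid].append(change)

    def _forget(self, top_xid: int) -> List[str]:
        # Drop the mappings and hand back whatever was buffered.
        self.top_level_of = {s: t for s, t in self.top_level_of.items()
                             if t != top_xid}
        return self.changes.pop(top_xid, [])

    def commit(self, top_xid: int) -> List[str]:
        """Return the buffered changes so they can be applied, then forget them."""
        return self._forget(top_xid)

    def abort(self, top_xid: int) -> None:
        """Discard everything buffered for an xact that rolled back upstream."""
        self._forget(top_xid)


buf = PeerReorderBuffer()
buf.assign_subxact(11, 10)          # subxact 11 belongs to top-level xact 10
buf.add_change(10, "INSERT a")
buf.add_change(20, "UPDATE x")      # a different xact, interleaved on the wire
buf.add_change(11, "INSERT b")
print(buf.commit(10))               # ['INSERT a', 'INSERT b']
```

Nothing here applies changes speculatively; the changes only leave the buffer
when the commit for their top-level xact arrives.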

That same mechanism would let replication continue while the logical
serializable commit order is in doubt, blocking only the actual commit from
proceeding, and only on those xacts. I think.

That said I'm still clearly more fuzzy about the details of what SSI does,
what it guarantees and how it works than I thought I was, so I may just be
handwaving pointlessly at this point. I'd better read some code...

> There might be some wiggle room in terms of letting
> non-SERIALIZABLE transactions commit while the ordering of
> SERIALIZABLE transactions remains in doubt, but that would involve
> allowing bigger deviations from commit order in transaction
> application, which may confuse people. The argument on the other
> side is that if they use transaction isolation less strict than
> SERIALIZABLE that they are vulnerable to seeing anomalies anyway,
> so they must be OK with that.
>

Yeah. I'd be inclined to do just that, and with that argument.

> Hopefully this is in some way helpful....
>

Very, thank you.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
