Re:

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Vinayak Pokale <vinpokale(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Subject: Re:
Date: 2016-10-26 05:55:48
Message-ID: CAD21AoDxYZK84wjnLsxmXYrXC4_7LSKesKdx30qiDcDtp4AyOw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 26, 2016 at 2:51 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Fri, Oct 21, 2016 at 2:38 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
>>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>> However, when I briefly read the description in "Transaction Management in
>>>> the R* Distributed Database Management System (C. Mohan et al)" [2], it
>>>> seems that what Ashutosh is saying might be a correct way to proceed after
>>>> all:
>>>
>>> I think Ashutosh is mostly right, but I think there's a lot of room to
>>> doubt whether the design of this patch is good enough that we should
>>> adopt it.
>>>
>>> Consider two possible designs. In design #1, the leader performs the
>>> commit locally and then tries to send COMMIT PREPARED to every standby
>>> server afterward, and only then acknowledges the commit to the client.
>>> In design #2, the leader performs the commit locally and then
>>> acknowledges the commit to the client at once, leaving the task of
>>> running COMMIT PREPARED to some background process. Design #2
>>> involves a race condition, because it's possible that the background
>>> process might not complete COMMIT PREPARED on every node before the
>>> user submits the next query, and that query might then fail to see
>>> supposedly-committed changes. This can't happen in design #1. On the
>>> other hand, there's always the possibility that the leader's session
>>> is forcibly killed, even perhaps by pulling the plug. If the
>>> background process contemplated by design #2 is well-designed, it can
>>> recover and finish sending COMMIT PREPARED to each relevant server
>>> after the next restart. In design #1, that background process doesn't
>>> necessarily exist, so inevitably there is a possibility of orphaning
>>> prepared transactions on the remote servers, which is not good. Even
>>> if the DBA notices them, it won't be easy to figure out whether to
>>> commit them or roll them back.
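
(Purely to make the comparison concrete, here is a rough libpq sketch of
the coordinator-side sequence both designs share. The connection strings,
GIDs and the run() helper are invented, error handling is elided, and it
assumes max_prepared_transactions > 0 on the remotes. The two designs
differ only in when the client is acknowledged relative to phase 2.)

#include <stdio.h>
#include <libpq-fe.h>

/* Run one command and demote any failure to a warning (illustrative). */
static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "WARNING: \"%s\" failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
}

int
main(void)
{
    PGconn     *remote1 = PQconnectdb("host=remote1 dbname=postgres");
    PGconn     *remote2 = PQconnectdb("host=remote2 dbname=postgres");

    /* The distributed transaction touches both remotes. */
    run(remote1, "BEGIN");
    run(remote2, "BEGIN");
    /* ... the actual remote modifications would go here ... */

    /* Phase 1: prepare on every remote involved in the transaction. */
    run(remote1, "PREPARE TRANSACTION 'fx_0001_remote1'");
    run(remote2, "PREPARE TRANSACTION 'fx_0001_remote2'");

    /* The leader commits its own, local transaction here (not shown). */

    /*
     * Phase 2: COMMIT PREPARED on every remote.
     *
     * Design #1: the leader runs these itself and only then acknowledges
     * the commit to the client; no visibility anomaly, but if the session
     * is killed first, prepared transactions are orphaned.
     *
     * Design #2: the leader acknowledges the commit right after its local
     * commit and a background process runs these later; crash-safe if that
     * process keeps durable state, but the client's next query may not yet
     * see the changes on every remote.
     */
    run(remote1, "COMMIT PREPARED 'fx_0001_remote1'");
    run(remote2, "COMMIT PREPARED 'fx_0001_remote2'");

    PQfinish(remote1);
    PQfinish(remote2);
    return 0;
}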
>>>
>>> I think this thought experiment shows that, on the one hand, there is
>>> a point to waiting for commits on the foreign servers, because it can
>>> avoid the anomaly of not seeing the effects of your own commits. On
>>> the other hand, it's ridiculous to suppose that every case can be
>>> handled by waiting, because that just isn't true. You can't be sure
>>> that you'll be able to wait long enough for COMMIT PREPARED to
>>> complete, and even if that works out, you may not want to wait
>>> indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has
>>> no value whatsoever unless the system design is such that failing to
>>> wait for it results in the ROLLBACK PREPARED never getting performed
>>> -- which is a pretty poor excuse.
>>>
>>> Moreover, there are good reasons to think that doing this kind of
>>> cleanup work in the post-commit hooks is never going to be acceptable.
>>> Generally, the post-commit hooks need to be no-fail, because it's too
>>> late to throw an ERROR. But there's very little hope that a
>>> connection to a remote server can be no-fail; anything that involves a
>>> network connection is, by definition, prone to failure. We can try to
>>> guarantee that every single bit of code that runs in the path that
>>> sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
>>> ERROR, but that's going to be quite difficult to do: even palloc() can
>>> throw an error. And what about interrupts? We don't want to be stuck
>>> inside this code for a long time without any hope of the user
>>> recovering control of the session by pressing ^C, but of course the
>>> way that works is it throws an ERROR, which we can't handle here. We
>>> fixed a similar issue for synchronous replication in
>>> 9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
>>> WARNING in that case and then return control to the user. But if we
>>> do that here, all of the code that every FDW emits has to be aware of
>>> that rule and follow it, and it just adds to the list of ways that the
>>> user backend can escape this code without having cleaned up all of the
>>> prepared transactions on the remote side.
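
(As an aside, the demote-everything-to-WARNING wrapper being described
would look roughly like the sketch below; send_commit_prepared_to_server()
is an invented stand-in for the FDW's network call, not an existing
function. It illustrates the problem more than it solves it.)

#include "postgres.h"

/* Invented stand-in for the FDW's "send COMMIT PREPARED" network call. */
extern void send_commit_prepared_to_server(const char *servername,
                                           const char *gid);

static void
resolve_one_foreign_xact(const char *servername, const char *gid)
{
    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        /* Anything in here -- even a palloc() -- can still throw ERROR. */
        send_commit_prepared_to_server(servername, gid);
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* Downgrade the ERROR to a WARNING and keep going. */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();        /* note: this palloc()s too */
        FlushErrorState();

        ereport(WARNING,
                (errmsg("could not commit prepared transaction \"%s\" on server \"%s\": %s",
                        gid, servername, edata->message)));
        FreeErrorData(edata);
    }
    PG_END_TRY();

    /*
     * Even with this wrapper, code outside the PG_TRY can still throw,
     * and a ^C either escapes as an ERROR we cannot handle after commit
     * or is swallowed here, leaving the user unable to cancel.
     */
}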
>>
>> Hmm, IIRC, my patch, and possibly the patch by Masahiko-san and
>> Vinayak, try to resolve prepared transactions in post-commit code. I
>> agree with you here that this should be avoided and that the backend
>> should take over the job of resolving transactions.
>>
>>>
>>> It seems to me that the only way to really make this feature robust is
>>> to have a background worker as part of the equation. The background
>>> worker launches at startup and looks around for local state that tells
>>> it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
>>> operations pending that weren't completed during the last server
>>> lifetime, whether because of a local crash or remote unavailability.
>>> It attempts to complete those and retries periodically. When a new
>>> transaction needs this type of coordination, it adds the necessary
>>> crash-proof state and then signals the background worker. If
>>> appropriate, it can wait for the background worker to complete, just
>>> like a CHECKPOINT waits for the checkpointer to finish -- but if the
>>> CHECKPOINT command is interrupted, the actual checkpoint is
>>> unaffected.
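
(Just to sketch the shape of such a worker, using the 9.6-era bgworker
API: fdwxact_resolver_main() and fdwxact_resolve_pending() are invented
names, the latter standing in for "retry every COMMIT/ROLLBACK PREPARED
still owed to some foreign server, based on the crash-safe local state".)

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"

static volatile sig_atomic_t got_sigterm = false;

/* Invented: resolve whatever is recorded as pending in local state. */
extern void fdwxact_resolve_pending(void);

static void
resolver_sigterm(SIGNAL_ARGS)
{
    got_sigterm = true;
    SetLatch(MyLatch);
}

void
fdwxact_resolver_main(Datum main_arg)
{
    pqsignal(SIGTERM, resolver_sigterm);
    BackgroundWorkerUnblockSignals();

    while (!got_sigterm)
    {
        int         rc;

        /* Retry anything left over after a crash or an unreachable server. */
        fdwxact_resolve_pending();

        /*
         * Sleep until a backend signals us (it has just registered a new
         * distributed transaction) or the retry interval elapses.
         */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       10 * 1000L);
        ResetLatch(MyLatch);

        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        CHECK_FOR_INTERRUPTS();
    }

    proc_exit(0);
}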
>>
>> My patch, and hence the patch by Masahiko-san and Vinayak, has the
>> background worker in the equation. The background worker tries to
>> resolve prepared transactions on the foreign servers periodically.
>> IIRC, sending it a signal when another backend creates foreign
>> prepared transactions is not implemented; that may be a good addition.
>>
>>>
>>> More broadly, the question has been raised as to whether it's right to
>>> try to handle atomic commit and atomic visibility as two separate
>>> problems. The XTM API proposed by Postgres Pro aims to address both
>>> with a single stroke. I don't think that API was well-designed, but
>>> maybe the idea is good even if the code is not. Generally, there are
>>> two ways in which you could imagine that a distributed version of
>>> PostgreSQL might work. One possibility is that one node makes
>>> everything work by going around and giving instructions to the other
>>> nodes, which are more or less unaware that they are part of a cluster.
>>> That is basically the design of Postgres-XC and certainly the design
>>> being proposed here. The other possibility is that the nodes are
>>> actually clustered in some way and agree on things like whether a
>>> transaction committed or what snapshot is current using some kind of
>>> consensus protocol. It is obviously possible to get a fairly long way
>>> using the first approach but it seems likely that the second one is
>>> fundamentally more powerful: among other things, because the first
>>> approach is so centralized, the leader is apt to become a bottleneck.
>>> And, quite apart from that, can a centralized architecture with the
>>> leader manipulating the other workers ever allow for atomic
>>> visibility? If atomic visibility can build on top of atomic commit,
>>> then it makes sense to do atomic commit first, but if we build this
>>> infrastructure and then find that we need an altogether different
>>> solution for atomic visibility, that will be unfortunate.
>>>
>>
>> There are two problems to solve as far as visibility is concerned.
>> 1. Consistency: deciding which transactions' changes are visible to a
>> given transaction. 2. Atomicity of visibility: making the changes by
>> all the segments of a given distributed transaction on different
>> foreign servers visible at the same time, i.e. no other transaction
>> sees the changes of only a few segments without seeing the changes of
>> all of them.
>>
>> The first problem is hard to solve and there are many possible
>> consistency semantics; it is a large topic of discussion in itself.
>>
>> The second problem can be solved on top of this infrastructure by
>> extending the PREPARE TRANSACTION API. I am writing down my ideas so
>> that they don't get lost; this is not a completed design.
>>
>> Assume that we have syntax which tells the foreign server which
>> originating server prepared the transaction: PREPARE TRANSACTION <GID>
>> FOR SERVER <local server name> WITH ID <xid>, where xid is the
>> transaction identifier on the local server. Or we may incorporate that
>> information into the GID itself and have the foreign server know how
>> to decode it.
>>
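(To make the GID variant concrete, here is a small standalone sketch of
packing the originating server's name and local xid into the GID and
decoding it again on the foreign server. The "fx:" prefix, the field
layout and the 64-byte server-name limit are invented for illustration;
the only hard constraint is PostgreSQL's 200-byte GID limit.)

#include <stdio.h>

#define GID_MAX 200             /* PostgreSQL limits GIDs to 200 bytes */

/* Build a GID like "fx:node1:12345" on the originating (local) server. */
static void
build_gid(char *gid, const char *origin_server, unsigned int local_xid)
{
    snprintf(gid, GID_MAX, "fx:%s:%u", origin_server, local_xid);
}

/*
 * Decode the GID on the foreign server, e.g. when another transaction
 * stumbles on an unresolved prepared transaction.  Returns 1 on success.
 */
static int
parse_gid(const char *gid, char *origin_server, unsigned int *local_xid)
{
    return sscanf(gid, "fx:%63[^:]:%u", origin_server, local_xid) == 2;
}

int
main(void)
{
    char        gid[GID_MAX];
    char        origin[64];
    unsigned int xid;

    build_gid(gid, "node1", 12345);

    if (parse_gid(gid, origin, &xid))
        printf("prepared by \"%s\", local xid %u\n", origin, xid);
    return 0;
}
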
>> Once we have that information, the foreign server can actively poll
>> the local server to get the status of transaction xid and resolve the
>> prepared transaction itself. It can go a step further and inform the
>> local server that it has resolved the transaction, so that the local
>> server can purge it from its own state. It can also remember the fate
>> of xid, which can be consulted by another foreign server if the local
>> server is down. If another transaction on the foreign server stumbles
>> on a transaction prepared (but not resolved) by the local server, the
>> foreign server has two options: 1. consult the local server and
>> resolve it, or 2. if that fails to get the status of xid or is
>> otherwise not workable, throw an error, e.g. "in-doubt transaction".
>> There is probably more network traffic happening here, but usually
>> the local server should be able to resolve the transaction before any
>> other transaction stumbles upon it, so the overhead is incurred only
>> when necessary.
>>
>
> I think we can consider atomic commit and atomic visibility
> separately, and atomic visibility can be built on top of atomic
> commit. We can't provide atomic visibility across multiple nodes
> without consistent updates, so I'd like to focus on atomic commit in
> this thread. For atomic commit, the two-phase commit protocol is the
> natural solution, and whatever solution for atomic visibility we end
> up with, atomic commit via 2PC is a necessary feature. We can consider
> an atomic commit feature that has the following functionalities.
> * The local node is responsible for the transaction management among
> the relevant remote servers using 2PC.
> * The local node keeps information about the state of each
> distributed transaction.
> * There is a process that resolves in-doubt transactions.
>
> As Ashutosh mentioned, the current patch supports almost all of these
> functionalities. But I'm trying to update it so that it can keep the
> information for multiple foreign servers in one FDWXact file and one
> entry in shared memory. Otherwise, even though a new remote server can
> be added on the fly, we could need to restart the local server in
> order to allocate a larger shared buffer for FDW transactions whenever
> a remote server is added. I'm also incorporating the other comments.
>
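To make the FDWXact idea above a bit more concrete, here is a very rough
sketch of what a per-transaction entry covering several foreign servers
might look like. Every name and field below is invented for illustration;
the actual patch may lay this out quite differently.

#include "postgres.h"

#define FDWXACT_MAX_GID_LEN      200
#define FDWXACT_MAX_PARTICIPANTS 64     /* arbitrary illustrative cap */

typedef enum FdwXactStatus
{
    FDWXACT_PREPARING,                  /* PREPARE TRANSACTION sent */
    FDWXACT_PREPARED,                   /* prepared on the foreign server */
    FDWXACT_COMMITTING,                 /* COMMIT PREPARED in progress */
    FDWXACT_ABORTING,                   /* ROLLBACK PREPARED in progress */
    FDWXACT_RESOLVED                    /* nothing left to do */
} FdwXactStatus;

/* One foreign server participating in a distributed transaction. */
typedef struct FdwXactParticipant
{
    Oid             serverid;           /* pg_foreign_server OID */
    Oid             userid;             /* user mapping used for connection */
    FdwXactStatus   status;             /* how far resolution has progressed */
    char            gid[FDWXACT_MAX_GID_LEN];   /* GID used on that server */
} FdwXactParticipant;

/*
 * One entry per local transaction, written to an FDWXact file for crash
 * safety and mirrored in shared memory so the resolver worker can pick
 * it up after a restart.  Keeping all participants in one entry means
 * shared memory does not have to grow as foreign servers are added.
 */
typedef struct FdwXactEntry
{
    TransactionId   local_xid;          /* local transaction this belongs to */
    int             nparticipants;
    FdwXactParticipant participants[FDWXACT_MAX_PARTICIPANTS];
} FdwXactEntry;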

The subject line was removed from this message by mistake; please
ignore that. Sorry for the noise.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
