From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Vinayak Pokale <vinpokale(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Subject:
Date: 2016-10-26 05:51:17
Message-ID: CAD21AoDQYR6LgrNfhaEgJOjtus4aNRCO7Bip9TNEGqGkiFmNmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 21, 2016 at 2:38 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> However, when I briefly read the description in "Transaction Management in
>>> the R* Distributed Database Management System (C. Mohan et al)" [2], it
>>> seems that what Ashutosh is saying might be a correct way to proceed after
>>> all:
>>
>> I think Ashutosh is mostly right, but I think there's a lot of room to
>> doubt whether the design of this patch is good enough that we should
>> adopt it.
>>
>> Consider two possible designs. In design #1, the leader performs the
>> commit locally and then tries to send COMMIT PREPARED to every standby
>> server afterward, and only then acknowledges the commit to the client.
>> In design #2, the leader performs the commit locally and then
>> acknowledges the commit to the client at once, leaving the task of
>> running COMMIT PREPARED to some background process. Design #2
>> involves a race condition, because it's possible that the background
>> process might not complete COMMIT PREPARED on every node before the
>> user submits the next query, and that query might then fail to see
>> supposedly-committed changes. This can't happen in design #1. On the
>> other hand, there's always the possibility that the leader's session
>> is forcibly killed, even perhaps by pulling the plug. If the
>> background process contemplated by design #2 is well-designed, it can
>> recover and finish sending COMMIT PREPARED to each relevant server
>> after the next restart. In design #1, that background process doesn't
>> necessarily exist, so inevitably there is a possibility of orphaning
>> prepared transactions on the remote servers, which is not good. Even
>> if the DBA notices them, it won't be easy to figure out whether to
>> commit them or roll them back.
>>
>> I think this thought experiment shows that, on the one hand, there is
>> a point to waiting for commits on the foreign servers, because it can
>> avoid the anomaly of not seeing the effects of your own commits. On
>> the other hand, it's ridiculous to suppose that every case can be
>> handled by waiting, because that just isn't true. You can't be sure
>> that you'll be able to wait long enough for COMMIT PREPARED to
>> complete, and even if that works out, you may not want to wait
>> indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has
>> no value whatsoever unless the system design is such that failing to
>> wait for it results in the ROLLBACK PREPARED never getting performed
>> -- which is a pretty poor excuse.
>>
>> Moreover, there are good reasons to think that doing this kind of
>> cleanup work in the post-commit hooks is never going to be acceptable.
>> Generally, the post-commit hooks need to be no-fail, because it's too
>> late to throw an ERROR. But there's very little hope that a
>> connection to a remote server can be no-fail; anything that involves a
>> network connection is, by definition, prone to failure. We can try to
>> guarantee that every single bit of code that runs in the path that
>> sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
>> ERROR, but that's going to be quite difficult to do: even palloc() can
>> throw an error. And what about interrupts? We don't want to be stuck
>> inside this code for a long time without any hope of the user
>> recovering control of the session by pressing ^C, but of course the
>> way that works is it throws an ERROR, which we can't handle here. We
>> fixed a similar issue for synchronous replication in
>> 9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
>> WARNING in that case and then return control to the user. But if we
>> do that here, all of the code that every FDW emits has to be aware of
>> that rule and follow it, and it just adds to the list of ways that the
>> user backend can escape this code without having cleaned up all of the
>> prepared transactions on the remote side.
>
> Hmm, IIRC, my patch and possibly patch by Masahiko-san and Vinayak,
> tries to resolve prepared transactions in post-commit code. I agree
> with you here, that it should be avoided and the backend should take
> over the job of resolving transactions.
>
>>
>> It seems to me that the only way to really make this feature robust is
>> to have a background worker as part of the equation. The background
>> worker launches at startup and looks around for local state that tells
>> it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
>> operations pending that weren't completed during the last server
>> lifetime, whether because of a local crash or remote unavailability.
>> It attempts to complete those and retries periodically. When a new
>> transaction needs this type of coordination, it adds the necessary
>> crash-proof state and then signals the background worker. If
>> appropriate, it can wait for the background worker to complete, just
>> like a CHECKPOINT waits for the checkpointer to finish -- but if the
>> CHECKPOINT command is interrupted, the actual checkpoint is
>> unaffected.
>
> My patch and hence patch by Masahiko-san and Vinayak have the
> background worker in the equation. The background worker tries to
> resolve prepared transactions on the foreign server periodically.
> IIRC, sending it a signal when another backend creates foreign
> prepared transactions is not implemented. That may be a good addition.
>
>>
>> More broadly, the question has been raised as to whether it's right to
>> try to handle atomic commit and atomic visibility as two separate
>> problems. The XTM API proposed by Postgres Pro aims to address both
>> with a single stroke. I don't think that API was well-designed, but
>> maybe the idea is good even if the code is not. Generally, there are
>> two ways in which you could imagine that a distributed version of
>> PostgreSQL might work. One possibility is that one node makes
>> everything work by going around and giving instructions to the other
>> nodes, which are more or less unaware that they are part of a cluster.
>> That is basically the design of Postgres-XC and certainly the design
>> being proposed here. The other possibility is that the nodes are
>> actually clustered in some way and agree on things like whether a
>> transaction committed or what snapshot is current using some kind of
>> consensus protocol. It is obviously possible to get a fairly long way
>> using the first approach but it seems likely that the second one is
>> fundamentally more powerful: among other things, because the first
>> approach is so centralized, the leader is apt to become a bottleneck.
>> And, quite apart from that, can a centralized architecture with the
>> leader manipulating the other workers ever allow for atomic
>> visibility? If atomic visibility can build on top of atomic commit,
>> then it makes sense to do atomic commit first, but if we build this
>> infrastructure and then find that we need an altogether different
>> solution for atomic visibility, that will be unfortunate.
>>
>
> There are two problems to solve as far as visibility is concerned. 1.
> Consistency: changes by which transactions are visible to a given
> transaction 2. Making visible, the changes by all the segments of a
> given distributed transaction on different foreign servers, at the
> same time IOW no other transaction sees changes by only few segments
> but does not see changes by all the transactions.
>
> First problem is hard to solve and there are many consistency
> symantics. A large topic of discussion.
>
> The second problem can be solved on top of this infrastructure by
> extending PREPARE transaction API. I am writing down my ideas so that
> they don't get lost. It's not a completed design.
>
> Assume that we have syntax which tells the originating server which
> prepared the transaction. PREPARE TRANSACTION <GID> FOR SERVER <local
> server name> with ID <xid> ,where xid is the transaction identifier on
> local server. OR we may incorporate that information in GID itself and
> the foreign server knows how to decode it.
>
> Once we have that information, the foreign server can actively poll
> the local server to get the status of transaction xid and resolves the
> prepared transaction itself. It can go a step further and inform the
> local server that it has resolved the transaction, so that the local
> server can purge it from it's own state. It can remember the fate of
> xid, which can be consulted by another foreign server if the local
> server is down. If another transaction on the foreign server stumbles
> on a transaction prepared (but not resolved) by the local server,
> foreign server has two options - 1. consult the local server and
> resolve 2. if the first options fails to get the status of xid or that
> if that option is not workable, throw an error e.g. indoubt
> transaction. There is probably more network traffic happening here.
> Usually, the local server should be able to resolve the transaction
> before any other transaction stumbles upon it. The overhead is
> incurred only when necessary.
>

I think we can consider the atomic commit and the atomic visibility
separately, and the atomic visibility can build on the top of the
atomic commit. We can't provide the atomic visibility across multiple
nodes without consistent update. So I'd like to focus on atomic commit
in this thread. Considering to providing the atomic commit, the two
phase commit protocol is the perfect solution for providing atomic
commit. Whatever type of solution for atomic visibility we have, the
atomic commit by 2PC is necessary feature. We can consider to have the
atomic commit feature that ha following functionalities.
* The local node is responsible for the transaction management among
relevant remote servers using 2PC.
* The local node has information about the state of distributed
transaction state.
* There is a process resolving in-doubt transaction.

As Ashutosh mentioned, current patch supports almost these
functionalities. But I'm trying to update it so that it can have
multiple foreign server information into one FDWXact file, one entry
on shared buffer. Because in spite of that new remote server can be
added on the fly, we could need to restart local server in order to
allocate the more large shared buffer for fdw transaction whenever
remote server is added. Also I'm incorporating other comments.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Responses

  • Re: at 2016-10-26 05:55:48 from Masahiko Sawada

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2016-10-26 05:55:48 Re:
Previous Message Tsunakawa, Takayuki 2016-10-26 05:46:23 [bug fix] Stats collector is not restarted on the standby