Re: Transactions involving multiple postgres foreign servers

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: hlinnaka(at)iki(dot)fi
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Transactions involving multiple postgres foreign servers
Date: 2015-07-09 10:18:14
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Hi All,
I have been working on improving the previous implementation and addressing
TODOs in my previous mail. Let me explain the approach first and I will get
to Heikki's comments later in the same mail.

The patch provides support for atomic commit for transactions involving
foreign servers. When a transaction makes changes to foreign servers,
either all the changes to all the foreign servers commit or rollback. We
should not see some changes committed and others rolled back.

Hooks and GUCs
The patch introduces a GUC atomic_foreign_transaction, which when ON
ensures atomic commit for foreign transactions, otherwise not. The value of
this GUC at the time of committing or preparing a local transaction is
used. This gives applications the flexibility to choose the behaviour as
late in the transaction as possible. This GUC has no effect if there are no
foreign servers involved in the transaction.

Another GUC max_fdw_transactions sets the maximum number of transactions
that can be simultaneously prepared on all the foreign servers. This limits
the memory required for remembering the prepared foreign transactions.

Two new FDW hooks are introduced for transaction management.
1. GetPrepareId: to get the prepared transaction identifier for a given
foreign server connection. An FDW which doesn't want to support this
feature can keep this hook undefined (NULL). When defined the hook should
return a unique identifier for the transaction prepared on the foreign
server. The identifier should be unique enough not to conflict with
currently prepared or future transactions. This point will be clear when
discussing phase 2 of 2PC.

2. HandleForeignTransaction: to end a transaction in specified way. The
hook should be able to prepare/commit/rollback current running transaction
on given connection or commit/rollback a previously prepared transaction.
This is described in detail while describing phase two of two-phase commit.
The function is required to return a boolean status of whether the
requested operation was successful or not. The function or its minions
should not raise any error on failure so as not to interfere with the
distributed transaction processing. This point will be clarified more in
the description below.

Achieving atomic commit
If atomic_foreign_transaction is enabled, two-commit protocol is used to
achieve atomic commit for transaction involving foreign servers. All the
foreign servers participating in such transaction should be capable of
participating in two-phase commit protocol. If not, the local and foreign
transactions are aborted as atomic commit can not be guaranteed.

Phase 1
Every FDW needs to register the connection while starting new transaction
on a foreign connection (RegisterXactForeignServer()). A foreign server
connection is identified by foreign server oid and the local user oid
(similar to the entry cached by postgres_fdw). While registering FDW also
tells whether the foreign server is capable of participating in two-phase
commit protocol. How to decide that is left entirely to the FDW. An FDW
like file_fdw may not have 2PC support at all, so all its foreign servers
do not comply with 2PC. An FDW might have all its servers 2PC compliant. An
FDW like postgres_fdw can have some of its servers compliant and some not,
depending upon server version, configuration (max_prepared_transactions =
0) etc. An FDW can decide not to register its connections at all and the
foreign servers belonging to that FDW will not be considered by the core at

During pre-commit processing following steps are executed
1. GetPrepareId hook is called on each of the connections registered to get
the identifier that will be used to prepare the transaction.
2. For each connection the prepared transaction id along with the
connection information, database id and local transaction id (xid) is
recorded in the memory.
3. This is logged in XLOG. If standby is configured, it is replayed on
standby. In case of master failover a standby is able to resolve in-doubt
prepared transactions created by the master.
4. The information is written to an on-disk file in pg_fdw_xact/ directory.
This directory contains one file per prepared transaction on foreign
connection. The file is fsynced during checkpoint similar to pg_twophase
files. The file management in this directory is similar to the way, files
are managed in pg_twophase.
5. HandleForeignTransaction is called to prepare the transaction on given
connection with the identifier provided by GetPrepareId().

If the server crashes after step 5, we will remember the transaction
prepared on the foreign server and will try to abort it after recovery. If
it crashes after step 3 and completion of 5, we will remember a transaction
that was never prepared and try to resolve it later. This scenario will be
described while describing phase 2.

If any of the steps fail including the PREPARE on the foreign server
itself, the local transaction will be aborted. All the prepared
transactions on foreign servers will be aborted as described in phase 2
discussion below. Yet to be prepared transactions are rolled back by using
the same hook. If step 5 fails, the prepared foreign transaction entry is
removed from memory and disk following steps 2,3,4 in phase 2.
HandleForeignTransaction throwing an error will interfere with this, so it
is not expected to throw an error.

If the transactions are prepared on all the foreign servers successfully,
we enter phase 2 of 2PC.

The local transaction is not required to be prepared per say.

Phase 2
After the local transaction has committed or aborted the foreign
transactions prepared as part of the local transaction as committed or
aborted resp. Committing or aborting prepared foreign transaction is
collectively henceforth termed as "resolving" for simplicity. Following
steps are executed while resolving a foreign prepared transaction.

1. Resolve the foreign prepared transaction on corresponding foreign server
using user mapping of local user used at the time of preparing the
transaction. This is done through hook HandleForeignTransaction().
2. If the resolution is successful, remove the prepared foreign transaction
entry from the memory
3. Log about removal of entry in XLOG. When this log is replayed during
recovery or in standby mode, it executes step 4 below.
4. Remove the corresponding file from pg_fdw_xact directory.

If the resolution is unsuccessful, leave the entry untouched. Since this
phase is carried out when no transaction exists, HandleForeignTransaction
should not throw an error and should be designed not to access database
while performing this operation.

In case server crashes after step 1 and before step 3, a resolved foreign
transaction will be considered unresolved when the local server recovers or
standby takes over the master. It will try to resolve the prepared
transaction again and should get an error from foreign server.
HandleForeignTransaction hook should treat this as normal and return true
since the prepared transaction is resolved (or rather there is nothing that
can be done). For such cases it is important that GetPrepareId returns a
transaction identifier which does not conflict with a future tansaction id,
lest we may resolve (may be with wrong outcome) a prepared transaction
which shouldn't be resolved.

Any crash or connection failure in phase 2 leaves the prepared transaction
in unresolved state.

Resolving unresolved foreign transactions
A local/foreign server crash or connection failure after a transaction is
prepared on the foreign server leaves that transaction in unresolved state.
The patch provides a built-in function pg_fdw_resolve() to resolve those
after recovering from the failure. This built-in scans all the prepared
transactions in-memory and decides the fate (commit/rollback) based on the
fate of local transaction that prepared it on the foreign server. It does
not touch entries corresponding to the in-progress local transactions. It
then executes the same steps as phase 2 to resolve the prepared foreign
transactions. Since foreign server information is contained within a
database, the function only touches the entries corresponding to the
database from which it is invoked. A user can configure a daemon or
cron-job to execute this function frequently from various databases.
Alternatively, user can use contrib module pg_fdw_xact_resolver which does
the same using background worker mechanism. This module needs to be
installed and listed in shared_preload_libraries to start the daemon
automatically on the startup.

A foreign server, user mapping corresponding to an unresolved foreign
transaction is not allowed to be dropped or altered until the foreign
transaction is resolved. This is required to retain the connection
properties which need to resolve the prepared transaction on the foreign

Crash recovery
During crash recovery, the files in pg_fdw_xact/ are created or removed
when corresponding WAL records are replayed. After the redo is done
pg_fdw_xact directory is scanned for unresolved foreign prepared
transactions. The files in this directory are named as triplet (xid,
foreign server oid, user oid) to create a unique name for each file. This
scan also emits the oldest transaction id with an unresolved prepared
foreign transactions. This affects oldest active transaction id, since the
status of this transaction id is required to decide the fate of unresolved
prepared foreign transaction.

On standby during WAL replay files are just created or removed. If the
standby is required to finish recovery and take over the master,
pg_fdw_xact is scanned to read unresolved foreign prepared transactions
into the shared memory.

Preparing transaction involving foreign server/s, on local server
While PREPARing a local transaction that involves foreign servers, the
transactions are prepared on the foreign server (as described in phase 1
above), if atomic_foreign_transaction is enabled. If the GUC is disabled,
such local transactions can not be prepared (as of this patch at least).
This also means that all the foreign servers participating in the
transaction to be prepared are required to support 2PC. While
committing/rolling back the prepared transaction the corresponding foreign
prepared transactions are committed or rolled back (as described in phase
2) resp. Any unresolved foreign transactions are resolved the same way as

View for checking the current foreign prepared transactions
A built-in function pg_fdw_xact() lists all the currently prepared foreign
transactions. This function does not list anything on standby while its
replaying WAL, since it doesn't have any entry in-memory. A convenient view
pg_fdw_xacts lists the same with the oids converted to the names.

Handling non-atomic foreign transactions
When atomic_foreign_transaction is disabled, one-phase commit protocol is
used to commit/rollback the foreign transactions. After the local
transaction has committed/aborted, all the foreign transactions on the
registered foreign connections are committed or aborted resp. using hook
HandleForeignTransaction. Failing to commit a foreign transaction does not
affect the other foreign transactions; they are still tried to be committed
(if the local transaction commits).

PITR may rewind the database to a point before an xid associated with an
unresolved foreign transaction. There are two approaches to deal with the
1. Just forget about the unresolved foreign transaction and remove the file
just like we do for a prepared local transaction. But then the prepared
transaction on the foreign server might be left unresolved forever and will
keep holding the resources.
2. Do not allow PITR to such point. We can not get rid of the transaction
id without getting rid of prepared foreign transaction. If we do so, we
might create conflicting files in future and might resolve the transaction
with wrong outcome.

Rest of the mail contains replies to Heikki's comments.

On Tue, Jul 7, 2015 at 2:55 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:
>> Hi All,
>> Here are the steps and infrastructure for achieving atomic commits across
>> multiple foreign servers. I have tried to address most of the concerns
>> raised in this mail thread before. Let me know, if I have left something.
>> Attached is a WIP patch implementing the same for postgres_fdw. I have
>> tried to make it FDW-independent.
> Wow, this is going to be a lot of new infrastructure. This is going to
> need good documentation, explaining how two-phase commit works in general,
> how it's implemented, how to monitor it etc. It's important to explain all
> the possible failure scenarios where you're left with in-doubt
> transactions, and how the DBA can resolve them.

I have included some documentation in the patch. Once we agree on the
functionality, design, I will improve the documentation further.

> Since we're building a Transaction Manager into PostgreSQL, please put a
> lot of thought on what kind of APIs it provides to the rest of the system.
> APIs for monitoring it, configuring it, etc. And how an extension could
> participate in a transaction, without necessarily being an FDW.
The patch has added all of it except extension thing. Let me know if
anything is missing.

Even today and extension can participate in a transaction by registering
transaction and subtransaction call backs. So, as long as an extension (and
so some FDW) does things such that the failures in those do not affect the
atomicity, they can use these callbacks. However, these call backs are not
enough to handle unresolved prepared transactions or handle connectivity
failures in the phase 2. The patch adds infrastructure to do that.

dblink might be something on your mind, but to support dblink here, it will
need too liberal format for storing information about the prepared
transactions on other servers. This format will vary from extension to
extension, and may not be very useful as above. What we might be able to do
is expose the functions for creating files for prepared transactions and
logging about them and let extension use them. BTW, dblink_plus already
supports 2PC for dblink.

> Regarding the configuration, there are many different behaviours that an
> FDW could implement:
> 1. The FDW is read-only. Commit/abort behaviour is moot.

I can think of two flavours of read-only FDW: 1. the underlying data is
read-only 2. the FDW is read-only but the underlying data is not.
In first case, the FDW may choose not to participate in the transaction
management at all, so doesn't register the foreign connections. Still the
rest of the transaction will be atomic.

In second case however, the writes to other foreign server may depend upon
what has been read from the read-only FDW esp. in repeatable read and
higher isolation levels. So it's important that the data once read remains
intact till the transaction commits or at least is prepared, implying we
have to start a transaction on the read-only foreign server. Once the other
foreign transactions get prepared, we might be able to commit the
transaction on read-only foreign server. That optimization is not yet
implemented by my patch. But it should be possible to do in the approach
taken by the patch. Can we leave that as a future enhancement?

Does that solve your concern?

> 2. Transactions are not supported. All updates happen immediately
> regardless of the local transaction.

An FDW can choose not to register its server and local PostgreSQL won't
know about it. Is that acceptable behaviour?

> 3. Transactions are supported, but two-phase commit is not. There are
> three different ways we can use the remote transactions in that case:

This case is supported by using GUC atomic_foreign_transaction. The patch
implements 3.2 approach.

> 3.1. Commit the remote transaction before local transaction.
> 3.2. Commit the remote transaction after local transaction.
> 3.3. As long as there is only one such FDW involved, we can still do safe
> two-phase commit using so-called Last Resource Optimization.

IIUC LRO, the patch uses the local transaction as last resource, which is
always present. The fate of foreign transaction is decided by the fate of
the local transaction, which is not required to be prepared per say. There
is more relevant note later.

> 4. Full two-phases commit support
> We don't necessarily have to support all of that, but let's keep all these
> cases in mind when we design the how to configure FDWs. There's more to it
> than "does it support 2PC".

> A. Steps during transaction processing
>> ------------------------------------------------
>> 1. When an FDW connects to a foreign server and starts a transaction, it
>> registers that server with a boolean flag indicating whether that server
>> is
>> capable of participating in a two phase commit. In the patch this is
>> implemented using function RegisterXactForeignServer(), which raises an
>> error, thus aborting the transaction, if there is at least one foreign
>> server incapable of 2PC in a multiserver transaction. This error thrown as
>> early as possible. If all the foreign servers involved in the transaction
>> are capable of 2PC, the function just updates the information. As of now,
>> in the patch the function is in the form of a stub.
>> Whether a foreign server is capable of 2PC, can be
>> a. FDW level decision e.g. file_fdw as of now, is incapable of 2PC but it
>> can build the capabilities which can be used for all the servers using
>> file_fdw
>> b. a decision based on server version type etc. thus FDW can decide that
>> by
>> looking at the server properties for each server
>> c. a user decision where the FDW can allow a user to specify it in the
>> form
>> of CREATE/ALTER SERVER option. Implemented in the patch.
>> For a transaction involving only a single foreign server, the current code
>> remains unaltered as two phase commit is not needed.
> Just to be clear: you also need two-phase commit if the transaction
> updated anything in the local server and in even one foreign server.

Any local transaction involving a foreign sever transaction uses two-phase
commit for the foreign transaction. The local transaction is not prepared
per say. However, we should be able to optimize a case, when there are no
local changes. I am not able to find a way to deduce that there was no
local change, so I have left that case in this patch. Is there a way to
know whether a local transaction changed something locally or not?

> D. Persistent and in-memory storage considerations
>> --------------------------------------------------------------------
>> I considered following options for persistent storage
>> 1. in-memory table and file(s) - The foreign transaction entries are saved
>> and manipulated in shared memory. They are written to file whenever
>> persistence is necessary e.g. while registering the foreign transaction in
>> step A.2. Requirements C.1, C.2 need some SQL interface in the form of
>> built-in functions or SQL commands.
>> The patch implements the in-memory foreign transaction table as a fixed
>> size array of foreign transaction entries (similar to prepared transaction
>> entries in twophase.c). This puts a restriction on number of foreign
>> prepared transactions that need to be maintained at a time. We need
>> separate locks to syncronize the access to the shared memory; the patch
>> uses only a single LW lock. There is restriction on the length of prepared
>> transaction id (or prepared transaction information saved by FDW to be
>> general), since everything is being saved in fixed size memory. We may be
>> able to overcome that restriction by writing this information to separate
>> files (one file per foreign prepared transaction). We need to take the
>> same
>> route as 2PC for C.5.
> Your current approach with a file that's flushed to disk on every update
> has a few problems. Firstly, it's not crash safe. Secondly, if you make it
> crash-safe with fsync(), performance will suffer. You're going to need to
> need several fsyncs per commit with 2PC anyway, there's no way around that,
> but the scalable way to do that is to use the WAL so that one fsync() can
> flush more than one update in one operation.
> So I think you'll need to do something similar to the pg_twophase files.
> WAL-log each update, and only flush the file/files to disk on a checkpoint.
> Perhaps you could use the pg_twophase infrastructure for this directly, by
> essentially treating every local transaction as a two-phase transaction,
> with some extra flag to indicate that it's an internally-created one.

I have used approach similar to pg_twophase, but implemented it as a
separate code, as the requirements differ. But, I would like to minimize
code by unifying both, if we finalise this design. Suggestions in this
regard will be very helpful.

> 2. New catalog - This method takes out the need to have separate method
>> for
>> C1, C5 and even C2, also the synchronization will be taken care of by row
>> locks, there will be no limit on the number of foreign transactions as
>> well
>> as the size of foreign prepared transaction information. But big problem
>> with this approach is that, the changes to the catalogs are atomic with
>> the
>> local transaction. If a foreign prepared transaction can not be aborted
>> while the local transaction is rolled back, that entry needs to retained.
>> But since the local transaction is aborting the corresponding catalog
>> entry
>> would become invisible and thus unavailable to the resolver (alas! we do
>> not have autonomous transaction support). We may be able to overcome this,
>> by simulating autonomous transaction through a background worker (which
>> can
>> also act as a resolver). But the amount of communication and
>> synchronization, might affect the performance.
> Or you could insert/update the rows in the catalog with xmin=FrozenXid,
> ignoring MVCC. Not sure how well that would work.

I am not aware how to do that. Do we have any precedence in the code.
Something like a reference implementation, which I can follow. It will help
to lift two restrictions
1. Restriction on the number of simultaneously prepared foreign
2. Restriction on the prepared transaction identifier length.

Obviously we may be able to shed a lot of code related to file managment,
lookup etc.

> 3. WAL records - Since the algorithm follows "write ahead of action", WAL
>> seems to be a possible way to persist the foreign transaction entries. But
>> WAL records can not be used for repeated scan as is required by the
>> foreign
>> transaction resolver. Also, replaying WALs is controlled by checkpoint, so
>> not all WALs are replayed. If a checkpoint happens after a foreign
>> prepared
>> transaction remains resolved, corresponding WALs will never be replayed,
>> thus causing the foreign prepared transaction to remain unresolved forever
>> without a clue. So, WALs alone don't seem to be a fit here.
> Right. The pg_twophase files solve that exact same issue.
> There is clearly a lot of work to do here.
I'm marking this as Returned with Feedback in the commitfest, I don't think
> more review is going to be helpful at this point.

That's sad. Hope people to review the patch and help it improve, even if
it's out of commitfest.

> - Heikki

Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_fdw_transact.patch text/x-patch 196.6 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2015-07-09 10:29:14 Re: Improving log capture of TAP tests with IPC::Run
Previous Message David Rowley 2015-07-09 09:44:11 Re: Sharing aggregate states between different aggregate functions