|From:||Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>|
|To:||Robert Haas <robertmhaas(at)gmail(dot)com>|
|Cc:||Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: Transactions involving multiple postgres foreign servers|
Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised earlier in this mail thread; let me know if I have missed something.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to keep the design FDW-independent.
A. Steps during transaction processing
1. When an FDW connects to a foreign server and starts a transaction, it
registers that server along with a boolean flag indicating whether the
server is capable of participating in a two-phase commit. In the patch this
is implemented by the function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if at least one foreign server in a
multi-server transaction is incapable of 2PC. This error is thrown as
early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just records the information. As of now,
the function is a stub in the patch.
Whether a foreign server is capable of 2PC can be:
a. an FDW-level decision, e.g. file_fdw is currently incapable of 2PC, but
if it ever builds that capability, it would apply to all the servers using
that FDW
b. a decision based on server version, type etc., so that the FDW decides
by looking at the properties of each server
c. a user decision, where the FDW allows the user to specify it as a
CREATE/ALTER SERVER option. This is what the patch implements.
For a transaction involving only a single foreign server, the current code
remains unaltered, as two-phase commit is not needed. The rest of the
discussion pertains to transactions involving more than one foreign server.
At commit or abort time, the FDW receives callbacks with the appropriate
events, and takes the following actions on each event.
2. On the XACT_EVENT_PRE_COMMIT event, the FDW coins one prepared
transaction id per foreign server involved and saves it in memory along
with the xid, dbid, foreign server id, user mapping, and foreign
transaction status = PREPARING. The prepared transaction id can be any
byte string. The same information is flushed to disk so as to survive
crashes. This is implemented in the patch as prepare_foreign_xact(). The
persistent and in-memory storage and their usage are discussed later in
this mail. The FDW then prepares the transaction on the foreign server. If
this step is successful, the foreign transaction status is changed to
PREPARED. If the step is unsuccessful, the local transaction is aborted and
each FDW will receive XACT_EVENT_ABORT (discussed later). The updates to
the foreign transaction status need not be flushed to disk, as they can be
inferred from the status of the local transaction.
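The pre-commit sequence above can be sketched like this. This is a hedged
Python model of the described flow, not the C implementation; the UUID-based
prepared id and the dict-based entry are my own illustrative choices:

```python
# Model of XACT_EVENT_PRE_COMMIT handling: coin a prepared transaction
# id, persist the entry as PREPARING, then PREPARE on the remote side;
# on success the in-memory status flips to PREPARED with no extra flush.

import uuid

PREPARING, PREPARED = "PREPARING", "PREPARED"

disk_log = []   # stands in for the on-disk foreign transaction file

def prepare_foreign_xact(entry, remote_prepare):
    """entry: dict with xid/dbid/serverid/usermapping already filled in."""
    # The prepared transaction id can be any byte string; a UUID-derived
    # name is just one plausible choice.
    entry["prepared_id"] = "px_%s" % uuid.uuid4().hex
    entry["status"] = PREPARING
    disk_log.append(dict(entry))          # flushed so it survives a crash
    remote_prepare(entry["prepared_id"])  # PREPARE TRANSACTION 'px_...'
    # Later status updates need not hit disk: they can be re-derived
    # from the local transaction's outcome.
    entry["status"] = PREPARED
    return entry

e = prepare_foreign_xact(
    {"xid": 1234, "dbid": 1, "serverid": 10, "usermapping": 20},
    remote_prepare=lambda pid: None)      # stub for the remote PREPARE
print(e["status"])
```

If remote_prepare raised instead, the entry would stay PREPARING on disk,
matching the crash cases handled by the resolver below.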
3. If the local transaction is committed, the FDW callback gets the
XACT_EVENT_COMMIT event. The foreign transaction status is changed to
COMMITTING, and the FDW tries to commit the foreign transaction with the
prepared transaction id. If the commit is successful, the foreign
transaction entry is removed. If the commit is unsuccessful because of a
local/foreign server crash or a network failure, the foreign prepared
transaction resolver takes care of committing it at a later point in time.
4. If the local transaction is aborted, the FDW callback gets the
XACT_EVENT_ABORT event. At this point, the FDW may or may not have prepared
a transaction on the foreign server as per step 2 above. If it has not
prepared the transaction, it simply aborts the transaction on the foreign
server; a server crash or network failure doesn't alter the ultimate result
in this case. If the FDW has prepared the foreign transaction, it updates
the foreign transaction status to ABORTING and tries to roll back the
prepared transaction. If the rollback is successful, the foreign
transaction entry is removed. If the rollback is not successful, the
foreign prepared transaction resolver takes care of aborting it at a later
point in time.
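Steps 3 and 4 share the same shape: finish the prepared transaction if the
foreign server is reachable, otherwise leave the entry for the resolver. A
minimal Python model of that shared logic (names invented, not patch code):

```python
# Model of XACT_EVENT_COMMIT / XACT_EVENT_ABORT handling for an already
# prepared foreign transaction: remove the entry on success, or leave it
# behind with a COMMITTING/ABORTING status for the resolver on failure.

COMMITTING, ABORTING = "COMMITTING", "ABORTING"
entries = []   # surviving foreign transaction entries

def on_local_xact_end(entry, committed, remote_finish):
    entry["status"] = COMMITTING if committed else ABORTING
    entries.append(entry)
    try:
        # COMMIT PREPARED / ROLLBACK PREPARED on the foreign server.
        remote_finish(entry["prepared_id"], committed)
    except ConnectionError:
        # Server crash or network failure: keep the entry; the foreign
        # prepared transaction resolver finishes it later.
        return False
    entries.remove(entry)
    return True

ok1 = on_local_xact_end({"prepared_id": "px_1"}, committed=True,
                        remote_finish=lambda pid, c: None)
print(ok1, entries)          # entry removed after a successful commit

def unreachable(pid, c):
    raise ConnectionError

ok2 = on_local_xact_end({"prepared_id": "px_2"}, committed=False,
                        remote_finish=unreachable)
print(ok2, entries[0]["status"])   # left for the resolver as ABORTING
```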
B. Foreign prepared transaction resolver
In the patch this is implemented as a built-in function pg_fdw_resolve();
ideally the functionality should be run by a background worker process.
The resolver looks at each entry and invokes the FDW routine to resolve the
transaction. The FDW routine returns a boolean status: true if the prepared
transaction was resolved (committed/aborted), false otherwise.
The resolution works as follows -
1. If the foreign transaction status is COMMITTING or ABORTING, the
resolver commits or aborts the prepared transaction respectively through
the FDW routine. If the transaction is successfully resolved, it removes
the foreign transaction entry.
2. Otherwise, it checks whether the local transaction committed or aborted,
updates the foreign transaction status accordingly, and takes the action
described in step 1 above.
3. The resolver doesn't touch entries created by in-progress local
transactions.
If the server/backend crashes after it has registered the foreign
transaction entry (during step A.2) but before preparing the transaction,
we will be left with a prepared transaction id that was never prepared on
the foreign server. Similarly, if the server/backend crashes after it has
resolved the foreign prepared transaction but before removing the entry,
the same situation can arise. The FDW should detect these situations, when
the foreign server complains about a non-existent prepared transaction id,
and consider such foreign transactions resolved.
After looking at all the entries, the resolver flushes the entries to disk,
so as to retain the latest status across shutdowns and crashes.
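The resolver pass over the entries can be sketched as below. Again a hedged
Python model with invented names, not pg_fdw_resolve() itself; local_outcome
stands in for looking up the local transaction's fate by xid:

```python
# Model of one resolver pass: entries already marked COMMITTING/ABORTING
# are finished directly; others first consult the local transaction's
# fate; entries of in-progress local transactions are left untouched.

def resolve_all(entries, local_outcome, fdw_resolve):
    """local_outcome(xid) -> 'committed' | 'aborted' | 'in-progress'.
    fdw_resolve(entry) -> True if the prepared xact was committed/aborted."""
    remaining = []
    for e in entries:
        if e["status"] not in ("COMMITTING", "ABORTING"):
            fate = local_outcome(e["xid"])
            if fate == "in-progress":
                remaining.append(e)        # resolver leaves it alone
                continue
            e["status"] = "COMMITTING" if fate == "committed" else "ABORTING"
        if not fdw_resolve(e):             # e.g. foreign server unreachable
            remaining.append(e)            # retried on a later pass
    return remaining                       # then flushed to disk

demo = [
    {"xid": 1, "status": "COMMITTING"},   # crash left this half-done
    {"xid": 2, "status": "PREPARED"},     # local xact aborted meanwhile
    {"xid": 3, "status": "PREPARED"},     # local xact still running
]
fates = {1: "committed", 2: "aborted", 3: "in-progress"}
left = resolve_all(demo, fates.get, fdw_resolve=lambda e: True)
print([e["xid"] for e in left])
```

An FDW's resolve routine would also return True when the foreign server
reports an unknown prepared transaction id, per the crash cases above.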
C. Other methods and infrastructure
1. Method to show the current foreign transaction entries (in progress or
waiting to be resolved). Implemented as function pg_fdw_xact() in the patch.
2. Method to drop foreign transaction entries in case they are resolved by
user/DBA themselves. Not implemented in the patch.
3. Method to prevent altering or dropping a foreign server or user mapping
used to prepare a foreign transaction until the latter is resolved. Not
implemented in the patch. While altering or dropping a foreign server or
user mapping, that portion of the code needs to check whether a foreign
transaction entry depends upon the foreign server or user mapping and, if
so, error out.
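The dependency check described in C.3 could look roughly like this. This is
a Python illustration with hypothetical names, not code from the patch:

```python
# Model of the C.3 guard: refuse to alter/drop a foreign server while an
# unresolved foreign transaction entry still references it.

def check_server_not_in_use(serverid, entries):
    if any(e["serverid"] == serverid for e in entries):
        raise RuntimeError(
            "server %d has unresolved prepared foreign transactions"
            % serverid)

pending = [{"serverid": 10, "prepared_id": "px_1"}]
check_server_not_in_use(11, pending)     # fine: no entry references it
try:
    check_server_not_in_use(10, pending)  # would orphan px_1
    blocked = False
except RuntimeError:
    blocked = True
print(blocked)
```

The same check, keyed on the user mapping instead of the server id, covers
the user-mapping half of C.3.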
4. The information about the xid needs to be available until it is decided
whether to commit or abort the foreign transaction and that decision is
persisted. That puts some constraint on xid wraparound or the oldest active
transaction. Not implemented in the patch.
5. Method to propagate the foreign transaction information to the slave.
D. Persistent and in-memory storage considerations
I considered the following options for persistent storage:
1. In-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to a file whenever
persistence is necessary, e.g. while registering the foreign transaction in
step A.2. Requirements C.1 and C.2 need some SQL interface in the form of
built-in functions or SQL commands.
The patch implements the in-memory foreign transaction table as a
fixed-size array of foreign transaction entries (similar to the prepared
transaction entries in twophase.c). This puts a restriction on the number
of foreign prepared transactions that can be maintained at a time. We need
separate locks to synchronize access to the shared memory; the patch uses
only a single LWLock. There is also a restriction on the length of the
prepared transaction id (or, more generally, the prepared transaction
information saved by the FDW), since everything is saved in fixed-size
memory. We may be able to overcome that restriction by writing this
information to separate files (one file per foreign prepared transaction).
We need to take the same route as 2PC for C.5.
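The two restrictions of the fixed-size approach show up directly in a toy
model of such a table. The sizes and names below are invented for
illustration; the patch's actual array and LWLock live in C shared memory:

```python
# Model of a fixed-size foreign transaction table guarded by a single
# lock (analogous to the array in twophase.c plus one LWLock): both the
# entry count and the prepared-id length are hard-capped.

import threading

MAX_FDW_XACTS = 4          # fixed array size caps concurrent entries
MAX_PREPARED_ID_LEN = 64   # fixed slot size caps the prepared id

class FdwXactTable:
    def __init__(self):
        self.slots = [None] * MAX_FDW_XACTS
        self.lock = threading.Lock()   # stands in for the single LWLock

    def insert(self, prepared_id):
        if len(prepared_id) > MAX_PREPARED_ID_LEN:
            raise ValueError("prepared transaction id too long")
        with self.lock:
            for i, slot in enumerate(self.slots):
                if slot is None:
                    self.slots[i] = prepared_id
                    return i
        raise RuntimeError("too many foreign prepared transactions")

    def remove(self, index):
        with self.lock:
            self.slots[index] = None

table = FdwXactTable()
print(table.insert("px_1"))   # first free slot
```

Writing per-transaction files instead, as suggested above, would lift the
length cap but not the slot-count cap.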
2. New catalog - This method removes the need for separate methods for C.1,
C.5 and even C.2; synchronization will be taken care of by row locks, and
there will be no limit on the number of foreign transactions or on the size
of the foreign prepared transaction information. But the big problem with
this approach is that changes to the catalogs are atomic with the local
transaction. If a foreign prepared transaction cannot be aborted while the
local transaction is rolled back, that entry needs to be retained. But
since the local transaction is aborting, the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this
by simulating an autonomous transaction through a background worker (which
can also act as the resolver), but the amount of communication and
synchronization might affect performance.
A mixed approach where the backend shifts the entries from storage in
approach 1 to catalog, thus lifting the constraints on size is possible,
but is very complicated.
Any other ideas to use catalog table as the persistent storage here? Does
anybody think, catalog table is a viable option?
3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records cannot be used for the repeated scans required by the foreign
transaction resolver. Also, replaying WAL is controlled by checkpoints, so
not all WAL records are replayed. If a checkpoint happens while a foreign
prepared transaction remains unresolved, the corresponding WAL records will
never be replayed, causing the foreign prepared transaction to remain
unresolved forever without a clue. So, WAL alone doesn't seem to be a fit
here.
The algorithm relies on the FDWs to take the right steps to a large extent,
rather than controlling each step explicitly. It expects the FDWs to take
the right steps for each event and to call the right functions to
manipulate the foreign transaction entries. It does not ensure the
correctness of these steps, by, say, examining the foreign transaction
entries in response to each event, or by making the callback return the
information and manipulating the entries within the core. I am willing to
go the stricter but more intrusive route if others also think that way.
Otherwise, the current approach is less intrusive and I am fine with that
too.