Re: Transactions involving multiple postgres foreign servers

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Transactions involving multiple postgres foreign servers
Date: 2015-07-07 09:25:54
Message-ID: 559B9B22.1050800@iki.fi
Lists: pgsql-hackers

On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:
> Hi All,
>
> Here are the steps and infrastructure for achieving atomic commits across
> multiple foreign servers. I have tried to address most of the concerns
> raised in this mail thread before. Let me know, if I have left something.
> Attached is a WIP patch implementing the same for postgres_fdw. I have
> tried to make it FDW-independent.

Wow, this is going to be a lot of new infrastructure. This is going to
need good documentation, explaining how two-phase commit works in
general, how it's implemented, how to monitor it etc. It's important to
explain all the possible failure scenarios where you're left with
in-doubt transactions, and how the DBA can resolve them.

Since we're building a Transaction Manager into PostgreSQL, please put a
lot of thought into what kind of APIs it provides to the rest of the
system: APIs for monitoring it, configuring it, etc., and for how an
extension could participate in a transaction without necessarily being
an FDW.

Regarding the configuration, there are many different behaviours that an
FDW could implement:

1. The FDW is read-only. Commit/abort behaviour is moot.
2. Transactions are not supported. All updates happen immediately
regardless of the local transaction.
3. Transactions are supported, but two-phase commit is not. There are
three different ways we can use the remote transactions in that case:
3.1. Commit the remote transaction before local transaction.
3.2. Commit the remote transaction after local transaction.
3.3. As long as there is only one such FDW involved, we can still do
safe two-phase commit using so-called Last Resource Optimization.
4. Full two-phase commit support.

We don't necessarily have to support all of that, but let's keep all
these cases in mind when we design how to configure FDWs. There's
more to it than "does it support 2PC".
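
To make the cases above concrete, here is a rough sketch of how a
capability setting could drive the choice of commit strategy. All of the
names (FdwXactCapability, CommitPlan, choose_commit_plan) are made up for
illustration and do not come from the patch. The array is assumed to hold
only the foreign servers the transaction touched, and the sketch leaves
out the choice between 3.1 and 3.2 entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability levels, mirroring cases 1-4 in the list above. */
typedef enum FdwXactCapability
{
    FDW_XACT_READ_ONLY,     /* case 1: commit/abort behaviour is moot */
    FDW_XACT_NONE,          /* case 2: updates take effect immediately */
    FDW_XACT_ONE_PHASE,     /* case 3: commit/abort, but no prepare */
    FDW_XACT_TWO_PHASE      /* case 4: full two-phase commit */
} FdwXactCapability;

/* Hypothetical outcomes of planning the commit. */
typedef enum CommitPlan
{
    PLAN_LOCAL_ONLY,        /* no foreign server needs coordinating */
    PLAN_TWO_PHASE,         /* every foreign writer can prepare */
    PLAN_LAST_RESOURCE,     /* case 3.3: exactly one one-phase server */
    PLAN_NOT_ATOMIC         /* atomic commit cannot be guaranteed */
} CommitPlan;

/*
 * Pick a plan from the capabilities of the foreign servers involved in
 * the transaction. Read-only servers are ignored.
 */
CommitPlan
choose_commit_plan(const FdwXactCapability *servers, size_t n)
{
    size_t      none = 0;
    size_t      one_phase = 0;
    size_t      two_phase = 0;

    assert(servers != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        switch (servers[i])
        {
            case FDW_XACT_READ_ONLY:
                break;
            case FDW_XACT_NONE:
                none++;
                break;
            case FDW_XACT_ONE_PHASE:
                one_phase++;
                break;
            case FDW_XACT_TWO_PHASE:
                two_phase++;
                break;
        }
    }

    /* Non-transactional servers, or two one-phase servers, break atomicity. */
    if (none > 0 || one_phase > 1)
        return PLAN_NOT_ATOMIC;
    /* A single one-phase server can be committed last, after the prepares. */
    if (one_phase == 1)
        return PLAN_LAST_RESOURCE;
    if (two_phase > 0)
        return PLAN_TWO_PHASE;
    return PLAN_LOCAL_ONLY;
}
```

For example, one 2PC-capable server plus one commit-only server yields
PLAN_LAST_RESOURCE, while two commit-only servers yield PLAN_NOT_ATOMIC.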

> A. Steps during transaction processing
> ------------------------------------------------
>
> 1. When an FDW connects to a foreign server and starts a transaction, it
> registers that server with a boolean flag indicating whether that server is
> capable of participating in a two phase commit. In the patch this is
> implemented using function RegisterXactForeignServer(), which raises an
> error, thus aborting the transaction, if there is at least one foreign
> server incapable of 2PC in a multiserver transaction. This error is thrown as
> early as possible. If all the foreign servers involved in the transaction
> are capable of 2PC, the function just updates the information. As of now,
> in the patch the function is in the form of a stub.
>
> Whether a foreign server is capable of 2PC, can be
> a. FDW level decision e.g. file_fdw as of now, is incapable of 2PC but it
> can build the capabilities which can be used for all the servers using
> file_fdw
> b. a decision based on server version type etc. thus FDW can decide that by
> looking at the server properties for each server
> c. a user decision where the FDW can allow a user to specify it in the form
> of CREATE/ALTER SERVER option. Implemented in the patch.
>
> For a transaction involving only a single foreign server, the current code
> remains unaltered as two phase commit is not needed.

Just to be clear: you also need two-phase commit if the transaction
updated anything in the local server and in even one foreign server.
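
Put differently, the test is "did two or more participants write?", and
the local server counts as a participant. A minimal sketch of that rule,
with a hypothetical helper name and hypothetical parameters that are not
taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether atomic commit needs two-phase
 * commit. local_wrote says whether the local server was modified;
 * foreign_writers counts the foreign servers that were modified.
 */
bool
needs_two_phase_commit(bool local_wrote, int foreign_writers)
{
    int         writers;

    assert(foreign_writers >= 0);

    /* The local server is one more participant if it was modified. */
    writers = foreign_writers + (local_wrote ? 1 : 0);

    /* A single writer can simply commit; two or more must coordinate. */
    return writers >= 2;
}
```

So a transaction that wrote locally and on a single foreign server gets
true here, while one that only wrote on that single foreign server gets
false.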

> D. Persistent and in-memory storage considerations
> --------------------------------------------------------------------
> I considered following options for persistent storage
> 1. in-memory table and file(s) - The foreign transaction entries are saved
> and manipulated in shared memory. They are written to file whenever
> persistence is necessary e.g. while registering the foreign transaction in
> step A.2. Requirements C.1, C.2 need some SQL interface in the form of
> built-in functions or SQL commands.
>
> The patch implements the in-memory foreign transaction table as a fixed
> size array of foreign transaction entries (similar to prepared transaction
> entries in twophase.c). This puts a restriction on number of foreign
> prepared transactions that need to be maintained at a time. We need
> separate locks to synchronize the access to the shared memory; the patch
> uses only a single LW lock. There is restriction on the length of prepared
> transaction id (or prepared transaction information saved by FDW to be
> general), since everything is being saved in fixed size memory. We may be
> able to overcome that restriction by writing this information to separate
> files (one file per foreign prepared transaction). We need to take the same
> route as 2PC for C.5.

Your current approach with a file that's flushed to disk on every update
has a few problems. Firstly, it's not crash safe. Secondly, if you make
it crash-safe with fsync(), performance will suffer. You're going to
need several fsyncs per commit with 2PC anyway, there's no way
around that, but the scalable way to do that is to use the WAL so that
one fsync() can flush more than one update in one operation.
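
To illustrate why the WAL scales better here, below is a toy model of a
log where a flush makes everything appended so far durable, so a later
flush request for an already-durable position costs nothing. This is only
a sketch: WalSim, wal_append and wal_flush are invented names, not the
real WAL API, and the fsync is just a counter.

```c
#include <assert.h>

/* Toy log: positions are byte offsets, the "fsync" is only counted. */
typedef struct WalSim
{
    unsigned long insert_pos;   /* end of the last appended record */
    unsigned long flushed_pos;  /* everything before this is durable */
    unsigned long fsync_calls;  /* number of simulated fsync() calls */
} WalSim;

/* Append a record and return the position the caller must wait for. */
unsigned long
wal_append(WalSim *wal, unsigned long record_len)
{
    assert(record_len > 0);
    wal->insert_pos += record_len;
    return wal->insert_pos;
}

/* Make the log durable at least up to 'upto'. */
void
wal_flush(WalSim *wal, unsigned long upto)
{
    assert(upto <= wal->insert_pos);

    /* Someone else's flush already covered this record: no fsync needed. */
    if (upto <= wal->flushed_pos)
        return;

    /* One fsync covers every record appended so far, not just ours. */
    wal->flushed_pos = wal->insert_pos;
    wal->fsync_calls++;
}
```

If three updates are appended before the first of them asks for a flush,
all three become durable with a single simulated fsync, whereas a
file-per-update scheme would pay for three.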

So I think you'll need to do something similar to the pg_twophase files.
WAL-log each update, and only flush the file/files to disk on a
checkpoint. Perhaps you could use the pg_twophase infrastructure for
this directly, by essentially treating every local transaction as a
two-phase transaction, with some extra flag to indicate that it's an
internally-created one.
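
Roughly, the scheme is: log first, apply to shared memory, copy the
in-memory state to files only at checkpoint, and on recovery load the
files and replay the WAL written since the checkpoint. A small in-memory
model of that, assuming made-up names (Node, log_and_apply, checkpoint,
crash_and_recover) and fixed-size arrays in place of real files:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_XACTS 8
#define MAX_WAL   32

typedef enum RecType { REC_PREPARE, REC_RESOLVE } RecType;

typedef struct WalRecord
{
    RecType     type;
    int         xact_id;
} WalRecord;

/* Table of in-doubt foreign transactions, indexed by a small id. */
typedef struct XactTable
{
    bool        pending[MAX_XACTS];
} XactTable;

typedef struct Node
{
    XactTable   mem;            /* shared memory: lost on crash */
    XactTable   disk;           /* state files as of the last checkpoint */
    WalRecord   wal[MAX_WAL];   /* the log, assumed durable */
    int         wal_len;
    int         redo_from;      /* first record after the last checkpoint */
} Node;

static void
apply_record(XactTable *t, WalRecord r)
{
    t->pending[r.xact_id] = (r.type == REC_PREPARE);
}

/* Write-ahead: append the record to the log, then change shared memory. */
void
log_and_apply(Node *n, RecType type, int xact_id)
{
    WalRecord   r = {type, xact_id};

    assert(xact_id >= 0 && xact_id < MAX_XACTS);
    assert(n->wal_len < MAX_WAL);

    n->wal[n->wal_len++] = r;
    apply_record(&n->mem, r);
}

/* Only a checkpoint writes the state files; older WAL is then unneeded. */
void
checkpoint(Node *n)
{
    n->disk = n->mem;
    n->redo_from = n->wal_len;
}

/* After a crash: reload the files, then replay WAL since the checkpoint. */
void
crash_and_recover(Node *n)
{
    n->mem = n->disk;
    for (int i = n->redo_from; i < n->wal_len; i++)
        apply_record(&n->mem, n->wal[i]);
}
```

A transaction prepared before the checkpoint is found again through the
state files even though its WAL record is never replayed, and one
prepared after the checkpoint is found through replay.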

> 2. New catalog - This method takes out the need to have separate method for
> C1, C5 and even C2, also the synchronization will be taken care of by row
> locks, there will be no limit on the number of foreign transactions as well
> as the size of foreign prepared transaction information. But big problem
> with this approach is that, the changes to the catalogs are atomic with the
> local transaction. If a foreign prepared transaction can not be aborted
> while the local transaction is rolled back, that entry needs to be retained.
> But since the local transaction is aborting the corresponding catalog entry
> would become invisible and thus unavailable to the resolver (alas! we do
> not have autonomous transaction support). We may be able to overcome this,
> by simulating autonomous transaction through a background worker (which can
> also act as a resolver). But the amount of communication and
> synchronization, might affect the performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid,
ignoring MVCC. Not sure how well that would work.

> 3. WAL records - Since the algorithm follows "write ahead of action", WAL
> seems to be a possible way to persist the foreign transaction entries. But
> WAL records can not be used for repeated scan as is required by the foreign
> transaction resolver. Also, replaying WALs is controlled by checkpoint, so
> not all WALs are replayed. If a checkpoint happens while a foreign prepared
> transaction remains unresolved, corresponding WALs will never be replayed,
> thus causing the foreign prepared transaction to remain unresolved forever
> without a clue. So, WALs alone don't seem to be a fit here.

Right. The pg_twophase files solve that exact same issue.

There is clearly a lot of work to do here. I'm marking this as Returned
with Feedback in the commitfest; I don't think more review is going to
be helpful at this point.

- Heikki
