Re: Logical replication and multimaster

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication and multimaster
Date: 2015-12-02 20:18:16
Message-ID: 565F5208.3070100@postgrespro.ru
Lists: pgsql-hackers

Thank you for your reply.

On 12/02/2015 08:30 PM, Robert Haas wrote:
>
> Logical decoding only begins decoding a transaction once the
> transaction is complete. So I would guess that the sequence of
> operations here is something like this - correct me if I'm wrong:
>
> 1. Do the transaction.
> 2. PREPARE.
> 3. Replay the transaction.
> 4. PREPARE the replay.
> 5. COMMIT PREPARED on original machine.
> 6. COMMIT PREPARED on replica.

Logical decoding is started after execution of the XLogFlush method, so the transaction is actually not yet completed at this moment:
- it is not marked as committed in the clog;
- it is still marked as in-progress in the procarray;
- its locks are not released.

We are not using PostgreSQL two-phase commit here.
Instead, our DTM takes control in TransactionIdCommitTree and sends a request to the arbiter, which in turn waits for the status of the committing transaction on the replicas.
The problem is that transactions are delivered to a replica through a single channel: the logical replication slot.
While such a transaction is waiting for an acknowledgement from the arbiter, it blocks the replication channel, preventing other (parallel) transactions from being replicated and applied.
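To illustrate the idea (this is only a sketch, not the actual patch), the commit path could be intercepted roughly like this; DtmSendCommitRequest and DtmWaitForArbiter are hypothetical helpers standing in for our arbiter protocol:

static void
DtmCommitTree(TransactionId xid, int nxids, TransactionId *xids)
{
    /* Ask the arbiter to coordinate this commit across the cluster
     * (hypothetical helper). */
    DtmSendCommitRequest(xid);

    /* Block until the arbiter reports that the replicas have
     * acknowledged the transaction (hypothetical helper). */
    DtmWaitForArbiter(xid);

    /* Only now mark the transaction tree as committed in the clog. */
    TransactionIdCommitTree(xid, nxids, xids);
}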

I have implemented a pool of background workers; maybe it will be useful not only for me.
It consists of a single-producer, multiple-consumers queue implemented using a buffer in shared memory, a spinlock, and two semaphores.
The API is very simple:

/* Callback executed by a worker for each dequeued chunk of work. */
typedef void (*BgwPoolExecutor)(int id, void* work, size_t size);
/* Returns the pool a newly started worker should attach to. */
typedef BgwPool* (*BgwPoolConstructor)(void);

/* Launch nWorkers background workers attached to the constructed pool. */
extern void BgwPoolStart(int nWorkers, BgwPoolConstructor constructor);
/* Initialize the shared-memory queue and bind the executor callback. */
extern void BgwPoolInit(BgwPool* pool, BgwPoolExecutor executor, char const* dbname, size_t queueSize);
/* Enqueue (work, size); the first available worker dequeues and runs it. */
extern void BgwPoolExecute(BgwPool* pool, void* work, size_t size);

You just put some chunk of bytes (work, size) into this queue; the first available worker will dequeue and execute it.
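The producer side works roughly as follows (a simplified sketch: SemWait/SemSignal stand in for the two semaphores, and the ring-buffer fields and helpers are illustrative, not the actual structure layout):

void
BgwPoolExecute(BgwPool* pool, void* work, size_t size)
{
    SemWait(pool->freeSpace);                /* block until the buffer has room */
    SpinLockAcquire(&pool->lock);
    RingBufferPut(pool, &size, sizeof size); /* write the length prefix ... */
    RingBufferPut(pool, work, size);         /* ... then the payload itself */
    SpinLockRelease(&pool->lock);
    SemSignal(pool->workAvailable);          /* wake one idle background worker */
}

Workers do the mirror image: wait on workAvailable, take the spinlock, dequeue one record, release the lock, signal freeSpace, and invoke the executor callback on the dequeued chunk.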

Using this pool and a larger number of accounts (reducing the probability of conflicts), I get better results.
Now the receiver of logical replication does not execute transactions directly; instead, it places them in the queue, and they are executed concurrently by the pool of background workers.
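In the receiver this looks roughly like the following sketch (ApplyTransaction, MyPoolConstructor, ReceiveDecodedTransaction and QUEUE_SIZE are placeholders for the real apply code):

static void
MyExecutor(int id, void* work, size_t size)
{
    /* Runs inside a background worker: apply one decoded transaction
     * contained in the (work, size) byte chunk (placeholder). */
    ApplyTransaction(work, size);
}

static void
ReceiverMain(BgwPool* pool)
{
    BgwPoolStart(16, MyPoolConstructor);  /* 16 workers gave the best TPS */
    BgwPoolInit(pool, MyExecutor, "mydb", QUEUE_SIZE);

    for (;;)
    {
        size_t size;
        void  *txn = ReceiveDecodedTransaction(&size); /* placeholder */

        /* Enqueue instead of applying inline: the first free worker
         * will dequeue and execute the transaction concurrently. */
        BgwPoolExecute(pool, txn, size);
    }
}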

On a cluster with three nodes, the results of our debit-credit benchmark are the following:

                                      TPS
Multimaster (ACID transactions)     12500
Multimaster (async replication)     34800
Standalone PostgreSQL               44000

We tested two modes: one where the client randomly distributes queries between cluster nodes, and one where the client works with only one master node while the others are used just as replicas. Performance is slightly better in the second case, but the difference is not very large (about 11000 TPS in the first case).

The number of workers in the pool has a significant impact on performance: with 8 workers we get about 7800 TPS, and with 16 workers, 12500.
Performance also greatly depends on the number of accounts (and thus the probability of lock conflicts): with 100 accounts the speed is less than 1000 TPS.

> Step 3 introduces latency proportional to the amount of work the
> transaction did, which could be a lot. If you were doing synchronous
> physical replication, the replay of the COMMIT record would only need
> to wait for the replay of the commit record itself. But with
> synchronous logical replication, you've got to wait for the replay of
> the entire transaction. That's a major bummer, especially if replay
> is single-threaded and there a large number of backends generating
> transactions. Of course, the 2PC dance itself can also add latency -
> that's most likely to be the issue if the transactions are each very
> short.
>
> What I'd suggest is trying to measure where the latency is coming
> from. You should be able to measure how much time each transaction
> spends (a) executing, (b) preparing itself, (c) waiting for the replay
> thread to begin replaying it, (d) waiting for the replay thread to
> finish replaying it, and (e) committing. Separating (c) and (d) might
> be a little bit tricky, but I bet it's worth putting some effort in,
> because the answer is probably important to understanding what sort of
> change will help here. If (c) is the problem, you might be able to
> get around it by having multiple processes, though that only helps if
> applying is slower than decoding. But if (d) is the problem, then the
> only solution is probably to begin applying the transaction
> speculatively before it's prepared/committed. I think.
>
