Re: Synchronous replication patch built on SR

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org, Hans-Juergen Schoenig <hs(at)cybertec(dot)at>
Subject: Re: Synchronous replication patch built on SR
Date: 2010-05-14 11:56:11
Message-ID: AANLkTim58z2P1S69Z10ixmbdfPVltdl3_EndVy4Xtj_L@mail.gmail.com
Lists: pgsql-hackers

2010/4/29 Boszormenyi Zoltan <zb(at)cybertec(dot)at>:
> attached is a patch that does $SUBJECT, we are submitting it for 9.1.
> I have updated it to today's CVS after the "wal_level" GUC went in.

I'm planning to create the synchronous replication patch for 9.1, too.
My design is outlined in the wiki. Let's work together on the design.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

Log-shipping replication can offer several synchronization levels, as
follows. Which are you going to work on?

The transaction commit on the master
#1 doesn't wait for replication (already supported in 9.0)
#2 waits for WAL to be received by the standby
#3 waits for WAL to be received and flushed by the standby
#4 waits for WAL to be received, flushed and replayed by the standby
..etc?

I'm planning to add #2 and #3 in 9.1. #4 is useful but is outside
the scope of my development, at least for 9.1. In #4, a read-only query
on the standby can easily block recovery through a lock conflict and
leave the transaction commit on the master stuck. I think this problem
is difficult to address within 9.1. But the design and implementation
of #2 and #3 need to be easily extensible to #4.
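
To make the four levels concrete, here is a minimal sketch in Python (not
PostgreSQL code). It assumes the standby's progress can be expressed as
three WAL positions; SyncLevel, StandbyProgress and commit_may_return are
hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class SyncLevel(IntEnum):
    ASYNC = 1   # #1: commit does not wait for replication
    RECV = 2    # #2: wait until the standby has received the WAL
    FLUSH = 3   # #3: wait until the standby has received and flushed it
    REPLAY = 4  # #4: wait until the standby has also replayed it


@dataclass
class StandbyProgress:
    # Hypothetical WAL positions reported by the standby (plain integers here).
    received: int
    flushed: int
    replayed: int


def commit_may_return(level, commit_pos, progress):
    """Return True if a commit at commit_pos may be reported to the client."""
    if level == SyncLevel.ASYNC:
        return True
    if level == SyncLevel.RECV:
        return progress.received >= commit_pos
    if level == SyncLevel.FLUSH:
        return progress.flushed >= commit_pos
    return progress.replayed >= commit_pos
```

Written this way, supporting #4 later only adds one more position to
compare against, which is the kind of extensibility meant above.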

> How does it work?
>
> First, the walreceiver and the walsender are now able to communicate
> in a duplex way on the same connection, so while COPY OUT is
> in progress from the primary server, the standby server is able to
> issue PQputCopyData() to pass the transaction IDs that were seen
> with XLOG_XACT_COMMIT or XLOG_XACT_PREPARE
> signatures. I did this by adding a new protocol message type, with letter
> 'x', that's only acknowledged by the walsender process. The regular
> backend was intentionally unchanged so an SQL client gets a protocol
> error. A new libpq call, PQsetDuplexCopy(), sends this
> new message before sending START_REPLICATION. The primary
> makes a note of it in the walsender process' entry.
>
> I had to move the TransactionIdLatest(xid, nchildren, children) call
> that computes latestXid earlier in RecordTransactionCommit(), so
> it's in the critical section now, just before the
> XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata)
> call. Otherwise, there was a race condition between the primary
> and the standby server, where the standby server might have seen
> the XLOG_XACT_COMMIT record for some XIDs before the
> transaction in the primary server marked itself waiting for this XID,
> resulting in stuck transactions.

You seem to have chosen #4 as the synchronization level. Right?

In your design, the transaction commit on the master waits for its XID
to be read from the XLOG_XACT_COMMIT record and sent back by the standby.
Right? This design does not seem extensible to #2 and #3, since
walreceiver cannot read the XID from the XLOG_XACT_COMMIT record. How
about using the LSN instead of the XID? That is, the transaction commit
waits until the standby has reached its LSN. I think the LSN is easier
for walreceiver and the startup process to use.
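
A rough sketch of the LSN-based wait, in Python rather than backend C and
under the assumption that LSNs can be compared as plain integers.
CommitWaitQueue and its methods are hypothetical names, not anything in
the patch or in PostgreSQL.

```python
import heapq


class CommitWaitQueue:
    """Backends waiting for the standby to reach their commit LSN."""

    def __init__(self):
        self._heap = []      # min-heap of (commit_lsn, backend_id)
        self.acked_lsn = 0   # highest LSN the standby has reported so far

    def wait_for(self, backend_id, commit_lsn):
        """Return True if the commit is already covered, else enqueue it."""
        if commit_lsn <= self.acked_lsn:
            return True
        heapq.heappush(self._heap, (commit_lsn, backend_id))
        return False

    def standby_reported(self, lsn):
        """Record the standby's reported LSN; return the backends released."""
        self.acked_lsn = max(self.acked_lsn, lsn)
        released = []
        while self._heap and self._heap[0][0] <= self.acked_lsn:
            released.append(heapq.heappop(self._heap)[1])
        return released
```

The standby only has to report one number, and which number it reports
(received, flushed or replayed position) selects the level, so the same
queue would serve #2, #3 and #4.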

What if the "synchronous" standby starts up from a very old backup?
Would transactions on the master need to wait until a large amount of
outstanding WAL has been applied? I think that synchronous replication
should start out as *asynchronous* replication and switch to the sync
level once the gap between the servers has become small enough.
What's your opinion?
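
The catch-up idea could look roughly like this. It is only a sketch in
Python: the threshold, the integer LSNs and the StandbyState class are
all assumptions made for illustration.

```python
class StandbyState:
    """Tracks whether a standby has caught up enough to be treated as sync."""

    def __init__(self, catchup_threshold):
        # Hypothetical tunable: largest WAL gap at which we switch to sync.
        self.catchup_threshold = catchup_threshold
        self.synchronous = False  # every standby starts out asynchronous

    def update(self, master_lsn, standby_lsn):
        """Update state from the latest positions; return True if sync."""
        gap = master_lsn - standby_lsn
        if not self.synchronous and gap <= self.catchup_threshold:
            self.synchronous = True
        return self.synchronous
```

While update() returns False, commits on the master would not wait for
this standby. In this sketch a standby never falls back to async once it
has switched; whether it should is a separate design question.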

> I have added 3 new options, two GUCs in postgresql.conf and one
> setting in recovery.conf. These options are:
>
> 1. min_sync_replication_clients = N
>
> where N is the number of reports for a given transaction before it's
> released as committed synchronously. 0 means completely asynchronous,
> the value is maximized by the value of max_wal_senders. Anything
> in between 0 and max_wal_senders means different levels of partially
> synchronous replication.
>
> 2. strict_sync_replication = boolean
>
> where the expected number of synchronous reports from standby
> servers is further limited to the actual number of connected synchronous
> standby servers if the value of this GUC is false. This means that if
> no standby servers are connected yet then the replication is asynchronous
> and transactions are allowed to finish without waiting for synchronous
> reports. If the value of this GUC is true, then transactions wait until
> enough synchronous standbys connect and report back.
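
As I read the description of these two GUCs, the number of reports a
commit waits for would be computed roughly as below. This is my
interpretation written as a Python sketch, not code from the patch;
required_sync_reports is a hypothetical helper.

```python
def required_sync_reports(min_sync_replication_clients,
                          strict_sync_replication,
                          connected_sync_standbys,
                          max_wal_senders):
    """Number of standby reports a commit waits for, per the description."""
    # The setting cannot exceed max_wal_senders.
    required = min(min_sync_replication_clients, max_wal_senders)
    if not strict_sync_replication:
        # Non-strict: never wait for more standbys than are connected.
        required = min(required, connected_sync_standbys)
    return required
```

With no standby connected, the non-strict setting yields 0 (asynchronous
commit), while the strict setting still yields N, so the commit waits.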

Why are these options necessary?

Can these options cover more than three synchronization levels?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
