Re: Support for N synchronous standby servers - take 2

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2015-12-17 18:38:44
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Mon, Dec 14, 2015 at 2:57 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Thank you for the new patch.
> At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg(at)mail(dot)gmail(dot)com>
>> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> > I agree with #3 way and the s_s_name format you suggested.
>> > I think that It's extensible and is tolerable for future changes.
>> > I'm going to implement the patch based on this idea if other hackers
>> > agree with this design.
>> Please find the attached draft patch which supports multi sync replication.
>> This patch adds a GUC parameter synchronous_replication_method, which
>> represent the method of synchronous replication.
>> [Design of replication method]
>> synchronous_replication_method has two values; 'priority' and
>> '1-priority' for now.
>> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future.
>> * s_r_method = '1-priority'
>> This method is for backward compatibility, so the syntax of s_s_names
>> is same as today.
>> The behavior is same as well.
>> * s_r_method = 'priority'
>> This method is for multiple synchronous replication using priority method.
>> The syntax of s_s_names is,
>> <number of sync standbys>, <standby name> [, ...]
> Is there anyone opposed to this?
>> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
>> node3' means that the master waits for acknowledge from at least 2
>> lowest priority servers.
>> If 4 standbys(node1 - node4) are available, the master server waits
>> acknowledge from 'node1' and 'node2.
>> The each status of wal senders are;
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> ------------------+------------
>> node1 | sync
>> node2 | sync
>> node3 | potential
>> node4 | async
>> (4 rows)
>> After 'node2' crashed, the master will wait for acknowledge from
>> 'node1' and 'node3'.
>> The each status of wal senders are;
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> ------------------+------------
>> node1 | sync
>> node3 | sync
>> node4 | async
>> (3 rows)
>> [Changing replication method]
>> When we want to change the replication method, we have to change the
>> s_r_method at first, and then do pg_reload_conf().
>> After changing replication method, we can change the s_s_names.

Thank you for reviewing the patch!
Please find attached latest patch.

> Mmm. I should be able to be changed at once, because s_r_method
> and s_s_names contradict each other during the intermediate
> state.

Sorry to confuse you. I meant the case where we want to change the
replication method using ALTER SYSTEM.

>> [Expanding replication method]
>> If we want to expand new replication method additionally, we need to
>> implement two functions for each replication method:
>> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
>> This function obtains the list of standbys considered as synchronous
>> at that time, and return its length.
>> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>> This function obtains LSNs(write, flush) considered as synced.
>> Also, this patch debug code is remain yet, you can debug this behavior
>> using by enable DEBUG_REPLICATION macro.
>> Please give me feedbacks.
> I haven't looked into this fully (sorry) but I'm concerned about
> several points.
> - I feel that some function names looks too long. For example
> SyncRepGetSynchronousStandbysOnePriority occupies more than the
> half of a line. (However, the replication code alrady has many
> long function names..)

Yeah, it would be better to change 'Synchronous' to 'Sync' at least.

> - The comment below of SyncRepGetSynchronousStandbyOnePriority,
> > /* Find lowest priority standby */
> The code where the comment is for is doing the correct
> thing. Howerver, the comment is confusing. A lower priority
> *value* means a higher priority.


> - SyncRepGetSynchronousStandbys checks all if()s even when the
> first one matches. Use switch or "else if" there if you they
> are exclusive each other.


> - Do you intende the DEBUG_REPLICATION code in
> SyncRepGetSynchronousStandbys*() to be the final shape? The
> same code blocks which can work for both method should be in
> their common caller but SyncRepGetSyncLsns*() are
> headache. Although it might need more refactoring, I'm sorry
> but I don't see a desirable shape for now.

I'm not going to DEBUG_REPLICAION code to be the final shape.
These codes are removed from this version patch.

> By the way, palloc(20)/free() in such short term looks
> ineffective.
> - SyncRepGetSyncLsnsPriority
> For the comment "/* Find lowest XLogRecPtr of both write and
> flush from sync_nodes */", LSN is compared as early or late so
> the comment would be better to be something like "Keep/Collect
> the earliest write and flush LSNs among prioritized standbys".


> And what is more important, this block handles write and flush
> LSN jumbled and it reults in missing the earliest(= most
> delayed) LSN for certain cases. The following is an example.
> Standby 1: write LSN = 10, flush LSN = 5
> Standby 2: write LSN = 8 , flush LSN = 6
> For this case, finally we get tmp_write = 10 and tmp_flush = 5
> from the current code, where tmp_write has wrong value since
> LSN = 10 has *not* been written yet on standby 2. (the names
> "tmp_*" don't seem appropriate here)

You are right.
We have to handle write and flush LSN individually, and to get each lowest LSN.
For example in this case, we have to get write = 8, flush = 5.
I've change the logic so that.


Masahiko Sawada

Attachment Content-Type Size
000_multi_sync_replication_v3.patch application/octet-stream 17.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2015-12-17 18:43:27 psql - -dry-run option
Previous Message Robert Haas 2015-12-17 18:20:44 Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)