Re: Support for N synchronous standby servers - take 2

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2015-12-17 18:38:44
Message-ID: CAD21AoAqaZOHqE9_ALhQ7D3NStJmwDg1MK91opYDiDDd4r7HUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Mon, Dec 14, 2015 at 2:57 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Thank you for the new patch.
>
> At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoDcn1fToCcYRqpU6fMY1xnpDdAKDTcbhW1R9M1mPM0kZg(at)mail(dot)gmail(dot)com>
>> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> > I agree with #3 way and the s_s_name format you suggested.
>> > I think that It's extensible and is tolerable for future changes.
>> > I'm going to implement the patch based on this idea if other hackers
>> > agree with this design.
>>
>> Please find the attached draft patch which supports multi sync replication.
>> This patch adds a GUC parameter synchronous_replication_method, which
>> represent the method of synchronous replication.
>>
>> [Design of replication method]
>> synchronous_replication_method has two values; 'priority' and
>> '1-priority' for now.
>> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the future.
>>
>> * s_r_method = '1-priority'
>> This method is for backward compatibility, so the syntax of s_s_names
>> is same as today.
>> The behavior is same as well.
>>
>> * s_r_method = 'priority'
>> This method is for multiple synchronous replication using priority method.
>> The syntax of s_s_names is,
>> <number of sync standbys>, <standby name> [, ...]
>
> Is there anyone opposed to this?
>
>> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
>> node3' means that the master waits for acknowledge from at least 2
>> lowest priority servers.
>> If 4 standbys(node1 - node4) are available, the master server waits
>> acknowledge from 'node1' and 'node2.
>> The each status of wal senders are;
>>
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> ------------------+------------
>> node1 | sync
>> node2 | sync
>> node3 | potential
>> node4 | async
>> (4 rows)
>>
>> After 'node2' crashed, the master will wait for acknowledge from
>> 'node1' and 'node3'.
>> The each status of wal senders are;
>>
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> ------------------+------------
>> node1 | sync
>> node3 | sync
>> node4 | async
>> (3 rows)
>>
>> [Changing replication method]
>> When we want to change the replication method, we have to change the
>> s_r_method at first, and then do pg_reload_conf().
>> After changing replication method, we can change the s_s_names.

Thank you for reviewing the patch!
Please find attached latest patch.

> Mmm. I should be able to be changed at once, because s_r_method
> and s_s_names contradict each other during the intermediate
> state.

Sorry to confuse you. I meant the case where we want to change the
replication method using ALTER SYSTEM.

>> [Expanding replication method]
>> If we want to expand new replication method additionally, we need to
>> implement two functions for each replication method:
>> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
>> This function obtains the list of standbys considered as synchronous
>> at that time, and return its length.
>> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>> This function obtains LSNs(write, flush) considered as synced.
>>
>> Also, this patch debug code is remain yet, you can debug this behavior
>> using by enable DEBUG_REPLICATION macro.
>>
>> Please give me feedbacks.
>
> I haven't looked into this fully (sorry) but I'm concerned about
> several points.
>
>
> - I feel that some function names looks too long. For example
> SyncRepGetSynchronousStandbysOnePriority occupies more than the
> half of a line. (However, the replication code alrady has many
> long function names..)

Yeah, it would be better to change 'Synchronous' to 'Sync' at least.

> - The comment below of SyncRepGetSynchronousStandbyOnePriority,
> > /* Find lowest priority standby */
>
> The code where the comment is for is doing the correct
> thing. Howerver, the comment is confusing. A lower priority
> *value* means a higher priority.

Fixed.

> - SyncRepGetSynchronousStandbys checks all if()s even when the
> first one matches. Use switch or "else if" there if you they
> are exclusive each other.

Fixed.

> - Do you intende the DEBUG_REPLICATION code in
> SyncRepGetSynchronousStandbys*() to be the final shape? The
> same code blocks which can work for both method should be in
> their common caller but SyncRepGetSyncLsns*() are
> headache. Although it might need more refactoring, I'm sorry
> but I don't see a desirable shape for now.

I'm not going to DEBUG_REPLICAION code to be the final shape.
These codes are removed from this version patch.

> By the way, palloc(20)/free() in such short term looks
> ineffective.
>
> - SyncRepGetSyncLsnsPriority
>
> For the comment "/* Find lowest XLogRecPtr of both write and
> flush from sync_nodes */", LSN is compared as early or late so
> the comment would be better to be something like "Keep/Collect
> the earliest write and flush LSNs among prioritized standbys".

Fixed.

> And what is more important, this block handles write and flush
> LSN jumbled and it reults in missing the earliest(= most
> delayed) LSN for certain cases. The following is an example.
>
> Standby 1: write LSN = 10, flush LSN = 5
> Standby 2: write LSN = 8 , flush LSN = 6
>
> For this case, finally we get tmp_write = 10 and tmp_flush = 5
> from the current code, where tmp_write has wrong value since
> LSN = 10 has *not* been written yet on standby 2. (the names
> "tmp_*" don't seem appropriate here)
>

You are right.
We have to handle write and flush LSN individually, and to get each lowest LSN.
For example in this case, we have to get write = 8, flush = 5.
I've change the logic so that.

Regards,

--
Masahiko Sawada

Attachment Content-Type Size
000_multi_sync_replication_v3.patch application/octet-stream 17.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2015-12-17 18:43:27 psql - -dry-run option
Previous Message Robert Haas 2015-12-17 18:20:44 Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)