Re: Support for N synchronous standby servers - take 2

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2016-03-16 01:13:48
Message-ID: CAEepm=3Ye+Ax_5=MZeHMkx9DFn25QoRzs362sQGNvGcVWx+18w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

<para>
Synchronous replication offers the ability to confirm that all changes
- made by a transaction have been transferred to one synchronous standby
- server. This extends the standard level of durability
+ made by a transaction have been transferred to one or more
synchronous standby
+ server. This extends that standard level of durability
offered by a transaction commit. This level of protection is referred
- to as 2-safe replication in computer science theory.
+ to as group-safe replication in computer science theory.
</para>

A message on the -general list today pointed me to some earlier
discussion[1] which quoted and referenced definitions of these
academic terms[2]. I think the above documentation should say:

"This level of protection is referred to as 2-safe replication in
computer science literature when <variable>synchronous_commit</> is
set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when
<variable>synchronous_commit</> is set to <literal>remote_write</>."

By my reading, the situation doesn't actually change with this patch.
It doesn't matter whether you need 1 or 42 synchronous standbys to
make a quorum: 2-safe means durable (fsync) on all of them,
group-1-safe means durable on one server and received (implied by
remote_write) by all of them.

I think we should be using those definitions because Gray's earlier
definition of 2-safe from Transaction Processing 12.6.3 doesn't really
fit: It can optionally mean remote receipt or remote durable storage,
but it doesn't wait if the 'backup' is down, so it's not the same type
of guarantee. (He also has 'very safe' which might describe our
syncrep, I'm not sure.)

[1] http://www.postgresql.org/message-id/603c8f070812132142n5408e7ddk899e83cddd4cb0b2@mail.gmail.com
[2] http://infoscience.epfl.ch/record/33053/files/EPFL_TH2577.pdf page 76

On Thu, Mar 10, 2016 at 11:21 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>> Hi,
>>>
>>> Thank you so much for reviewing this patch!
>>>
>>> All review comments regarding document and comment are fixed.
>>> Attached latest v14 patch.
>>>
>>>> This accepts 'abc^Id' as a name, which is wrong behavior (but
>>>> such appliction names are not allowed anyway. If you assume so,
>>>> I'd like to see a comment for that.).
>>>
>>> 'abc^Id' is accepted as application_name, no?
>>> postgres(1)=# set application_name to 'abc^Id';
>>> SET
>>> postgres(1)=# show application_name ;
>>> application_name
>>> ------------------
>>> abc^Id
>>> (1 row)
>>>
>>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
>>>> char ychar) requires differnt character types. Is there any reason
>>>> for that?
>>>
>>> Because addlit_xd_string() is for adding string(char *) to xd_string,
>>> OTOH addlit_xd_char() is for adding just one character to xd_string.
>>>
>>>> I personally don't like addlit*string() things for such simple
>>>> syntax but itself is acceptble enough for me. However it uses
>>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
>>>> of memory chunk for every double-quoted name. The chunks are
>>>> finally stacked up left uncollected until the current
>>>> memorycontext is deleted or reset (It is deleted just after
>>>> finishing config file processing). Addition to that, setting
>>>> s_s_names runs the parser twice. It seems to me too greedy and
>>>> seems that static char [NAMEDATALEN] is enough using the v12 way
>>>> without palloc/repalloc.
>>>
>>> I though that length of group name could be more than NAMEDATALEN, so
>>> I use StringInfo.
>>> Is it not necessary?
>>>
>>>> I found that the name SyncGroupName.wait_num is not
>>>> instinctive. How about sync_num, sync_member_num or
>>>> sync_standby_num? If the last is preferable, .members also should
>>>> be .standbys .
>>>
>>> Thanks, sync_num is preferable to me.
>>>
>>> ===
>>>> I am quite uncomfortable with the existence of
>>>> WanSnd.sync_standby_priority. It represented the pirority in the
>>>> old linear s_s_names format but nested groups or even
>>>> single-level quarum list obviously doesn't fit it. Can we get rid
>>>> of sync_standby_priority, even though we realize atmost
>>>> n-priority for now?
>>>
>>> We could get rid of sync_standby_priority.
>>> But if so, we will not be able to see the next sync standby in
>>> pg_stat_replication system view.
>>> Regarding each node priority, I was thinking that standbys in quorum
>>> list have same priority, and in nested group each standbys are given
>>> the priority starting from 1.
>>>
>>> ===
>>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
>>>> have specific code for every prioritizing method (which are
>>>> priority, quorum, nested and so). Is there any reson to use it as
>>>> a callback of SyncGroupNode?
>>>
>>> The reason why the current code is so is that current code is for only
>>> priority method supporting.
>>> At first version of this feature, I'd like to implement it more simple.
>>>
>>> Aside from this, of course I'm planning to have specific code for nested design.
>>> - The group can have some name nodes or group nodes.
>>> - The group can use either 2 types of method: priority or quorum.
>>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
>>> - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
>>> at that moment using group's method.
>>> - SyncRepGetStandbysFn() function returns standbys of its group,
>>> which are considered as sync using group's method.
>>>
>>> For example, s_s_name = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
>>> memory structure will be,
>>>
>>> "main(quorum)" --- "a"
>>> |
>>> -- "b"
>>> |
>>> -- "group1(priority)" --- "c"
>>> |
>>> -- "d"
>>>
>>> When determine synced LSNs, we need to consider group1's LSN using by
>>> priority method at first, and then we can determine main's LSN using
>>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
>>> So SyncRepGetSyncedLsnsUsingPriority() function would be,
>>>
>>> bool
>>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
>>> {
>>> sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
>>>
>>> if (sync_num < group->sync_num)
>>> return false;
>>>
>>> for (each member of sync_list)
>>> {
>>> if (member->type == group node)
>>> call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
>>> f into lsn_list.
>>> else
>>> Store name node LSNs into lsn_list.
>>> }
>>>
>>> Determine synced LSNs of this group using lsn_list and priority method.
>>> Store synced LSNs into write_lsn and flush_lsn.
>>> return true;
>>> }
>>>
>>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
>>>> other related functions are defined in syncgroup_gram.y. It would
>>>> be better to place them together.
>>>
>>> SyncRepClearStandbyGroupList() is used by
>>> check_synchronous_standby_names(), so I put this function syncrep.c.
>>>
>>>> SyncRepStandbys are to be in multilevel and the struct is
>>>> naturally allowed to be so but SyncRepClearStandbyGroupList
>>>> assumes it in single level.
>>>
>>> Because I think that we don't need to implement to fully support
>>> nested style at first version.
>>> We have to carefully design this feature while considering
>>> expandability, but overkill implementation could be cause of crash.
>>> Consider remaining time for 9.6, I feel we could implement quorum
>>> method at best.
>>>
>>>> This is a comment from the aspect of abstractness of objects.
>>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
>>>> the inside of SyncGroupNode but what the function should just
>>>> return seems to be the list of wansnds element. Element number is
>>>> useless when the SyncGroupNode nests.
>>>> > int
>>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
>>>> This might need to expose 'volatile WalSnd*' (only pointer type)
>>>> outside of walsender.
>>>> Or it should return the list of index number of
>>>> *WalSndCtl->walsnds*.
>>>
>>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
>>> index number of "WalSndCtl->walsnd" as sync_list, no?
>>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
>>> inside of SyncGroupNode in my design.
>>> Selecting sync nodes from its group doesn't depend on the type of node.
>>> What SyncRepGetSyncStandbyFn() should do is to select sync node from
>>> *its* group.
>>>
>>
>> Previous patch has bug around GUC parameter handling.
>> Attached updated version.
>
> Thanks for updating the patch!
>
> Now I'm fixing some problems (e.g., current patch doesn't work
> with EXEC_BACKEND environment) and revising the patch.
> I will post the revised version this weekend or the first half
> of next week.
>
> Regards,
>
> --
> Fujii Masao

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2016-03-16 01:23:53 Re: Parallel Aggregate
Previous Message Robert Haas 2016-03-16 01:08:29 Re: Parallel Aggregate