Quick Links

Re: Support for N synchronous standby servers - take 2

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	thomas(dot)munro(at)enterprisedb(dot)com
Cc:	masao(dot)fujii(at)gmail(dot)com, sawada(dot)mshk(at)gmail(dot)com, michael(dot)paquier(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, thom(at)linux(dot)com, memissemerson(at)gmail(dot)com, josh(at)agliodbs(dot)com, amit(dot)kapila16(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Support for N synchronous standby servers - take 2
Date:	2016-03-16 07:48:33
Message-ID:	20160316.164833.188624159.horiguchi.kyotaro@lab.ntt.co.jp
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

It seems to me a matter of definition of "available replicas".

At Wed, 16 Mar 2016 14:13:48 +1300, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote in <CAEepm=3Ye+Ax_5=MZeHMkx9DFn25QoRzs362sQGNvGcVWx+18w(at)mail(dot)gmail(dot)com>
> <para>
> Synchronous replication offers the ability to confirm that all changes
> - made by a transaction have been transferred to one synchronous standby
> - server. This extends the standard level of durability
> + made by a transaction have been transferred to one or more
> synchronous standby
> + server. This extends that standard level of durability
> offered by a transaction commit. This level of protection is referred
> - to as 2-safe replication in computer science theory.
> + to as group-safe replication in computer science theory.
> </para>
>
> A message on the -general list today pointed me to some earlier
> discussion[1] which quoted and referenced definitions of these
> academic terms[2]. I think the above documentation should say:
>
> "This level of protection is referred to as 2-safe replication in
> computer science literature when <variable>synchronous_commit</> is
> set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when
> <variable>synchronous_commit</> is set to <literal>remote_write</>."

I suppose that the "available replica" on the paper is equivalent
to "one choosen synchronous server" at the top of the queue of
living standbys specified by s_s_names. The original description
is true based on this interpretation.

> By my reading, the situation doesn't actually change with this patch.
> It doesn't matter whether you need 1 or 42 synchronous standbys to
> make a quorum: 2-safe means durable (fsync) on all of them,
> group-1-safe means durable on one server and received (implied by
> remote_write) by all of them.

Likewise, "the first two of the living standbys" (2[r01, ..r42])
and the master is translated to "three replicas". So it keeps
2-safe for the case.

> I think we should be using those definitions because Gray's earlier
> definition of 2-safe from Transaction Processing 12.6.3 doesn't really
> fit: It can optionally mean remote receipt or remote durable storage,
> but it doesn't wait if the 'backup' is down, so it's not the same type
> of guarantee. (He also has 'very safe' which might describe our
> syncrep, I'm not sure.)

If the discussion above is true, the description doesn't seem to
need to be amended in the view of the safe-criteria.

> <para>
> Synchronous replication offers the ability to confirm that all changes
> - made by a transaction have been transferred to one synchronous standby
> - server. This extends the standard level of durability
> + made by a transaction have been transferred to one or more synchronous standby
> + server. This extends that standard level of durability
> offered by a transaction commit. This level of protection is referred
> to as 2-safe replication in computer science theory.
> </para>

But some additional explanation might be needed.

For the true quorum commit, a client will be notified when the
master and any n of all standbys have committed. This won't fit
exactly to the criterias in the paper.

In regard to Gray's definition, "2-safe" looks to be PG's syncrep
with automatic release mechanism, such like what pgsql-RA
offers. And "high availability" doesn't seem to fit to
PostgreSQL's behavior because the master virtually commits a
transaction before making an agreement to commit among all of
replicas.

# I'm reading it in Japanese so some words may be incorrect.

Thoughts?

> [1] http://www.postgresql.org/message-id/603c8f070812132142n5408e7ddk899e83cddd4cb0b2@mail.gmail.com
> [2] http://infoscience.epfl.ch/record/33053/files/EPFL_TH2577.pdf page 76
>
> On Thu, Mar 10, 2016 at 11:21 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >>> Hi,
> >>>
> >>> Thank you so much for reviewing this patch!
> >>>
> >>> All review comments regarding document and comment are fixed.
> >>> Attached latest v14 patch.
> >>>
> >>>> This accepts 'abc^Id' as a name, which is wrong behavior (but
> >>>> such appliction names are not allowed anyway. If you assume so,
> >>>> I'd like to see a comment for that.).
> >>>
> >>> 'abc^Id' is accepted as application_name, no?
> >>> postgres(1)=# set application_name to 'abc^Id';
> >>> SET
> >>> postgres(1)=# show application_name ;
> >>> application_name
> >>> ------------------
> >>> abc^Id
> >>> (1 row)
> >>>
> >>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned
> >>>> char ychar) requires differnt character types. Is there any reason
> >>>> for that?
> >>>
> >>> Because addlit_xd_string() is for adding string(char *) to xd_string,
> >>> OTOH addlit_xd_char() is for adding just one character to xd_string.
> >>>
> >>>> I personally don't like addlit*string() things for such simple
> >>>> syntax but itself is acceptble enough for me. However it uses
> >>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes
> >>>> of memory chunk for every double-quoted name. The chunks are
> >>>> finally stacked up left uncollected until the current
> >>>> memorycontext is deleted or reset (It is deleted just after
> >>>> finishing config file processing). Addition to that, setting
> >>>> s_s_names runs the parser twice. It seems to me too greedy and
> >>>> seems that static char [NAMEDATALEN] is enough using the v12 way
> >>>> without palloc/repalloc.
> >>>
> >>> I though that length of group name could be more than NAMEDATALEN, so
> >>> I use StringInfo.
> >>> Is it not necessary?
> >>>
> >>>> I found that the name SyncGroupName.wait_num is not
> >>>> instinctive. How about sync_num, sync_member_num or
> >>>> sync_standby_num? If the last is preferable, .members also should
> >>>> be .standbys .
> >>>
> >>> Thanks, sync_num is preferable to me.
> >>>
> >>> ===
> >>>> I am quite uncomfortable with the existence of
> >>>> WanSnd.sync_standby_priority. It represented the pirority in the
> >>>> old linear s_s_names format but nested groups or even
> >>>> single-level quarum list obviously doesn't fit it. Can we get rid
> >>>> of sync_standby_priority, even though we realize atmost
> >>>> n-priority for now?
> >>>
> >>> We could get rid of sync_standby_priority.
> >>> But if so, we will not be able to see the next sync standby in
> >>> pg_stat_replication system view.
> >>> Regarding each node priority, I was thinking that standbys in quorum
> >>> list have same priority, and in nested group each standbys are given
> >>> the priority starting from 1.
> >>>
> >>> ===
> >>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to
> >>>> have specific code for every prioritizing method (which are
> >>>> priority, quorum, nested and so). Is there any reson to use it as
> >>>> a callback of SyncGroupNode?
> >>>
> >>> The reason why the current code is so is that current code is for only
> >>> priority method supporting.
> >>> At first version of this feature, I'd like to implement it more simple.
> >>>
> >>> Aside from this, of course I'm planning to have specific code for nested design.
> >>> - The group can have some name nodes or group nodes.
> >>> - The group can use either 2 types of method: priority or quorum.
> >>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn()
> >>> - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN
> >>> at that moment using group's method.
> >>> - SyncRepGetStandbysFn() function returns standbys of its group,
> >>> which are considered as sync using group's method.
> >>>
> >>> For example, s_s_name = '3(a, b, 2[c,d]::group1)', SyncRepStandbys
> >>> memory structure will be,
> >>>
> >>> "main(quorum)" --- "a"
> >>> |
> >>> -- "b"
> >>> |
> >>> -- "group1(priority)" --- "c"
> >>> |
> >>> -- "d"
> >>>
> >>> When determine synced LSNs, we need to consider group1's LSN using by
> >>> priority method at first, and then we can determine main's LSN using
> >>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs.
> >>> So SyncRepGetSyncedLsnsUsingPriority() function would be,
> >>>
> >>> bool
> >>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn)
> >>> {
> >>> sync_num = group->SynRepGetSyncstandbysFn(group, sync_list);
> >>>
> >>> if (sync_num < group->sync_num)
> >>> return false;
> >>>
> >>> for (each member of sync_list)
> >>> {
> >>> if (member->type == group node)
> >>> call SyncRepGetSyncedLsnsFn(member, w, f) and store w and
> >>> f into lsn_list.
> >>> else
> >>> Store name node LSNs into lsn_list.
> >>> }
> >>>
> >>> Determine synced LSNs of this group using lsn_list and priority method.
> >>> Store synced LSNs into write_lsn and flush_lsn.
> >>> return true;
> >>> }
> >>>
> >>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the
> >>>> other related functions are defined in syncgroup_gram.y. It would
> >>>> be better to place them together.
> >>>
> >>> SyncRepClearStandbyGroupList() is used by
> >>> check_synchronous_standby_names(), so I put this function syncrep.c.
> >>>
> >>>> SyncRepStandbys are to be in multilevel and the struct is
> >>>> naturally allowed to be so but SyncRepClearStandbyGroupList
> >>>> assumes it in single level.
> >>>
> >>> Because I think that we don't need to implement to fully support
> >>> nested style at first version.
> >>> We have to carefully design this feature while considering
> >>> expandability, but overkill implementation could be cause of crash.
> >>> Consider remaining time for 9.6, I feel we could implement quorum
> >>> method at best.
> >>>
> >>>> This is a comment from the aspect of abstractness of objects.
> >>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care
> >>>> the inside of SyncGroupNode but what the function should just
> >>>> return seems to be the list of wansnds element. Element number is
> >>>> useless when the SyncGroupNode nests.
> >>>> > int
> >>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile WalSnd **sync_list)
> >>>> This might need to expose 'volatile WalSnd*' (only pointer type)
> >>>> outside of walsender.
> >>>> Or it should return the list of index number of
> >>>> *WalSndCtl->walsnds*.
> >>>
> >>> SyncRepGetSyncStandbysUsingPriority() already returns the list of
> >>> index number of "WalSndCtl->walsnd" as sync_list, no?
> >>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the
> >>> inside of SyncGroupNode in my design.
> >>> Selecting sync nodes from its group doesn't depend on the type of node.
> >>> What SyncRepGetSyncStandbyFn() should do is to select sync node from
> >>> *its* group.
> >>>
> >>
> >> Previous patch has bug around GUC parameter handling.
> >> Attached updated version.
> >
> > Thanks for updating the patch!
> >
> > Now I'm fixing some problems (e.g., current patch doesn't work
> > with EXEC_BACKEND environment) and revising the patch.
> > I will post the revised version this weekend or the first half
> > of next week.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Re: Support for N synchronous standby servers - take 2 at 2016-03-16 01:13:48 from Thomas Munro

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ashutosh Bapat	2016-03-16 08:10:40	Re: Re: [COMMITTERS] pgsql: Only try to push down foreign joins if the user mapping OIDs mat
Previous Message	Craig Ringer	2016-03-16 07:47:16	Re: Logical decoding slots can go backwards when used from SQL, docs are wrong