Re: Support for N synchronous standby servers - take 2

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: sawada(dot)mshk(at)gmail(dot)com, masao(dot)fujii(at)gmail(dot)com, michael(dot)paquier(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, thom(at)linux(dot)com, memissemerson(at)gmail(dot)com, josh(at)agliodbs(dot)com, amit(dot)kapila16(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2016-04-04 09:03:41
Message-ID: 20160404.180341.148979338.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hello, thank you for testing.

At Sat, 2 Apr 2016 14:20:55 +1300, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote in <CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw(at)mail(dot)gmail(dot)com>
> >>> Attached latest patch incorporate some review comments so far, and is
> >>> rebased against current HEAD.
> >>>
> >>
> >> Sorry I attached wrong patch.
> >> Attached patch is correct patch.
> >>
> >> [mulit_sync_replication_v21.patch]
> >
> > Here are some TPS numbers from some quick tests I ran on a set of
> > Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
> > as primary + 3 standbys, to try out different combinations of
> > synchronous_commit levels and synchronous_standby_names numbers. They
> > were run for a short time only and these are of course systems with
> > limited and perhaps uneven IO and CPU, but they still give some idea
> > of the trends. And reassuringly, the trends are travelling in the
> > expected directions.
> >
> > All default settings except shared_buffers = 1GB, and the GUCs
> > required for replication.
> >
> > pgbench postgres -j2 -c2 -N bench2 -T 600
> >
> > 1(*) 2(*) 3(*)
> > ==== ==== ====
> > off = 4056 4096 4092
> > local = 1323 1299 1312
> > remote_write = 1130 1046 958
> > on = 860 744 701
> > remote_apply = 785 725 604
> >
> > pgbench postgres -j16 -c16 -N bench2 -T 600
> >
> > 1(*) 2(*) 3(*)
> > ==== ==== ====
> > off = 3952 3943 3933
> > local = 2964 2984 3026
> > remote_write = 2790 2724 2675
> > on = 2731 2627 2523
> > remote_apply = 2627 2501 2432
> >
> > One thing I noticed is that there are LOG messages telling me when a
> > standby becomes a synchronous standby, but nothing to tell me if a
> > standby stops being a standby (ie because a higher priority one has
> > taken its place in the quorum). Would that be interesting?

A walsender exits by proc_exit() for any operational
termination so wrapping proc_exit() should work. (Attached file 1)

For the setting "2(Sby1, Sby2, Sby3)", the master says that all
of the standbys are sync-standbys and no message is emited on
failure of Sby1, which should cause a promotion of Sby3.

> standby "Sby3" is now the synchronous standby with priority 3
> standby "Sby2" is now the synchronous standby with priority 2
> standby "Sby1" is now the synchronous standby with priority 1
..<Sby 1 failure>
> standby "Sby3" is now the synchronous standby with priority 3

Sby3 becomes sync standby twice:p

This was a behavior taken over from the single-sync-rep era but
it should be confusing for the new sync-rep selection mechanism.
The second attached diff makes this as the following.

> 17:48:21.969 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:48:23.087 LOG: standby "Sby2" is now a synchronous standby with priority 2
> 17:48:25.617 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:48:31.990 LOG: standby "Sby3" is now a potential synchronous standby with priority 3
> 17:48:43.905 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:49:10.262 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:49:13.865 LOG: standby "Sby3" is now a potential synchronous standby with priority 3

Since this status check is taken place for every reply from
stanbys, the message of downgrading to "potential" may be
diferred or even fail to occur but it should be no problem.

Applying the both of the above patches, the message would be like
the following.

> 17:54:08.367 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:54:08.564 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:54:08.565 LOG: standby "Sby2" is now a synchronous standby with priority 2
> 17:54:18.387 LOG: standby "Sby3" is now a potential synchronous standby with priority 3
> 17:54:28.887 LOG: synchronous standby "Sby1" with priority 1 exited
> 17:54:31.359 LOG: standby "Sby3" is now a synchronous standby with priority 3
> 17:54:39.008 LOG: standby "Sby1" is now a synchronous standby with priority 1
> 17:54:41.382 LOG: standby "Sby3" is now a potential synchronous standby with priority 3

Does this make sense?

By the way, Sawada-san, you have changed the parentheses for the
priority method from '[]' to '()'. And I mistankenly defined
s_s_names as '2[Sby1, Sby2, Sby3]' and got wrong behavior, that
is, only Sby2 is registed as mandatory synchronous standby.

For this case, the tree members of SyncRepConfig are '2[Sby1,',
'Sby2', "Sby3]'. This syntax is valid for the current
specification but will surely get different meaning by the future
changes. We should refuse this known-to-be-wrong-in-future syntax
from now.

And, this error was very hard to know. pg_setting only shows the
string itself

=# select name, setting from pg_settings where name = 'synchronous_standby_names';
name | setting
---------------------------+---------------------
synchronous_standby_names | 2[Sby1, Sby2, Sby3]
(1 row)

Since the sintax is no longer so simple, we may need some means
to see the current standby-group setting clearly, but it wont'be
if refusing the known....-future syntax now.

> > Also, I spotted some tiny mistakes:
> >
> > + <indexterm zone="high-availability">
> > + <primary>Dedicated language for multiple synchornous replication</primary>
> > + </indexterm>
> >
> > s/synchornous/synchronous/
> >
> > + /*
> > + * If we are managing the sync standby, though we weren't
> > + * prior to this, then announce we are now the sync standby.
> > + */
> >
> > s/ the / a / (two occurrences)
> >
> > + ereport(LOG,
> > + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
> > + application_name, MyWalSnd->sync_standby_priority)));
> >
> > s/ the / a /
> >
> > offered by a transaction commit. This level of protection is referred
> > - to as 2-safe replication in computer science theory.
> > + to as 2-safe replication in computer science theory, and group-1-safe
> > + (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
> > + more than <literal>remote_write</>.
> >
> > Why "more than"? I think those two words should be changed to "at
> > least", or removed.
> >
> > + <para>
> > + This syntax allows us to define a synchronous group that will wait for at
> > + least N standbys of them, and a comma-separated list of group
> > members that are surrounded by
> > + parantheses. The special value <literal>*</> for server name
> > matches any standby.
> > + By surrounding list of group members using parantheses,
> > synchronous standbys are chosen from
> > + that group using priority method.
> > + </para>
> >
> > s/parantheses/parentheses/ (two occurrences)
> >
> > + <sect2 id="dedicated-language-for-multi-sync-replication-priority">
> > + <title>Prioirty Method</title>
> >
> > s/Prioirty Method/Priority Method/
>
> A couple more comments:
>
> /*
> - * If we aren't managing the highest priority standby then just leave.
> + * If the number of sync standbys is less than requested or we aren't
> + * managing the sync standby then just leave.
> */
> - if (syncWalSnd != MyWalSnd)
> + if (!got_oldest || !am_sync)
>
> s/ the sync / a sync /
>
> + /*
> + * Consider all pending standbys as sync if the number of them plus
> + * already-found sync ones is lower than the configuration requests.
> + */
> + if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
> + return list_concat(result, pending);
>
> The cells from 'pending' will be attached to 'result', and 'result'
> will be freed by the caller. But won't the List header object from
> 'pending' be leaked?
>
> + result = lappend_int(result, i);
> + if (list_length(result) == SyncRepConfig->num_sync)
> + {
> + list_free(pending);
> + return result; /* Exit if got enough sync standbys */
> + }
>
> If we didn't take the early return in the list-not-long-enough case
> mentioned above, we should *always* exit via this return statement,
> right? Since we know that the pending list had enough elements to
> reach num_sync. I think that is worth a comment, and also a "not
> reached" comment at the bottom of the function, if it is true.
>
> As a future improvement, I wonder if we could avoid recomputing the
> current set of sync standbys in every walsender every time we call
> SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
> in shmem when walsender states change etc.
>
> I don't have any other comments, other than to say: thank you to all
> the people who have contributed to this feature so far and I really
> really hope it goes into 9.6!

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
multi_sync_exit_msg.diff text/x-patch 3.1 KB
multi_sync_potential_msg.diff text/x-patch 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Emre Hasegeli 2016-04-04 09:19:47 Re: WIP: Access method extendability
Previous Message Abhijit Menon-Sen 2016-04-04 08:59:20 Re: Support for N synchronous standby servers - take 2