Re: Support for N synchronous standby servers - take 2

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: sawada(dot)mshk(at)gmail(dot)com
Cc: masao(dot)fujii(at)gmail(dot)com, memissemerson(at)gmail(dot)com, josh(at)agliodbs(dot)com, robertmhaas(at)gmail(dot)com, amit(dot)kapila16(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2015-11-17 00:57:06
Message-ID: 20151117.095706.240836667.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

At Tue, 17 Nov 2015 01:09:57 +0900, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoDhqGB=EtBfqnkHxR8T53d+8qMs4DPm5HVyq4bA2oR5eQ(at)mail(dot)gmail(dot)com>
> > - Notation
> >
> > synchronous_standby_names, and synchronous_replication_method as
> > a variable to provide other syntax is probably no argument
> > except its name. But I feel synchronous_standby_num looks bit
> > too specific.
> >
> > I'd like to propose if this doesn't reprise the argument on
> > notation for replication definitions:p
> >
> > The following two GUCs would be enough to bear future expansion
> > of notation syntax and/or method.
> >
> > synchronous_standby_names : as it is
> >
> > synchronous_replication_method:
> >
> > default is "1-priority", which means the same with the current
> > meaning. possible additional values so far would be,
> >
> > "n-priority": the format of s_s_names is "n, <name>, <name>, <name>...",
> > where n is the number of required acknowledges.
>
> One question is that what is different between the leading "n" in
> s_s_names and the leading "n" of "n-priority"?

Ah. Sorry for the ambiguous description. 'n' in s_s_names
representing an arbitrary integer number and that in "n-priority"
is literally an "n", meaning "a format with any number of
priority hosts" as a whole. As an instance,

synchronous_replication_method = "n-priority"
synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter"

I added "n-" of "n-priority" to distinguish with "1-priority" so
if we won't provide "1-priority" for backward compatibility,
"priority" would be enough to represent the type.

By the way, s_r_method is not essentially necessary but it would
be important to avoid complexity of autodetection of formats
including currently undefined ones.

> > "n-quorum": the format of s_s_names is the same as above, but
> > it is read in quorum context.

The "n" of this is the same as above.

> > These can be expanded, for example, as follows, but in future.
> >
> > "complex" : Michael's format.
> > "json" : JSON?
> > "json-ext": specify JSON in external file.
> >
> > Even after we have complex notations, I suppose that many use
> > cases are coverd by the first tree notations.
>
> I'm not sure it's desirable to implement the all kind of methods into core.
> I think it's better to extend replication in order to be more
> extensibility like adding hook function.
> And then other approach is implemented as a contrib module.

I agree with you. I proposed the following internal design having
that in mind.

> > - Internal design
> >
> > What should be done in SyncRepReleaseWaiters() is calculating a
> > pair of LSNs that can be regarded as synced and decide whether
> > *this* walsender have advanced the LSN pair, then trying to
> > release backends that wait for the LSNs *if* this walsender has
> > advanced them.
> >
> > From such point, the proposed patch will make redundant trials
> > to release backens.
> >
> > Addition to that, the patch looks to be a mixture of the current
> > implement and the new feature. These are for the same objective
> > so they cannot coexist each other, I think. As the result, codes
> > for both quorum/priority judgement appear at multiple level in
> > call tree. This would be an obstacle for future (possible)
> > expansion.
> >
> > So, I think this feature should be implemented as following,
> >
> > SyncRepInitConfig reads the configuration and stores the result
> > structure into elsewhere such like WalSnd->syncrepset_definition
> > instead of WalSnd->sync_standby_priority, which should be
> > removed. Nothing would be stored if the current wal sender is
> > not a member of the defined replication set. Storing a pointer
> > to matching function there would increase the flexibility but
> > such implement in contrast will make the code difficult to be
> > read.. (I often look for the entity of xlogreader->read_page()
> > ;)
> >
> > Then SyncRepSyncedLsnAdvancedTo() instead of
> > SyncRepGetSynchronousStandbys() returns an LSN pair that can be
> > regarded as 'synced' according to specified definition of
> > replication set and whether this walsender have advanced the
> > LSNs.
> >
> > Finally, SyncRepReleaseWaiters() uses it to release backends if
> > needed.
> >
> > The differences among quorum/priority or others are confined in
> > SyncRepSyncedLsnAdvancedTo(). As the result,
> > SyncRepReleaseWaiters would look as following.
> >
> > | SyncRepReleaseWaiters(void)
> > | {
> > | if (MyWalSnd->syncrepset_definition == NULL || ...)
> > | return;
> > | ...
> > | if (!SyncRepSyncedLsnAdvancedTo(&flush_pos, &write_pos))
> > | {
> > | /* I haven't advanced the synced LSNs */
> > | LWLockRelease(SyncRepLock);
> > | rerturn;
> > | }
> > | /* Set the lsn first so that when we wake backends they will relase...
> >
> > I'm not thought concretely about what SyncRepSyncedLsnAdvancedTo
> > does but perhaps yes we can:p in effective manner..
> >
> > What do you think about this?
>
> I agree with this design.
> What SyncRepSyncedLsnAdvancedTo() does would be different for each
> method, so we can implement "n-priority" style multiple sync
> replication at first version.

Maybe the first *additional* one if we decide to keep backward
compatibility, as the discussion above.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-11-17 01:45:56 Re: Freeze avoidance of very large table.
Previous Message Peter Geoghegan 2015-11-17 00:52:34 Re: Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?