Re: Support for N synchronous standby servers - take 2

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, sawada(dot)mshk(at)gmail(dot)com, thom(at)linux(dot)com, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, memissemerson(at)gmail(dot)com, Josh Berkus <josh(at)agliodbs(dot)com>, amit(dot)kapila16(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2016-02-09 04:31:46
Message-ID: CAB7nPqSJgDLLsVk_Et-O=NBfJNqx3GbHszCYGvuTLRxHaZV3xQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 9, 2016 at 1:16 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hello,
>
> At Tue, 9 Feb 2016 00:48:57 +0900, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote in <CAHGQGwHnTKmd90Vu19Swu0C+2mnWxvAH=1FE=-xUbo3s94pRRg(at)mail(dot)gmail(dot)com>
>> On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> > On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
>> > <michael(dot)paquier(at)gmail(dot)com> wrote:
>> >> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>> >> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> >>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> >>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>> >>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> >>>>> Yes, please let's use the custom language, and let's not care of not
>> >>>>> more than 1 level of nesting so as it is possible to represent
>> >>>>> pg_stat_replication in a simple way for the user.
>> >>>>
>> >>>> "not" is used twice in this sentence in a way that renders me not able
>> >>>> to be sure that I'm not understanding it not properly.
>> >>>
>> >>> 4 times here. Score beaten.
>> >>>
>> >>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>> >>> to only support configurations up to one level of nested objects, like
>> >>> that:
>> >>> 2[node1, node2, node3]
>> >>> node1, 2[node2, node3], node3
>> >>> In short, we could restrict things so as we cannot define a group of
>> >>> nodes within an existing group.
>> >>
>> >> No, actually, that's stupid. Having up to two nested levels makes more
>> >> sense, a quite common case for this feature being something like that:
>> >> 2{node1,[node2,node3]}
>> >> In short, sync confirmation is waited from node1 and (node2 or node3).
>> >>
>> >> Flattening groups of nodes with a new catalog will be necessary to
>> >> ease the view of this data to users:
>> >> - group name?
>> >> - array of members with nodes/groups
>> >> - group type: quorum or priority
>> >> - number of items to wait for in this group
>> >
>> > So, here are some thoughts to make that more user-friendly. I think
>> > that the critical issue here is to properly flatten the meta data in
>> > the custom language and represent it properly in a new catalog,
>> > without messing up too much with the existing pg_stat_replication that
>> > people are now used to for 5 releases since 9.0. So, I would think
>> > that we will need to have a new catalog, say
>> > pg_stat_replication_groups with the following things:
>> > - One line of this catalog represents the status of a group or of a single node.
>> > - The status of a node/group is either sync or potential, if a
>> > node/group is specified more than once, it may be possible that it
>> > would be sync and potential depending on where it is defined, in which
>> > case setting its status to 'sync' has the most sense. If it is in sync
>> > state I guess.
>> > - Move sync_priority and sync_state, actually an equivalent from
>> > pg_stat_replication into this new catalog, because those represent the
>> > status of a node or group of nodes.
>> > - group name, and by that I think that we had perhaps better make
>> > mandatory the need to append a name with a quorum or priority group.
>> > The group at the highest level is forcibly named as 'top', 'main', or
>> > whatever if not directly specified by the user. If the entry is
>> > directly a node, use the application_name.
>> > - Type of group, quorum or priority
>> > - Elements in this group, an element can be a group name or a node
>> > name, aka application_name. If group is of type priority, the elements
>> > are listed in increasing order. So the elements with lower priority
>> > get first, etc. We could have one column listing explicitly a list of
>> > integers that map with the elements of a group but it does not seem
>> > worth it, what users would like to know is what are the nodes that are
>> > prioritized. This covers the former 'priority' field of
>> > pg_stat_replication.
>> >
>> > We may have a good idea of how to define a custom language, still we
>> > are going to need to design a clean interface at catalog level more or
>> > less close to what is written here. If we can get a clean interface,
>> > the custom language implemented, and TAP tests that take advantage of
>> > this user interface to check the node/group statuses, I guess that we
>> > would be in good shape for this patch.
>> >
>> > Anyway that's not a small project, and perhaps I am over-complicating
>> > the whole thing.
>> >
>> > Thoughts?
>>
>> I agree that we would need something like such new view in the future,
>> however it seems too late to work on that for 9.6 unfortunately.
>> There is only one CommitFest left. Let's focus on very simple case, i.e.,
>> 1-level priority list, now, then we can extend it to cover other cases.
>>
>> If we can commit the simple version too early and there is enough
>> time before the date of feature freeze, of course I'm happy to review
>> the extended version like you proposed, for 9.6.
>
> I agree to Fujii-san. There would be many of convenient gadgets
> around this and they are completely welcome, but having
> fundamental functionality in 9.6 seems to be far benetifical for
> most of us.

Hm. Rushing features in because we need them now is not really
community-like. I'd rather not have us taking decisions like that
knowing that we may pay a certain price in the long-term, while it
pays in the short term, aka the 9.6 release. However, having a base in
place for the mini-language would give enough room for future
improvements, so I am fine with having only 1-level of nesting, with
{} and [] supported. This can as well be simply represented within
pg_stat_replication because we'd have basically only one group of
nodes for now (if I got the idea correctly), the and status of each
entry in pg_stat_replication would just need to reflect either
potential or sync, which is something that now users are used to.

So, if I got the vibe correctly, we would basically just allow that in
a first shot:
N{node_list}, to define a priority group
N[node_list], to define a quorum group
There can be only one group, and elements in a node list cannot be a
group. No need of group names either.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-02-09 04:38:35 Re: Use %u to print user mapping's umid and userid
Previous Message Ashutosh Bapat 2016-02-09 04:22:15 Re: Use %u to print user mapping's umid and userid