Re: Support for N synchronous standby servers - take 2

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Beena Emerson <memissemerson(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2016-02-05 09:19:24
Message-ID: CAD21AoA9UqcbTnDKi0osd0yhN4FPgTrg6wuZeTtvpSYy2LqL5Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 5, 2016 at 5:36 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Feb 4, 2016 at 11:06 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Thu, Feb 4, 2016 at 10:49 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Thu, Feb 4, 2016 at 10:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> On Thu, Feb 4, 2016 at 2:21 PM, Michael Paquier
>>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>>>> Yes, please let's use the custom language, and let's not care of not
>>>>> more than 1 level of nesting so as it is possible to represent
>>>>> pg_stat_replication in a simple way for the user.
>>>>
>>>> "not" is used twice in this sentence in a way that renders me not able
>>>> to be sure that I'm not understanding it not properly.
>>>
>>> 4 times here. Score beaten.
>>>
>>> Sorry. Perhaps I am tired... I was just wondering if it would be fine
>>> to only support configurations up to one level of nested objects, like
>>> that:
>>> 2[node1, node2, node3]
>>> node1, 2[node2, node3], node3
>>> In short, we could restrict things so as we cannot define a group of
>>> nodes within an existing group.
>>
>> No, actually, that's stupid. Having up to two nested levels makes more
>> sense, a quite common case for this feature being something like that:
>> 2{node1,[node2,node3]}
>> In short, sync confirmation is waited from node1 and (node2 or node3).
>>
>> Flattening groups of nodes with a new catalog will be necessary to
>> ease the view of this data to users:
>> - group name?
>> - array of members with nodes/groups
>> - group type: quorum or priority
>> - number of items to wait for in this group
>
> So, here are some thoughts to make that more user-friendly. I think
> that the critical issue here is to properly flatten the meta data in
> the custom language and represent it properly in a new catalog,
> without messing up too much with the existing pg_stat_replication that
> people are now used to for 5 releases since 9.0. So, I would think
> that we will need to have a new catalog, say
> pg_stat_replication_groups with the following things:
> - One line of this catalog represents the status of a group or of a single node.
> - The status of a node/group is either sync or potential, if a
> node/group is specified more than once, it may be possible that it
> would be sync and potential depending on where it is defined, in which
> case setting its status to 'sync' has the most sense. If it is in sync
> state I guess.
> - Move sync_priority and sync_state, actually an equivalent from
> pg_stat_replication into this new catalog, because those represent the
> status of a node or group of nodes.
> - group name, and by that I think that we had perhaps better make
> mandatory the need to append a name with a quorum or priority group.
> The group at the highest level is forcibly named as 'top', 'main', or
> whatever if not directly specified by the user. If the entry is
> directly a node, use the application_name.
> - Type of group, quorum or priority
> - Elements in this group, an element can be a group name or a node
> name, aka application_name. If group is of type priority, the elements
> are listed in increasing order. So the elements with lower priority
> get first, etc. We could have one column listing explicitly a list of
> integers that map with the elements of a group but it does not seem
> worth it, what users would like to know is what are the nodes that are
> prioritized. This covers the former 'priority' field of
> pg_stat_replication.
>
> We may have a good idea of how to define a custom language, still we
> are going to need to design a clean interface at catalog level more or
> less close to what is written here. If we can get a clean interface,
> the custom language implemented, and TAP tests that take advantage of
> this user interface to check the node/group statuses, I guess that we
> would be in good shape for this patch.
>
> Anyway that's not a small project, and perhaps I am over-complicating
> the whole thing.
>

I agree with adding a new system catalog so that users can easily check
replication status. A group name will be needed for this.
What about appending the group name with ":" immediately after the set of
standbys, as follows?

2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]
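To make the proposed syntax concrete, here is a minimal parsing sketch in Python. It is only an illustration, not the patch's parser: the Group class, the function names, and the error handling are all hypothetical. It records which bracket was written ('[' or '(') without deciding what that bracket means, and it leaves wait_num as None when no count is given.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union

# Tokens: a count, a node/group name, or one punctuation character.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[\[\](),:])")
CLOSERS = {"[": "]", "(": ")"}


@dataclass
class Group:
    """Hypothetical in-memory form of one bracketed group."""
    bracket: str                          # '[' or '(' exactly as written
    members: List[Union[str, "Group"]]    # node names or nested groups
    wait_num: Optional[int] = None        # leading count, if one was given
    name: Optional[str] = None            # trailing ":name", if one was given


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def _is_name(tok):
    return tok[0].isalpha() or tok[0] == "_"


def _parse_element(tokens, pos):
    """Parse one node name or one (possibly counted and named) group."""
    wait_num = None
    if pos < len(tokens) and tokens[pos].isdigit():
        wait_num = int(tokens[pos])
        pos += 1
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok in CLOSERS:
        pos += 1
        members = []
        while True:
            member, pos = _parse_element(tokens, pos)
            members.append(member)
            if pos >= len(tokens):
                raise ValueError("unclosed group")
            if tokens[pos] == ",":
                pos += 1
            elif tokens[pos] == CLOSERS[tok]:
                pos += 1
                break
            else:
                raise ValueError(f"unexpected token {tokens[pos]!r}")
        name = None
        if pos + 1 < len(tokens) and tokens[pos] == ":" and _is_name(tokens[pos + 1]):
            name = tokens[pos + 1]
            pos += 2
        return Group(tok, members, wait_num, name), pos
    if wait_num is not None or not _is_name(tok):
        raise ValueError(f"unexpected token {tok!r}")
    return tok, pos + 1


def parse(text):
    node, pos = _parse_element(tokenize(text), 0)
    if pos != len(tokens := tokenize(text)):
        raise ValueError(f"trailing input starting at token {tokens[pos]!r}")
    return node


top = parse("2[local, 2[london1, london2, london3]:london, (tokyo1, tokyo2):tokyo]")
for member in top.members:
    print(member)
```

The top-level group has no ":name" suffix in the example, so its name stays None here; a default such as "main" would have to be assigned afterwards.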

Also, to show the state of sync replication according to the
configuration, the view I have in mind has the following definition.

=# \d pg_synchronous_replication
     Column     |  Type   | Modifiers
----------------+---------+-----------
 name           | text    |
 sync_type      | text    |
 wait_num       | integer |
 sync_priority  | integer |
 sync_state     | text    |
 member         | text[]  |
 level          | integer |
 write_location | pg_lsn  |
 flush_location | pg_lsn  |
 apply_location | pg_lsn  |

- "name" : node name or group name, or "main" meaning the top-level node.
- "sync_type" : 'priority' or 'quorum' for a group node, otherwise NULL.
- "wait_num" : number of nodes/groups to wait for in this group.
- "sync_priority" : priority of the node/group within its group. The "main" node has 0.
    - a standby in a quorum group always has priority 1.
    - a standby in a priority group has a priority according to
      definition order.
- "sync_state" : 'sync', 'potential' or 'quorum'.
    - a standby in a quorum group is always 'quorum'.
    - a standby in a priority group is 'sync' or 'potential'.
- "member" : array of members for a group node, otherwise NULL.
- "level" : nesting level. The "main" node is level 0.
- "write/flush/apply_location" : LSN calculated for the group/node
  according to the configuration.
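The column rules above can be sketched as a small flattening routine in Python. This is only an illustration under assumptions: the dict layout ("name", "type", "wait_num", "members") and the function name flatten are hypothetical, and the sample configuration is a reduced one chosen for the sketch. Using wait_num 0 for a single node follows the sample output in this mail. sync_state and the LSN columns are left out because they depend on the running state of the standbys.

```python
def flatten(group, level=0, priority=0, rows=None):
    """Turn a nested group description into one row per group or node.

    Rules taken from the column descriptions:
    - the top-level group is level 0 and has sync_priority 0;
    - members of a priority group get 1, 2, 3... in definition order;
    - members of a quorum group always get sync_priority 1.
    """
    rows = [] if rows is None else rows
    rows.append({
        "name": group["name"],
        "sync_type": group["type"],
        "wait_num": group["wait_num"],
        "sync_priority": priority,
        "member": [m if isinstance(m, str) else m["name"] for m in group["members"]],
        "level": level,
    })
    for position, member in enumerate(group["members"], start=1):
        child_priority = position if group["type"] == "priority" else 1
        if isinstance(member, str):
            # A single node: no type, no members (NULL in the view).
            rows.append({
                "name": member,
                "sync_type": None,
                "wait_num": 0,   # assumption: 0 for a node, as in the sample output
                "sync_priority": child_priority,
                "member": None,
                "level": level + 1,
            })
        else:
            flatten(member, level + 1, child_priority, rows)
    return rows


# Hypothetical reduced configuration, for illustration only.
config = {
    "name": "main", "type": "priority", "wait_num": 2,
    "members": [
        "local",
        {"name": "tokyo", "type": "quorum", "wait_num": 1,
         "members": ["tokyo1", "tokyo2"]},
    ],
}

for row in flatten(config):
    print(row)
```

Rows come out in depth-first definition order, so each group is immediately followed by its members.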

When sync replication is set as above, the new system view shows:

=# select * from pg_stat_replication_group;
  name   | sync_type | wait_num | sync_priority | sync_state |          member           | level | write_location | flush_location | apply_location
---------+-----------+----------+---------------+------------+---------------------------+-------+----------------+----------------+----------------
 main    | priority  |        2 |             0 | sync       | {local,london,tokyo}      |     0 |                |                |
 local   |           |        0 |             1 | sync       |                           |     1 |                |                |
 london  | quorum    |        2 |             2 | potential  | {london1,london2,london3} |     1 |                |                |
 london1 |           |        0 |             1 | potential  |                           |     2 |                |                |
 london2 |           |        0 |             2 | potential  |                           |     2 |                |                |
 london3 |           |        0 |             3 | potential  |                           |     2 |                |                |
 tokyo   | quorum    |        1 |             3 | potential  | {tokyo1,tokyo2}           |     1 |                |                |
 tokyo1  |           |        0 |             1 | quorum     |                           |     2 |                |                |
 tokyo2  |           |        0 |             1 | quorum     |                           |     2 |                |                |
(9 rows)

Thoughts?

Regards,

--
Masahiko Sawada
