Re: Support for N synchronous standby servers

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers
Date: 2014-09-11 03:40:08
Message-ID: CAB7nPqSyaBh5wvXwkuA34NSf0uTXwjD=hqfZcfMt9Wgd4RJuwg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 08/28/2014 10:10 AM, Michael Paquier wrote:
>>
>> + #synchronous_standby_num = -1 # number of standby servers using sync rep
>
>
> To be honest, that's a horrible name for the GUC. Back when synchronous
> replication was implemented, we had looong discussions on this feature. It
> was called "quorum commit" back then. I'd suggest using the "quorum" term in
> this patch, too, that's a fairly well-known term in distributed computing
> for this.
I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit

> When synchronous replication was added, quorum was left out to keep things
> simple; the current feature set was the most we could all agree on to be
> useful. If you search the archives for "quorum commit" you'll see what I
> mean. There was a lot of confusion on what is possible and what is useful,
> but regarding this particular patch: people wanted to be able to describe
> more complicated scenarios. For example, imagine that you have a master and
> two standbys in the primary data center, and two more standbys in a
> different data center. It should be possible to specify that you must get
> acknowledgment from at least one standby in both data centers. Maybe you
> could hack that by giving the standbys in the same data center the same
> name, but it gets ugly, and it still won't scale to even more complex
> scenarios.

Currently two nodes can only have the same priority if they have the
same application_name, so we could for example add a new connstring
parameter called, let's say, application_group, to define groups of
nodes that share the same priority (if a node does not define
application_group, it defaults to application_name; if application_name
is NULL it cannot be a sync candidate anyway, so that case does not
matter much). That's a first idea that we could use to control groups
of nodes. We could then switch syncrep.c to match s_s_names against
application_group instead of application_name. That would be
backward-compatible, and would open the door for more improvements
around quorum commit as we could control groups of nodes. This is a
super-set of what application_name can already do, but it still makes
it easy to identify the individual nodes of a given data center and how
far behind they are in replication, so I think that this would be
really user-friendly. An idea similar to that would also be base work
for the next thing... See below.
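
To make that slightly more concrete, here is a rough sketch of the
configuration this could lead to (application_group is purely
hypothetical at this point, nothing implements it yet):

# recovery.conf on one standby of the 2nd data center
# application_group is the hypothetical new connstring parameter
primary_conninfo = 'host=master application_name=node_c1_a application_group=node_center1'

# postgresql.conf on the master, matching on the group name
synchronous_standby_names = 'node_local,node_center1,node_center2'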

Now, in your case the two nodes in the second data center need to have
the same priority either way, and with this patch you can achieve that
by giving them the same node name. Where things are not that great with
this patch is a case like this though:
- 5 slaves: 1 alongside the master (node_local), 2 in a 2nd data center
(node_center1), and 2 in a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'

In this case the nodes have the following priorities:
- node_local => 1
- the 2 nodes with node_center1 => 2
- the 2 nodes with node_center2 => 3
In this {1,2,2,3,3} scheme, the patch makes the system wait for
node_local and the two nodes in node_center1, without caring about the
ones in node_center2, as it picks up only the s_s_num nodes with the
lowest priority values. If the user expects the system to wait for a
node in node_center2 as well, he'll be disappointed. That's perhaps
where we could improve things, by adding an extra parameter to control
how the priority ranks are used, say synchronous_priority_check:
- [absolute|individual]: wait for the first s_s_num nodes with the
lowest priority values; in this case we'd wait for {1,2,2}
- group: wait for only one node per priority value among the lowest
s_s_num priority values; here we'd wait for {1,2,3}
Note that we may not even need this parameter if we assume by default
that we wait for only one node per group sharing the same priority.
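
As an illustration with the configuration above, this hypothetical
parameter would be used like this (only a sketch; s_s_num exists only
in this patch and synchronous_priority_check not at all):

# postgresql.conf on the master
synchronous_standby_names = 'node_local,node_center1,node_center2'
synchronous_standby_num = 3
synchronous_priority_check = group      # wait for {1,2,3}, one node per group
#synchronous_priority_check = absolute  # wait for {1,2,2}, strict priority order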

> Maybe that's OK - we don't necessarily need to solve all scenarios at once.
> But it's worth considering.

Parametrizing this and covering user expectations is tricky. Either
way not everybody can be happy :) There are even people unhappy now
because we can only define a single sync node.

> BTW, how does this patch behave if there are multiple standbys connected
> with the same name?

All the nodes have the same priority. For example, in the case of a
cluster with 5 slaves having the same application_name and s_s_num = 3,
the first three nodes found when scanning the WAL sender array need to
acknowledge the commit before the master commits locally:
=# show synchronous_standby_num ;
synchronous_standby_num
-------------------------
3
(1 row)
=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------
node
(1 row)
=# SELECT application_name, client_port,
     pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
     sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
application_name | client_port | replay_delta | sync_priority | sync_state
------------------+-------------+--------------+---------------+------------
node | 50251 | 0 | 1 | sync
node | 50252 | 0 | 1 | sync
node | 50253 | 0 | 1 | sync
node | 50254 | 0 | 1 | potential
node | 50255 | 0 | 1 | potential
(5 rows)

After writing this long message, and thinking more about that, I kind
of like the group approach. Thoughts welcome.
Regards,
--
Michael
