Re: Support for N synchronous standby servers - take 2

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2016-04-02 01:20:55
Message-ID: CAEepm=2sdDL2hs3XbWb5FORegNHBObLJ-8C2=aaeG-riZTd2Rw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 31, 2016 at 5:11 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Mar 31, 2016 at 3:55 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> On Wed, Mar 30, 2016 at 11:43 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>> On Tue, Mar 29, 2016 at 5:36 PM, Kyotaro HORIGUCHI
>>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>> I personally don't think it needs such a survival measure. It is
>>>> a very small syntax and the parser reads very short text. If the
>>>> parser fails in such a mode, something more serious must have
>>>> occurred.
>>>>
>>>> At Tue, 29 Mar 2016 16:51:02 +0900, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote in <CAHGQGwFth8pnYhaLBx0nF8o4qmwctdzEOcWRqEu7HOwgdJGa3g(at)mail(dot)gmail(dot)com>
>>>>> On Tue, Mar 29, 2016 at 4:23 PM, Kyotaro HORIGUCHI
>>>>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > At Mon, 28 Mar 2016 18:38:22 +0900, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoAJMDV1EUKMfeyaV24arx4pzUjGHYbY4ZxzKpkiCUvh0Q(at)mail(dot)gmail(dot)com>
>>>>> > sawada.mshk> On Mon, Mar 28, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>>>>> >> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>>> > As mentioned in my comment, SQL parser converts yy_fatal_error
>>>>> > into ereport(ERROR), which can be caught by the upper PG_TRY (by
>>>>> > #define'ing fprintf). So it is doable if you mind exit().
>>>>>
>>>>> I'm afraid that your idea doesn't work in postmaster, because ereport(ERROR)
>>>>> is implicitly promoted to ereport(FATAL) there. IOW, when an internal
>>>>> flex fatal error occurs, postmaster just exits instead of jumping out of the parser.
>>>>
>>>> The ERROR may be LOG or DEBUG2 instead, if we think the parser
>>>> fatal errors are recoverable. guc-file.l is doing so.
>>>>
>>>>> ISTM that, when an internal flex fatal error occurs, it's
>>>>> better to elog(FATAL) and terminate the problematic
>>>>> process. This might lead to a server crash (e.g., if
>>>>> postmaster emits a FATAL error, it and all its child processes
>>>>> will exit soon). But we can probably live with this because such a
>>>>> fatal error rarely happens.
>>>>
>>>> I agree with this.
>>>>
>>>>> OTOH, if we make the process keep running even after it gets an internal
>>>>> fatal error (like Sawada's patch or your idea does), this might cause a more
>>>>> serious problem. Please imagine the case where one walsender gets the fatal
>>>>> error (e.g., because of OOM), abandons the new setting value of
>>>>> synchronous_standby_names, and keeps running with the previous setting value,
>>>>> while the other walsender processes successfully parse the setting and
>>>>> keep running with the new setting. In this case, the settings that the
>>>>> walsenders are based on become inconsistent, which will completely mess
>>>>> up synchronous replication.
>>>>
>>>> On the other hand, guc-file.l seems to ignore parser errors under
>>>> normal operation, even though that may cause a similar inconsistency.
>>>>
>>>> | LOG: received SIGHUP, reloading configuration files
>>>> | LOG: input in flex scanner failed at file "/home/horiguti/data/data_work/postgresql.conf" line 1
>>>> | LOG: configuration file "/home/horiguti/data/data_work/postgresql.conf" contains errors; no changes were applied
>>>>
>>>>> Therefore, I think that it's better to make the problematic process exit
>>>>> with FATAL error rather than ignore the error and keep it running.
>>>>
>>>> +1. Restarting the walsender would be far less harmful than keeping
>>>> it running in a doubtful state.
>>>>
>>>> Should I wait for the next version or have a look at the latest?
>>>>
>>>
>>> The attached latest patch incorporates some review comments so far, and
>>> is rebased against current HEAD.
>>>
>>
>> Sorry, I attached the wrong patch.
>> The attached patch is the correct one.
>>
>> [mulit_sync_replication_v21.patch]
>
> Here are some TPS numbers from some quick tests I ran on a set of
> Amazon EC2 m3.large instances ("2 vCPU" virtual machines) configured
> as primary + 3 standbys, to try out different combinations of
> synchronous_commit levels and synchronous_standby_names numbers. They
> were run for a short time only and these are of course systems with
> limited and perhaps uneven IO and CPU, but they still give some idea
> of the trends. And reassuringly, the trends are travelling in the
> expected directions.
>
> All default settings except shared_buffers = 1GB, and the GUCs
> required for replication.
>
> pgbench postgres -j2 -c2 -N bench2 -T 600
>
>                  1(*)   2(*)   3(*)
>                  ====   ====   ====
> off          =   4056   4096   4092
> local        =   1323   1299   1312
> remote_write =   1130   1046    958
> on           =    860    744    701
> remote_apply =    785    725    604
>
> pgbench postgres -j16 -c16 -N bench2 -T 600
>
>                  1(*)   2(*)   3(*)
>                  ====   ====   ====
> off          =   3952   3943   3933
> local        =   2964   2984   3026
> remote_write =   2790   2724   2675
> on           =   2731   2627   2523
> remote_apply =   2627   2501   2432
>
> One thing I noticed is that there are LOG messages telling me when a
> standby becomes a synchronous standby, but nothing to tell me when a
> standby stops being a synchronous standby (i.e. because a higher-priority
> one has taken its place in the quorum). Would that be interesting?
>
> Also, I spotted some tiny mistakes:
>
> + <indexterm zone="high-availability">
> + <primary>Dedicated language for multiple synchornous replication</primary>
> + </indexterm>
>
> s/synchornous/synchronous/
>
> + /*
> + * If we are managing the sync standby, though we weren't
> + * prior to this, then announce we are now the sync standby.
> + */
>
> s/ the / a / (two occurrences)
>
> + ereport(LOG,
> + (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
> + application_name, MyWalSnd->sync_standby_priority)));
>
> s/ the / a /
>
> offered by a transaction commit. This level of protection is referred
> - to as 2-safe replication in computer science theory.
> + to as 2-safe replication in computer science theory, and group-1-safe
> + (group-safe and 1-safe) when <varname>synchronous_commit</> is set to
> + more than <literal>remote_write</>.
>
> Why "more than"? I think those two words should be changed to "at
> least", or removed.
>
> + <para>
> + This syntax allows us to define a synchronous group that will wait for at
> + least N standbys of them, and a comma-separated list of group members that
> + are surrounded by parantheses. The special value <literal>*</> for server
> + name matches any standby.
> + By surrounding list of group members using parantheses, synchronous
> + standbys are chosen from that group using priority method.
> + </para>
>
> s/parantheses/parentheses/ (two occurrences)
>
> + <sect2 id="dedicated-language-for-multi-sync-replication-priority">
> + <title>Prioirty Method</title>
>
> s/Prioirty Method/Priority Method/

A couple more comments:

/*
- * If we aren't managing the highest priority standby then just leave.
+ * If the number of sync standbys is less than requested or we aren't
+ * managing the sync standby then just leave.
*/
- if (syncWalSnd != MyWalSnd)
+ if (!got_oldest || !am_sync)

s/ the sync / a sync /

+ /*
+ * Consider all pending standbys as sync if the number of them plus
+ * already-found sync ones is lower than the configuration requests.
+ */
+ if (list_length(result) + list_length(pending) <= SyncRepConfig->num_sync)
+ return list_concat(result, pending);

The cells from 'pending' will be attached to 'result', and 'result'
will be freed by the caller. But won't the List header object from
'pending' be leaked?

+ result = lappend_int(result, i);
+ if (list_length(result) == SyncRepConfig->num_sync)
+ {
+ list_free(pending);
+ return result; /* Exit if got enough sync standbys */
+ }

If we didn't take the early return in the list-not-long-enough case
mentioned above, we should *always* exit via this return statement,
right? Since we know that the pending list had enough elements to
reach num_sync. I think that is worth a comment, and also a "not
reached" comment at the bottom of the function, if it is true.

As a future improvement, I wonder if we could avoid recomputing the
current set of sync standbys in every walsender every time we call
SyncRepReleaseWaiters, perhaps by maintaining that set incrementally
in shmem when walsender states change etc.

I don't have any other comments, other than to say: thank you to all
the people who have contributed to this feature so far and I really
really hope it goes into 9.6!

--
Thomas Munro
http://www.enterprisedb.com
