Re: Quorum commit for multiple synchronous replication.

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Quorum commit for multiple synchronous replication.
Date: 2016-12-12 12:52:47
Message-ID: CAHGQGwHBa3hU+a8uONkHKEr=KsW8w8O38z_f=q0cgYeXNQNwvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 12, 2016 at 9:31 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>> On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>>> On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>>> You could do that, but first I would code up the simplest, cleanest
>>>>> algorithm you can think of and see if it even shows up in a 'perf'
>>>>> profile. Microbenchmarking is probably overkill here unless a problem
>>>>> is visible on macrobenchmarks.
>>>>
>>>> This is what I would go for! The current code is doing a simple thing:
>>>> select the Nth element using qsort() after scanning each WAL sender's
>>>> values. And I think that Sawada-san got it right. Even running on my
>>>> laptop a pgbench run with 10 sync standbys using a data set that fits
>>>> into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
>>>> using perf top on a non-assert, non-debug build. Hash tables and
>>>> allocations get a far larger share. Using the patch,
>>>> SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
>>>> nodes. Let's kick the ball for now. An extra patch could make things
>>>> better later on if that's worth it.
>>>
>>> Yeah, since the both K and N could be not large these algorithm takes
>>> almost the same time. And current patch does simple thing. When we
>>> need over 100 or 1000 replication node the optimization could be
>>> required.
>>> Attached latest v9 patch.
>>>
>>
>> Few comments:
>
> Thank you for reviewing.
>
>> + * In 10.0 we support two synchronization methods, priority and
>> + * quorum. The number of synchronous standbys that transactions
>> + * must wait for replies from and synchronization method are specified
>> + * in synchronous_standby_names. This parameter also specifies a list
>> + * of standby names, which determines the priority of each standby for
>> + * being chosen as a synchronous standby. In priority method, the standbys
>> + * whose names appear earlier in the list are given higher priority
>> + * and will be considered as synchronous. Other standby servers appearing
>> + * later in this list represent potential synchronous standbys. If any of
>> + * the current synchronous standbys disconnects for whatever reason,
>> + * it will be replaced immediately with the next-highest-priority standby.
>> + * In quorum method, the all standbys appearing in the list are
>> + * considered as a candidate for quorum commit.
>>
>> In the above description, is priority method represented by FIRST and
>> quorum method by ANY in the synchronous_standby_names syntax? If so,
>> it might be better to write about it explicitly.
>
> Added description.
>
>>
>> 2.
>> --- a/src/backend/utils/sort/tuplesort.c
>> +++ b/src/backend/utils/sort/tuplesort.c
>> @@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
>> }
>>
>> /*
>> - * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
>> - * from each input tape.
>> - */
>> - state->memtupsize = numInputTapes;
>> - state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
>> - USEMEM(state, GetMemoryChunkSpace(state->memtuples));
>> -
>> - /*
>> - * Use all the remaining memory we have available for read buffers among
>> - * the input tapes.
>> + * Use all the spare memory we have available for read buffers among the
>> + * input tapes.
>>
>> This doesn't belong to this patch.
>
> Oops, fixed.
>
>> 3.
>> + * Return the list of sync standbys using according to synchronous method,
>>
>> In above sentence, "using according" seems to either incomplete or wrong usage.
>
> Fixed.
>
>> 4.
>> + else
>> + ereport(ERROR,
>> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + "invalid synchronization method is specified \"%d\"",
>> + SyncRepConfig->sync_method));
>>
>> Here, the error message doesn't seem to aligned and you might want to
>> use errmsg for the same.
>>
>
> Fixed.
>
>> 5.
>> + * In priority method, we need the oldest these positions among sync
>> + * standbys. In quorum method, we need the newest these positions
>> + * specified by SyncRepConfig->num_sync.
>>
>> /oldest these/oldest of these
>> /newest these positions specified/newest of these positions as specified
>
> Fixed.
>
>> Instead of newest, can we consider to use latest?
>
> Yeah, I changed it so.
>
>
>> 6.
>> + int sync_method; /* synchronization method */
>> /* member_names contains nmembers consecutive nul-terminated C strings */
>> char member_names[FLEXIBLE_ARRAY_MEMBER];
>> } SyncRepConfigData;
>>
>> Can't we use 1 or 2 bytes to store sync_method information?
>
> I changed it to uint8.
>
> Attached latest v10 patch incorporated the review comments so far.
> Please review it.

Thanks for updating the patch!

Do we need to update postgresql.conf.sample?

+{any_ident} {
+ yylval.str = pstrdup(yytext);
+ return ANY;
+ }
+{first_ident} {
+ yylval.str = pstrdup(yytext);
+ return FIRST;
+ }

Why is pstrdup(yytext) necessary here?

+ standby_list { $$ =
create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+ | NUM '(' standby_list ')' { $$ =
create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+ | ANY NUM '(' standby_list ')' { $$ =
create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+ | FIRST NUM '(' standby_list ')' { $$ =
create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }

Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
differently from curent version while "list" works in the same way as
current one) very confusing?

I prefer to either of

1. break the backward-compatibility, i.e., treat the first syntax of
"standby_list" as quorum commit
2. keep the backward-compatibility, i.e., treat the second syntax of
"NUM (standby_list)" as sync rep with the priority

+ <literal>pg_stat_replication</></link> view). If the keyword
+ <literal>FIRST</> is specified, other standby servers appearing
+ later in this list represent potential synchronous standbys.
+ If any of the current synchronous standbys disconnects for
+ whatever reason, it will be replaced immediately with the
+ next-highest-priority standby. Specifying more than one standby
+ name can allow very high availability.

It seems strange to explain the behavior of FIRST before explaining
the syntax of synchronous_standby_names and FIRST.

Regards,

--
Fujii Masao

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-12-12 12:56:05 Re: pgcrypto compilation error due to stack-allocated EVP_CIPHER_CTX
Previous Message Masahiko Sawada 2016-12-12 12:31:42 Re: Quorum commit for multiple synchronous replication.