Re: Quorum commit for multiple synchronous replication.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Quorum commit for multiple synchronous replication.
Date: 2016-12-12 12:31:42
Message-ID: CAD21AoD8nibM68vt1XBPFYd5ZU61m5D_Cp1rQqsEbH1kETR75A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>> You could do that, but first I would code up the simplest, cleanest
>>>> algorithm you can think of and see if it even shows up in a 'perf'
>>>> profile. Microbenchmarking is probably overkill here unless a problem
>>>> is visible on macrobenchmarks.
>>>
>>> This is what I would go for! The current code is doing a simple thing:
>>> select the Nth element using qsort() after scanning each WAL sender's
>>> values. And I think that Sawada-san got it right. Even running on my
>>> laptop a pgbench run with 10 sync standbys using a data set that fits
>>> into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
>>> using perf top on a non-assert, non-debug build. Hash tables and
>>> allocations get a far larger share. Using the patch,
>>> SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
>>> nodes. Let's kick the ball for now. An extra patch could make things
>>> better later on if that's worth it.
>>
>> Yeah, since the both K and N could be not large these algorithm takes
>> almost the same time. And current patch does simple thing. When we
>> need over 100 or 1000 replication node the optimization could be
>> required.
>> Attached latest v9 patch.
>>
>
> Few comments:

Thank you for reviewing.

> + * In 10.0 we support two synchronization methods, priority and
> + * quorum. The number of synchronous standbys that transactions
> + * must wait for replies from and synchronization method are specified
> + * in synchronous_standby_names. This parameter also specifies a list
> + * of standby names, which determines the priority of each standby for
> + * being chosen as a synchronous standby. In priority method, the standbys
> + * whose names appear earlier in the list are given higher priority
> + * and will be considered as synchronous. Other standby servers appearing
> + * later in this list represent potential synchronous standbys. If any of
> + * the current synchronous standbys disconnects for whatever reason,
> + * it will be replaced immediately with the next-highest-priority standby.
> + * In quorum method, the all standbys appearing in the list are
> + * considered as a candidate for quorum commit.
>
> In the above description, is priority method represented by FIRST and
> quorum method by ANY in the synchronous_standby_names syntax? If so,
> it might be better to write about it explicitly.

Added description.

>
> 2.
> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
> }
>
> /*
> - * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
> - * from each input tape.
> - */
> - state->memtupsize = numInputTapes;
> - state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
> - USEMEM(state, GetMemoryChunkSpace(state->memtuples));
> -
> - /*
> - * Use all the remaining memory we have available for read buffers among
> - * the input tapes.
> + * Use all the spare memory we have available for read buffers among the
> + * input tapes.
>
> This doesn't belong to this patch.

Oops, fixed.

> 3.
> + * Return the list of sync standbys using according to synchronous method,
>
> In above sentence, "using according" seems to either incomplete or wrong usage.

Fixed.

> 4.
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + "invalid synchronization method is specified \"%d\"",
> + SyncRepConfig->sync_method));
>
> Here, the error message doesn't seem to aligned and you might want to
> use errmsg for the same.
>

Fixed.

> 5.
> + * In priority method, we need the oldest these positions among sync
> + * standbys. In quorum method, we need the newest these positions
> + * specified by SyncRepConfig->num_sync.
>
> /oldest these/oldest of these
> /newest these positions specified/newest of these positions as specified

Fixed.

> Instead of newest, can we consider to use latest?

Yeah, I changed it so.

> 6.
> + int sync_method; /* synchronization method */
> /* member_names contains nmembers consecutive nul-terminated C strings */
> char member_names[FLEXIBLE_ARRAY_MEMBER];
> } SyncRepConfigData;
>
> Can't we use 1 or 2 bytes to store sync_method information?

I changed it to uint8.

Attached latest v10 patch incorporated the review comments so far.
Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
000_quorum_commit_v10.patch text/x-diff 27.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2016-12-12 12:52:47 Re: Quorum commit for multiple synchronous replication.
Previous Message Rafa de la Torre 2016-12-12 11:35:08 Re: Fix for segfault in plpython's exception handling