Re: Support for N synchronous standby servers - take 2

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: amit(dot)kapila16(at)gmail(dot)com, michael(dot)paquier(at)gmail(dot)com, sawada(dot)mshk(at)gmail(dot)com, masao(dot)fujii(at)gmail(dot)com, jeff(dot)janes(at)gmail(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, thom(at)linux(dot)com, memissemerson(at)gmail(dot)com, josh(at)agliodbs(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2016-04-27 15:10:45
Message-ID: 27732.1461769845@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> Sorry, I have attached an empty patch. This is another one that should
> be with content.

I started to review this, and in passing came across this gem in
syncrep_scanner.l:

/*
* flex emits a yy_fatal_error() function that it calls in response to
* critical errors like malloc failure, file I/O errors, and detection of
* internal inconsistency. That function prints a message and calls exit().
* Mutate it to instead call ereport(FATAL), which terminates this process.
*
* The process that causes this fatal error should be terminated.
* Otherwise it has to abandon the new setting value of
* synchronous_standby_names and keep running with the previous one
* while the other processes switch to the new one.
* This inconsistency of the setting that each process is based on
* can cause a serious problem. Though it's basically not good idea to
* use FATAL here because it can take down the postmaster,
* we should do that in order to avoid such an inconsistency.
*/
#undef fprintf
#define fprintf(file, fmt, msg) syncrep_flex_fatal(fmt, msg)

static void
syncrep_flex_fatal(const char *fmt, const char *msg)
{
ereport(FATAL, (errmsg_internal("%s", msg)));
}

This is the faultiest reasoning possible. There are a hundred reasons why
a process might fail to absorb a GUC setting, and causing just one such
code path to FATAL out is not going to improve system stability one bit.

If you think it is absolutely imperative that all processes in the system
have identical synchronous_standby_names settings, then we need to make
it be PGC_POSTMASTER, not indulge in half-baked non-solutions like this.
But I'd like to know why that is so essential. It looks to me like what
matters is only whether each individual walsender thinks its client is
a sync standby, and so inconsistent settings between different walsenders
don't really matter. Which is a good thing, because if it's to remain
SIGHUP, you can't promise that they'll all absorb a new value at the same
instant anyway.

In short, I don't see any good reason not to make this be a plain ERROR
like it is in every other scanner in the backend.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-04-27 15:45:35 Re: pgindent
Previous Message Craig Ringer 2016-04-27 15:04:37 Re: 9.6 and fsync=off