Re: Cascading replication: should we detect/prevent cycles?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cascading replication: should we detect/prevent cycles?
Date: 2013-02-02 18:41:12
Message-ID: CA+TgmoZdO4qZyubHv1tXRUiiRT5s9UiM8tRLF78PGs-iL3xJ8Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 31, 2013 at 9:48 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 02/01/2013 12:01 PM, Josh Berkus wrote:
>>> If we're going to start installing safeguards against doing stupid
>>> things, there's a long list of scenarios that happen far more
>>> regularly than this ever will and cause far more damage.
>>
>> What's wrong with making it easier for sysadmins to troubleshoot things?
>> Again, I'm not talking about erroring out, I'm talking about logging a
>> warning.
>
> Or to put it another way: Robert, you just did a "nobody wants that" to
> me. I thought you were opposed to such things on this list.

I respectfully disagree. I'm saying that *I* don't want that, which I
think is different. To interpret my opposition to saying "nobody
wants that" to mean "you can never oppose anything someone else thinks
is a good idea" would preclude meaningful dialogue on most of what we
talk about here. And clearly there is at least some demand for this
feature, because you and Craig Ringer both want it. So let me try to
restate my objection to this specific feature more clearly.

I think that we should be careful about warning the user about things
that might not actually be mistakes. I'm not aware that we currently
issue ANY warnings of that type. When we emit error messages, we
sometimes suggest one possible cause of the error, and such messages
are clearly labelled as HINT. But we don't, for example, emit a
WARNING or ERROR about a DELETE or UPDATE statement that
lacks a WHERE clause, even though many people might like to have such
a feature. We don't warn a user "hey, float8 is imprecise, consider
using numeric" or "hey, numeric is slow, consider using float8" or
"setting autovacuum_naptime to an hour is probably dumber than pouring
sugar in your gas tank", even though all of those things are true and
some people might like to be warned. We only warn or error out when
something happens that we are 100% sure is bad. And, in this
particular case, it has been suggested that there are legitimate
reasons why a replication topology might temporarily involve loops, so
I believe this fails that criterion.

Second, we have often discussed the importance of avoiding log spam.
Warnings that are likely to be repeated a large number of times when
they occur have repeatedly been voted down on those grounds. I
believe that objection also applies to this case. It is more
appropriate to make information about the status of the system
available via some status-inquiry function; for example, if you were
to recast this as adding a slave-side function that attempts to return
the IP of the current master, or NULL if no master, that would answer
this objection (but not necessarily all of the other ones).
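
To illustrate the status-inquiry idea, here is a minimal sketch of how an
external tool might use such a function. It assumes each node can report
the address of its current master (or nothing if it has none); no such
function is named in this thread, so the `master_of` mapping and the
`find_cycle` helper are hypothetical. The caveats about addresses not
being unique still apply.

```python
from typing import Dict, List, Optional


def find_cycle(master_of: Dict[str, Optional[str]], start: str) -> Optional[List[str]]:
    """Follow reported masters from `start`; return the loop if one is reached.

    `master_of` maps a node address to whatever its (hypothetical)
    status function returned: the master's address, or None for NULL.
    """
    seen: List[str] = []
    node: Optional[str] = start
    while node is not None:
        if node in seen:
            # Everything from the first visit of `node` onward forms the loop.
            return seen[seen.index(node):]
        seen.append(node)
        # Nodes we have no report for are treated as having no master.
        node = master_of.get(node)
    return None


# Healthy cascade: c follows b, b follows a, a has no master.
print(find_cycle({"a": None, "b": "a", "c": "b"}, "c"))

# Misconfigured topology: a follows c, closing the loop.
print(find_cycle({"a": "c", "b": "a", "c": "b"}, "a"))
```

Nothing is written to the server log in this arrangement; the check only
runs when an administrator asks for it.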

Third, we usually apply a criterion that warnings or errors must
represent conditions that we can reliably detect; in other words, we
typically do not add checks for situations that we will only sometimes
be able to identify. And, in this case, it's a little unclear how we
would actually identify loops. Presumably, we'd do it by sending a
chain of unique per-node identifiers along with the WAL, and looking
for your own identifier in the path, but we don't have any sort of
unique per-node identifier right now, and how would you create one?
If someone shuts down the cluster, duplicates it, and starts up both
copies, we want that to work. Any identifier embedded in the cluster
by such a process would be duplicated. You could use something like
the node IP and port number, which wouldn't have that pitfall, but as
we all know, IPs can be duplicated (e.g. due to NAT) so this isn't
necessarily reliable either. If you do come up with a suitable unique
per-node identifier, then this is fairly simple to make work for
streaming replication, but it's tricky to see how to make it work with
archiving.
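
The path-based scheme described above can be sketched roughly as
follows. This is only an illustration under the assumption that a usable
per-node identifier exists; PostgreSQL has no such identifier or path
field, and `forward_path`, `LoopDetected`, and the node names are made up
for the example. The last case shows the duplicated-cluster problem: a
clone that inherited its identifier is reported as a loop when it is not.

```python
from typing import List


class LoopDetected(Exception):
    """Raised when a node finds its own identifier in the upstream path."""


def forward_path(upstream_path: List[str], node_id: str) -> List[str]:
    """Return the identifier chain this node would send downstream.

    `upstream_path` is the chain received from the upstream node (empty
    for the original master). `node_id` is this node's own identifier;
    how to make it unique is exactly the open question.
    """
    if node_id in upstream_path:
        raise LoopDetected(f"{node_id} already appears in {upstream_path}")
    return upstream_path + [node_id]


# Healthy cascade: A -> B -> C
path = forward_path([], "A")
path = forward_path(path, "B")
path = forward_path(path, "C")
print(path)

# Genuine loop: C feeds back into A.
try:
    forward_path(path, "A")
except LoopDetected as exc:
    print("loop:", exc)

# False positive: a copy of B's cluster kept the identifier "B" and now
# cascades from the original B. The check fires although no loop exists.
try:
    forward_path(["A", "B"], "B")
except LoopDetected as exc:
    print("false positive:", exc)
```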

Is that more clear?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
