Re: Cascading replication: should we detect/prevent cycles?

From: Joshua Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Subject: Re: Cascading replication: should we detect/prevent cycles?
Date: 2013-01-05 20:50:05
Message-ID: 1508954314.133342.1357419005538.JavaMail.root@agliodbs.com
Lists: pgsql-hackers

Robert,

> I'm sure it's possible; I don't *think* it's terribly easy.  The usual
> algorithm for cycle detection is to have each node send to the next
> node the path that the data has taken.  But, there's no unique
> identifier for each slave that I know of - you could use IP address,
> but that's not really unique.  And, if the WAL passes through an
> archive, how do you deal with that?

Not that I know how to do this, but it seems like a more direct approach is to check whether there's a master anywhere up the line.  Hmmmm.  Still sounds fairly difficult.
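To make "check whether there's a master anywhere up the line" concrete, here is a minimal offline sketch. It assumes a hand-assembled map from each server to its upstream (for example, read from each standby's primary_conninfo), and it assumes every server has a unique name, which is exactly the thing Robert points out we don't have today. The function name and the map are hypothetical, and it ignores WAL that arrives through an archive.

```python
from typing import Dict, List, Optional, Tuple


def trace_upstream(
    upstream: Dict[str, Optional[str]], start: str
) -> Tuple[List[str], bool]:
    """Follow upstream links from `start`.

    `upstream` maps a (hypothetical, unique) server name to the name of the
    server it replicates from, or None if it has no upstream (a master).
    Returns (path, found_master). found_master is False when the walk
    revisits a server, i.e. the chain loops back on itself.
    """
    path = [start]
    seen = {start}
    node = start
    while True:
        # A server with no recorded upstream is treated as a master.
        parent = upstream.get(node)
        if parent is None:
            return path, True
        path.append(parent)
        if parent in seen:
            # We have been here before: no master up the line.
            return path, False
        seen.add(parent)
        node = parent


# Healthy cascade: c -> b -> a (master).
healthy = {"a": None, "b": "a", "c": "b"}
# After a botched recovery: the first replica follows the last one.
cyclic = {"b": "d", "c": "b", "d": "c"}

print(trace_upstream(healthy, "c"))
print(trace_upstream(cyclic, "b"))
```

The path carried along the walk is the same information Robert's "send the path to the next node" algorithm would ship with the data; here it is only collected after the fact by whoever is troubleshooting.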

> I'm sure somebody could figure
> all of this stuff out, but it seems fairly complicated for the benefit
> we'd get.  I just don't think this is going to be a terribly common
> problem; if it turns out I'm wrong, I may revise my opinion.  :-)

I don't think it'll be that common either.  The problem is that when it does happen, it'll be very hard for the hapless sysadmin involved to troubleshoot.

> To me, it seems that lag monitoring between master and standby is
> something that anyone running a complex replication configuration
> should be doing - and yeah, I think anything involving four standbys
> (or cascading) qualifies as complex.  If you're doing that, you should
> notice pretty quickly that your replication lag is increasing
> steadily.

There are many reasons why replication lag would increase steadily, so rising lag alone wouldn't point anyone to a cycle.

> You might also check pg_stat_replication on the master and
> notice that there are no connections there any more.

Well, if you've created a true cycle, every server has one or more replicas.  The original case I presented was the most probable cause of accidental cycles: the original master dies, and the on-call sysadmin accidentally connects the first replica to the last replica while trying to recover the cluster.

AFAICT, the only way to troubleshoot a cycle is to test every server in the network to see if it's a master and has replicas, and if no server is a master with replicas, it's a cycle.  Again, not fast or intuitive.
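As a rough illustration of that manual check, here is a sketch that takes facts gathered by hand from every server and reports whether any of them is a master with replicas. It assumes you have already recorded, per server, the result of pg_is_in_recovery() and the number of rows in pg_stat_replication; the ServerStatus type and both function names are made up for the example, and nothing here connects to a database.

```python
from typing import Dict, List, NamedTuple


class ServerStatus(NamedTuple):
    # Result of SELECT pg_is_in_recovery() on that server.
    in_recovery: bool
    # Number of rows in pg_stat_replication on that server.
    replica_count: int


def masters_with_replicas(statuses: Dict[str, ServerStatus]) -> List[str]:
    """Names of servers that are not in recovery and have at least one replica."""
    return sorted(
        name
        for name, status in statuses.items()
        if not status.in_recovery and status.replica_count > 0
    )


def looks_like_cycle(statuses: Dict[str, ServerStatus]) -> bool:
    """True if no surveyed server is a master with replicas."""
    return bool(statuses) and not masters_with_replicas(statuses)


# Every server is in recovery and every server has a replica: a cycle.
survey = {
    "b": ServerStatus(in_recovery=True, replica_count=1),
    "c": ServerStatus(in_recovery=True, replica_count=1),
    "d": ServerStatus(in_recovery=True, replica_count=1),
}
print(looks_like_cycle(survey))
```

The check is only as good as the survey: miss one server, or survey a cluster whose master is simply down, and it will also report "no master with replicas", so treat a True result as a hint to go look rather than as proof.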

> Could someone
> miss those tell-tale signs?  Sure.  But they could also set
> autovacuum_naptime to an hour and then file a support ticket
> complaining about table bloat - and they do.  Personally, as user
> screw-ups go, I'd consider that scenario (and its fourteen cousins,
> twenty-seven second cousins, and three hundred and ninety two other
> extended family members) as higher-priority and lower effort to fix
> than this particular thing.

I agree that this isn't a particularly high-priority issue.  I do think it should go on the TODO list, though, just in case we get a GSoC student or other new contributor who wants to tackle it.

--Josh