Re: Replication server timeout patch

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <drfarina(at)acm(dot)org>
Subject: Re: Replication server timeout patch
Date: 2011-02-12 04:51:04
Message-ID: AANLkTi=X+ucrE6FRNvOQDidoHVkbQ5rG212fHqz_u0yf@mail.gmail.com
Lists: pgsql-hackers

On Feb 11, 2011 8:20 PM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> On 11.02.2011 22:11, Robert Haas wrote:
> >>>
> >>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina <drfarina(at)acm(dot)org> wrote:
> >>>>
> >>>> I split this out of the synchronous replication patch for independent
> >>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
> >>>> anything, but I just wanted to get it out there...I'll be around in
> >>>> Not Too Long to finish any other details.
> >>>
> >>> This looks like a useful and separately committable change.
> >>
> >> Hmm, so this patch implements a watchdog, where the master disconnects the
> >> standby if the heartbeat from the standby stops for more than
> >> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
> >> every wal_receiver_status_interval seconds.
> >>
> >> It would be nice if the master and standby could negotiate those settings.
> >> As the patch stands, it's easy to have a pathological configuration where
> >> replication_server_timeout < wal_receiver_status_interval, so that the
> >> master repeatedly disconnects the standby because it doesn't reply in time.
> >> Maybe the standby should report how often it's going to send a heartbeat,
> >> and master should wait for that long + some safety margin. Or maybe the
> >> master should tell the standby how often it should send the heartbeat?
> >
> > I guess the biggest use case for that behavior would be in a case
> > where you have two standbys, one of which doesn't send a heartbeat and
> > the other of which does. Then you really can't rely on a single
> > timeout.
> >
> > Maybe we could change the server parameter to indicate what multiple
> > of wal_receiver_status_interval causes a hangup, and then change the
> > client to notify the server what value it's using. But that gets
> > complicated, because the value could be changed while the standby is
> > running.
>
> On reflection I'm deeply uncertain this is a good idea. It's pretty
> hopeless to suppose that we can keep the user from choosing parameter
> settings which will cause them problems, and there are certainly far
> stupider things they could do than set replication_timeout <
> wal_receiver_status_interval. They could, for example, set fsync=off
> or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that
> last one out of the box). Any of those settings have the potential to
> thoroughly destroy their system in one way or another, and there's not
> a darn thing we can do about it. Setting up some kind of handshake
> system based on a multiple of the wal_receiver_status_interval is
> going to be complex, and it's not necessarily going to deliver the
> behavior someone wants anyway. If someone has
> wal_receiver_status_interval=10 on one system and =30 on another
> system, does it therefore follow that the timeouts should also be
> different by 3X? Perhaps, but it's non-obvious.
>
> There are two things that I think are pretty clear. If the receiver
> has wal_receiver_status_interval=0, then we should ignore
> replication_timeout for that connection. And also we need to make
> sure that the replication_timeout can't kill off a connection that is
> in the middle of streaming a large base backup. Maybe we should try
> to get those two cases right and not worry about the rest. Dan, can
> you check whether the base backup thing is a problem with this as
> implemented?

Yes, I will have something to say come Saturday.
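
For reference, the two cases Robert calls out above, plus the
misconfiguration Heikki describes, can be sketched roughly as follows. This
is only an illustration of the rules discussed in the thread, not code from
the patch: the function names, the in_base_backup flag, and the convention
that a timeout of 0 means "disabled" are all assumptions made for the example.

```python
from typing import Optional


def effective_timeout(replication_timeout: int,
                      status_interval: int,
                      in_base_backup: bool) -> Optional[int]:
    """Return the timeout (seconds) to enforce on one connection, or None.

    Hypothetical helper: the names and the "0 disables" rule are assumptions.
    """
    if replication_timeout <= 0:
        return None  # assumed: timeout switched off on the master
    if status_interval == 0:
        return None  # standby sends no heartbeat, so ignore the timeout
    if in_base_backup:
        return None  # never cut off a connection streaming a base backup
    return replication_timeout


def is_pathological(replication_timeout: int, status_interval: int) -> bool:
    """True when the master would time out before a heartbeat can arrive."""
    return 0 < replication_timeout < status_interval
```

With these definitions, a standby reporting every 10 seconds against a
60-second timeout is enforced normally, while a 5-second timeout against the
same standby is flagged as the repeated-disconnect configuration.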

--
fdr
