Re: Replication server timeout patch

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Daniel Farina <drfarina(at)acm(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication server timeout patch
Date: 2011-02-12 04:20:48
Message-ID: AANLkTi=rnYRRq2rucXhGBVHzWhQ=_Fj5bDPiNxrAe+ks@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 11, 2011 at 4:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Feb 11, 2011 at 4:30 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 11.02.2011 22:11, Robert Haas wrote:
>>>
>>> On Fri, Feb 11, 2011 at 2:02 PM, Daniel Farina<drfarina(at)acm(dot)org>  wrote:
>>>>
>>>> I split this out of the synchronous replication patch for independent
>>>> review. I'm dashing out the door, so I haven't put it on the CF yet or
>>>> anything, but I just wanted to get it out there...I'll be around in
>>>> Not Too Long to finish any other details.
>>>
>>> This looks like a useful and separately committable change.
>>
>> Hmm, so this patch implements a watchdog, where the master disconnects the
>> standby if the heartbeat from the standby stops for more than
>> 'replication_[server]_timeout' seconds. The standby sends the heartbeat
>> every wal_receiver_status_interval seconds.
>>
>> It would be nice if the master and standby could negotiate those settings.
>> As the patch stands, it's easy to have a pathological configuration where
>> replication_server_timeout < wal_receiver_status_interval, so that the
>> master repeatedly disconnects the standby because it doesn't reply in time.
>> Maybe the standby should report how often it's going to send a heartbeat,
>> and master should wait for that long + some safety margin. Or maybe the
>> master should tell the standby how often it should send the heartbeat?
>
> I guess the biggest use case for that behavior would be in a case
> where you have two standbys, one of which doesn't send a heartbeat and
> the other of which does.  Then you really can't rely on a single
> timeout.
>
> Maybe we could change the server parameter to indicate what multiple
> of wal_receiver_status_interval causes a hangup, and then change the
> client to notify the server what value it's using.  But that gets
> complicated, because the value could be changed while the standby is
> running.

On reflection I'm deeply uncertain this is a good idea. It's pretty
hopeless to suppose that we can keep the user from choosing parameter
settings which will cause them problems, and there are certainly far
stupider things they could do than set replication_timeout <
wal_receiver_status_interval. They could, for example, set fsync=off
or work_mem=4GB or checkpoint_segments=3 (never mind that we ship that
last one out of the box). Any of those settings has the potential to
thoroughly destroy their system in one way or another, and there's not
a darn thing we can do about it. Setting up some kind of handshake
system based on a multiple of the wal_receiver_status_interval is
going to be complex, and it's not necessarily going to deliver the
behavior someone wants anyway. If someone has
wal_receiver_status_interval=10 on one system and =30 on another
system, does it therefore follow that the timeouts should also be
different by 3X? Perhaps, but it's non-obvious.
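
To make the 3X question concrete, here is a minimal sketch of what a
multiple-based scheme would compute. It is not anything in the patch:
the function and the timeout_multiple parameter are hypothetical names
made up for illustration, and it assumes each standby has reported its
own wal_receiver_status_interval to the master.

```python
def effective_timeout(status_interval_s: int, timeout_multiple: int) -> int:
    """Hypothetical: seconds of silence the master would tolerate for one standby.

    status_interval_s stands in for the standby's wal_receiver_status_interval;
    timeout_multiple is an assumed server-side setting, not a real GUC.
    """
    return status_interval_s * timeout_multiple


# Two standbys with different status intervals and the same multiple of 3.
for interval in (10, 30):
    print(f"interval={interval}s -> timeout={effective_timeout(interval, 3)}s")
```

With a multiple of 3, the standby at 10 seconds gets a 30-second timeout
and the one at 30 seconds gets 90 seconds, whether or not the
administrator wanted them to differ.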

There are two things that I think are pretty clear. If the receiver
has wal_receiver_status_interval=0, then we should ignore
replication_timeout for that connection. And also we need to make
sure that the replication_timeout can't kill off a connection that is
in the middle of streaming a large base backup. Maybe we should try
to get those two cases right and not worry about the rest. Dan, can
you check whether the base backup thing is a problem with this as
implemented?
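
As a rough sketch of those two cases (illustrative only, not the
patch's code): the function and argument names below are hypothetical,
it assumes the master can learn the standby's
wal_receiver_status_interval, which nothing in this thread says it
currently can, and it assumes a timeout of zero means the check is
disabled.

```python
def should_disconnect(idle_seconds: float,
                      replication_timeout: int,
                      standby_status_interval: int,
                      streaming_base_backup: bool) -> bool:
    """Hypothetical master-side watchdog check; all names are illustrative."""
    # Assumption: a timeout of 0 (or less) means the watchdog is switched off.
    if replication_timeout <= 0:
        return False
    # A standby that never sends status updates cannot be judged by their absence.
    if standby_status_interval == 0:
        return False
    # A large base backup can legitimately outlast the timeout.
    if streaming_base_backup:
        return False
    # Otherwise, hang up once the standby has been silent for too long.
    return idle_seconds > replication_timeout
```

For example, should_disconnect(120, 60, 0, False) returns False because
that standby has status updates turned off, while
should_disconnect(120, 60, 10, False) returns True.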

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
