Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.

From: "mark" <dvlhntr(at)gmail(dot)com>
To: "'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>
Cc: "'Robert Haas'" <robertmhaas(at)gmail(dot)com>, <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5851: ROHS (read only hot standby) needs to be restarted manually in somecases.
Date: 2011-02-09 00:23:16
Message-ID: 058b01cbc7ef$8cfaa9e0$a6effda0$@com
Lists: pgsql-bugs


> -----Original Message-----
> From: Fujii Masao [mailto:masao(dot)fujii(at)gmail(dot)com]
> Sent: Tuesday, February 08, 2011 4:00 PM
> To: mark
> Cc: Robert Haas; pgsql-bugs(at)postgresql(dot)org
> Subject: Re: [BUGS] BUG #5851: ROHS (read only hot standby) needs to be
> restarted manually in somecases.
>
> On Wed, Feb 9, 2011 at 6:36 AM, mark <dvlhntr(at)gmail(dot)com> wrote:
> > this is the recovery.conf file, see any problems with it? maybe I
> > didn't get some of the syntax right?
> >
> > [postgres@<redacted> data9.0]$ cat recovery.conf
> > standby_mode = 'on'
> > primary_conninfo = 'host=<redacted> port=5432 user=postgres
> > keepalives_idle=30 keepalives_interval=30 keepalives_count=30'
>
> This setting would lead TCP keepalive to take about 930 seconds
> (= 30 + 30 * 30) to detect the network outage. If you want to stop
> replication as soon as the outage happens, you need to decrease
> the keepalive setting values.

What numbers would you suggest? I have been guessing and probably doing a
very poor job of it.

I am turning knobs and not getting any meaningful changes with respect to
my problem. So either I am not turning them correctly, or they are not the
right knobs for my problem.

Trying to fix my own ignorance here. (Should I move this off the bugs list,
since maybe it's not a bug?)

The keepalive settings have been left unspecified in the recovery file, they
have been specified in the recovery file, and I have tried each of the
following sets of values there:

(~two weeks and it died)
keepalives_idle=0
keepalives_interval=0
keepalives_count=0

(~two weeks and it died)
keepalives_idle=30
keepalives_interval=30
keepalives_count=30

(this didn't work either; I don't recall how long it lasted, maybe a month)
keepalives_idle=2100
keepalives_interval=0
keepalives_count=0
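
As a rough sketch of the arithmetic Fujii describes above (idle + interval *
count), the following compares the three sets of values listed here. libpq
documents a value of zero as "use the system default"; the defaults below are
the stock Linux ones (7200 / 75 / 9) and are an assumption about the standby
host, as is the helper name detection_seconds.

```python
# Stock Linux TCP keepalive defaults, used when a libpq parameter is 0.
# Assumption: the standby runs Linux with unmodified sysctl values.
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "count": 9}


def detection_seconds(idle, interval, count, defaults=LINUX_DEFAULTS):
    """Approximate seconds until TCP keepalive declares the peer dead."""
    idle = idle or defaults["idle"]              # 0 means "system default"
    interval = interval or defaults["interval"]
    count = count or defaults["count"]
    return idle + interval * count


# The three combinations tried in recovery.conf, in the order listed.
tried = [(0, 0, 0), (30, 30, 30), (2100, 0, 0)]

for settings in tried:
    print(settings, detection_seconds(*settings), "seconds")
```

Under those assumptions the three configurations come out at 7875, 930 and
2775 seconds respectively, so none of them would notice an outage in less
than about fifteen minutes.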

Background is basically this: I am trying to do streaming replication over a
WAN, shipping probably about 5GB of changes per day; the hardware on both ends
can easily keep up with that. It runs over a shared metro line, and I can
count on about 3-5MBytes per second depending on the time of day. I have
wal_keep_segments set to 250 (I don't care about the disk overhead for this,
since I wanted to avoid having to use WAL archiving). The link has been
severed more often than usual lately while some network changes are being
made, so while I would expect that to improve in the future, this isn't
exactly the most reliable connection. Getting these settings as right as I can
is therefore of value to me.
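
For a sense of scale, here is a back-of-the-envelope sketch of how long an
outage 250 retained segments could cover. It assumes the default 16MB WAL
segment size and that the 5GB per day arrives at a uniform rate, which real
traffic will not; the function name retention_hours is made up for
illustration.

```python
SEGMENT_MB = 16  # assumption: default WAL segment size


def retention_hours(keep_segments, daily_change_gb, segment_mb=SEGMENT_MB):
    """Hours of WAL that the retained segments hold at an average change rate."""
    retained_mb = keep_segments * segment_mb
    daily_change_mb = daily_change_gb * 1024
    return retained_mb * 24 / daily_change_mb


# 250 segments and about 5GB of changes per day, as described above.
print(retention_hours(250, 5), "hours")
```

That is 4000MB of WAL, or a little under 19 hours at the average rate. A burst
of writes would shorten that window considerably.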

Typically I see streaming replication break down for good a few hours after
something causes an interruption in networking. Nagios notifications lag
somewhat, though not by hours, and have to go through a few people before I
find out about it. When checking the Nagios pages in their logs, I don't see
pages about the distance between the master and the standby getting bigger
during this time. Then, once I see the first unexpected EOF, the distance
between the master and standby gets bigger and bigger until it gets fixed or
we have to re-sync the whole base over.

Again, I can't seem to duplicate this problem on demand with virtual
machines. I start up a master and standby, set up streaming rep, kick off a
multi-hour or multi-day pgbench run, and start messing with networking. Every
time I try to duplicate this synthetically, the standby picks right back up
where it left off and catches up.

I am at a loss so I do appreciate everyone's help.

Thanks in advance

-Mark

>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
