Re: Hot standby read slaves exceed max delay on WAL segment. Replication lag.

From: Shaun Duncan <shaun(dot)duncan(at)gmail(dot)com>
To: Venkata Balaji Nagothi <vbnpgc(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org >> PG-General Mailing List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Hot standby read slaves exceed max delay on WAL segment. Replication lag.
Date: 2014-03-20 12:28:07
Message-ID: CAPHHVJ5N40yQZ_6E_PV1R9dYt3=zYTQ5A7FU0vkq4LJ7KbOG5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Mar 19, 2014 at 11:31 PM, Venkata Balaji Nagothi
<vbnpgc(at)gmail(dot)com>wrote:

> On Thu, Mar 20, 2014 at 8:19 AM, Shaun Duncan <shaun(dot)duncan(at)gmail(dot)com>wrote:
>
>>
>> On Mar 19, 2014 5:08 PM, "Venkata Balaji Nagothi" <vbnpgc(at)gmail(dot)com>
>> wrote:
>> >
>> >
>> > On Thu, Mar 20, 2014 at 4:12 AM, Shaun Duncan <shaun(dot)duncan(at)gmail(dot)com>
>> wrote:
>> >
>> >> This is on a production 9.0.15 install with 1 master, 5 hot standby
>> read only slaves.
>> >>
>> >> We've been trying to look into a situation where we're seeing that hot
>> standby read slaves are receiving WAL segments, but are exceeding
>> max_standby_archive_delay (60s) and max_standby_streaming_delay (60s) and
>> not applying changes. The slave will get the first segment and hang (we've
>> seen up to 30m before removing the slave from our read pool to catch up)
>> and get further and further behind the master. Furthermore, we have seen
>> that after a slave has caught up, putting it back into the read pool will
>> mean will almost immediately start to see this happen again. It acts as if
>> we had max_standby_* set to -1.
>> >>
>> >> I'm just looking for some ideas, hints, or suggestions as to what
>> might be going on here or what we might be doing wrong.
>> >>
>> >
>> > Have you noticed if the read requests on standby database are taking
>> long and loading up the server ? If yes, any idea why the queries on
>> standby are taking long to complete ?
>> >
>> > WAL replay would be paused ( until the read requests are served ) at
>> the time of conflicting queries on the standby site.
>> >
>> > Also, Do you see any delay in master sending the WALs to standby ?
>> >
>> >
>> > Venkata Balaji N
>> >
>> > Sr. Database Administrator
>> > Fujitsu Australia
>> >
>> >
>> >
>> >
>>
>> Yes we have seen queries against the read slave degrade severely while
>> this happens. We have at least optimized some inefficient queries. But I
>> would think (correct me if I am wrong) that queries would be canceled after
>> WAL replay has exceeded the max delay. This is something we haven't seen.
>> It behaves as if it is waiting for reads to complete.
>>
>
> No. Generally queries continue to run on Standby without any issues,
> except in the below 2 situations -
>
>
> - When *conflicting queries* (like DROP TABLE, DROP DATABASE etc..)
> are executed on master whilst standby is serving the read requests. This
> behavior can be controlled to some extent by setting max_standby_*_delay
> parameters (60s in your case). If there are any read requests in process at
> standby database,the conflicting queries would wait for 60s - if the read
> requests (Generally SELECTs) do not complete, then they are forcibly
> cancelled
>
> In-short max_standby_*_delay parameters only effect the
> replication behavior when conflicting queries are coming to standby.
>
> - When master is waiting to apply VACUUM related changes (removing
> dead rows) at the standby site. This behavior can be controlled to some
> extent by "vacuum_defer_cleanup_age". This implies number of transactions
> executed on the master database.
>
>
> If there are no conflicting queries from primary waiting to be applied at
> standby database - then queries would continue to run without any issues
> until
> statement_timeout ( if configured ) threshold is reached.
>
> Do you see any messages like "ERROR: canceling statement due to
> conflict... " in standby postgres logs ?
>
>
> http://www.postgresql.org/docs/9.0/static/hot-standby.html#HOT-STANDBY-CONFLICT
>
> Hope this helps !
>
> Venkata Balaji N
>
> Sr.Database Administrator
> Fujitsu Australia
>
>

Yes, we do see statements being canceled from time to time, just not when
we experience a lag situation with WALs not being applied. Is it just
entirely possible that there is too much work for one of the slaves to do
(hence the degraded query throughput we see) that it is unable to apply
WALs? In other words, there are no conflicts that occur, but WALs cannot be
applied an a timely manner? There's a noticeable pattern that in times of
lag, the overall number of slow queries dramatically increases (i.e. from a
few hundred to tens of thousands).

Thanks!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message basti 2014-03-20 12:48:47 Backup WAL Replication Server
Previous Message Granthana Biswas 2014-03-20 12:27:24 Dead rows not getting removed during vacuum