Re: replication lag despite corrective config

From: Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com>
To: Rene Romero Benavides <rene(dot)romero(dot)b(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: replication lag despite corrective config
Date: 2018-11-21 05:08:03
Message-ID: afff2e6c-d072-da94-0a6e-a530f448e9a1@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Rene,

On 11/19/18 8:46 PM, Rene Romero Benavides wrote:
> Not sure about the root cause but I can make these observations and
> raise some questions:
> 1) 9.6.6 is five bug fix versions behind
Valid point to raise.
> 2) 300GB is so big a table, wouldn't make sense to you to partition it ?
> 2a) or if it's partitioned, doesn't the time of creation or dropping
> of new partitions match the time of the conflict?
Partitioning is in the works, but none at the moment.
>
> 3) can you track long running transactions on the master?
> 4) what are the isolation levels on master / replica?

Transaction times on the master max out around two minutes. On the
replica they are much longer -- numerous 1 - 2 hour transactions per
day, and occasional ones as long as 10 - 20 hours. Isolation levels are
read committed everywhere.

> 5) check for active locks in the replica, I guess you should see some
> blocked transactions during big delays, I've seen this in the past
> when standby_feedback is turned off.
> 6) any out of the ordinary messages in the replica's logs? any
> evidence that has been canceling statements?

I'll make a note to record the active locks next time. I haven't seen
anything unusual in the logs during these incidents, but have observed
statements getting canceled at other times, which is why I think the
config mostly works.

> 7) are master and replica exactly the same in terms of resources and
> main parameters?
> 8) how is performance in both nodes while the big delay is happening?
> IO / cpu load / etc.

This brings up a good detail I forgot to mention originally. During the
last incident, IO utilization on the replica was near 100%, and had been
for several hours, which I believe was due to the long queries I
canceled.  Now that I think about it, I wonder if the lag may have
arisen from IO contention between the query and WAL replay, rather than
a query conflict.

>
> Also, check this out:
> https://www.alibabacloud.com/forum/read-383
>
Thanks, interesting reading.
>
> Am Mo., 19. Nov. 2018 um 21:46 Uhr schrieb Wyatt Alt
> <wyatt(dot)alt(at)gmail(dot)com <mailto:wyatt(dot)alt(at)gmail(dot)com>>:
>
> Sorry, I see now there was a similar question a few days ago:
> https://www.postgresql.org/message-id/CAJw4d1WtzOdYzd8Nq2=uFK+Z0JY0L_pfg9TvCWPrmt3NCZq9GA@mail.gmail.com
>
> Two ideas proposed (aside from disconnects):
> * Autovacuum is truncating a page on the master and taking an
> AccessExclusiveLock on the table in use on the replica
> * A "pin conflict", which I'm still unfamiliar with.
>
> The user's response says they are in the first bucket, but the
> argument relies on max_standby_streaming_delay set to -1, while
> mine is 5 minutes. I need to understand pin conflicts better, but
> the likely scenario Andrew outlined doesn't apply to me. My
> offending queries were doing bitmap heap scans on a 300GB table.
>
> Reading the thread I see Andres ask for the "precise conflict" the
> user gets -- is there a way I can get that without a datadir? And
> to re-frame the original question, are there causes of replication
> lag that max_standby_streaming_delay would not be expected to
> prevent, that would be resolved by killing long standby queries?
> If so, what's the best way to confirm?
>
> Wyatt
>
> On Mon, Nov 19, 2018 at 5:46 PM Wyatt Alt <wyatt(dot)alt(at)gmail(dot)com
> <mailto:wyatt(dot)alt(at)gmail(dot)com>> wrote:
>
> I've been struggling to eliminate replication lag on a
> Postgres 9.6.6 instance on Amazon RDS. I believe the lag is
> caused by early cleanup conflicts from vacuums on the master,
> because I can reliably resolve it by killing long-running
> queries on the standby. I most recently saw ten hours of lag
> on Saturday and addressed it this way.
>
> The standby is running with
> hot_standby_feedback = on
> max_standby_streaming_delay = 5min
> max_standby_archive_delay = 30s
>
> I am not using replication slots on the primary due to
> reported negative interactions with pg_repack on large tables.
>
> My rationale for the first two settings is that
> hot_standby_feedback should address my issues almost all the
> time, but that max_standby_streaming_delay would sometimes be
> necessary as a fallback, for instance in cases of a transient
> connection loss between the standby and primary. I believe
> these settings are mostly working, because lag is less
> frequent than it was when I configured them.
>
> My questions are,
> * Am I overlooking anything in my configuration?
> * What would explain lag caused by query conflicts given the
> max_standby_streaming_delay setting? Shouldn't those queries
> be getting killed?
> * Is there any particular diagnostic info I should be
> collecting on the next occurrence, to help me figure out the
> cause? Note that as I'm on RDS, I don't have direct access to
> the datadir -- just psql.
>
> Thanks for any advice!
> Wyatt
>
>
>
> --
> El genio es 1% inspiración y 99% transpiración.
> Thomas Alva Edison
> http://pglearn.blogspot.mx/
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Mohammed Siddiqui 2018-11-21 06:16:27 ODBC Driver for 9.4
Previous Message Wyatt Alt 2018-11-21 02:36:23 Re: replication lag despite corrective config