Re: Re: [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary

From: Sergey Shchukin <shchukin(dot)s(dot)a(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: Radovan Jablonovsky <radovan(dot)jablonovsky(at)replicon(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Re: [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary
Date: 2015-03-17 10:22:27
Message-ID: 55080063.7020104@gmail.com
Lists: pgadmin-support pgsql-admin pgsql-general pgsql-hackers pgsql-performance

On 05.03.2015 11:25, Jim Nasby wrote:
> On 2/27/15 5:11 AM, Sergey Shchukin wrote:
>>
>> show max_standby_streaming_delay;
>> max_standby_streaming_delay
>> -----------------------------
>> 30s
>
> We both need to be more clear about which server we're talking about
> (master or replica).
>
> What are max_standby_streaming_delay and max_standby_archive_delay set
> to *on the replica*?
>
> My hope is that one or both of those is set to somewhere around 8
> minutes on the replica. That would explain everything.
>
> If that's not the case then I suspect what's happening is there's
> something running on the replica that isn't checking for interrupts
> frequently enough. That would also explain it.
>
> When replication hangs, is the replication process using a lot of CPU?
> Or is it just sitting there? What does the process status for the replay
> process show?
>
> Can you get a trace of the replay process on the replica when this is
> happening, to see where it's spending all its time?
>
> How are you generating these log lines?
> Tue Feb 24 15:05:07 MSK 2015 Stream: MASTER-masterdb:79607161592048
> SLAVE:79607161550576 Replay:79607160986064 :: REPLAY 592 KBytes
> (00:00:00.398376 seconds)
>
> Do you see the confl_* fields in pg_stat_database_conflicts on the
> *replica* increasing?

Hi Jim,

max_standby_streaming_delay and max_standby_archive_delay are both set
to 30s, on the master and on the replica.
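
(For reference, both things can be checked on the replica with standard
catalog views, nothing specific to our setup:)

  -- on the replica: confirm the standby delay settings
  SELECT name, setting, unit
    FROM pg_settings
   WHERE name IN ('max_standby_streaming_delay', 'max_standby_archive_delay');

  -- on the replica: see whether any confl_* counters are increasing
  SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
         confl_bufferpin, confl_deadlock
    FROM pg_stat_database_conflicts;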

I don't see any specific or heavy workload while the apply process is
hanging, just normal queries as usual.

But I do see increased disk activity while the apply issue is ongoing:

DSK | sdc | busy 61% | read 11511 | write 4534 | KiB/r 46 | KiB/w 4 | MBr/s 52.78 | MBw/s 1.88 | avq 1.45 | avio 0.38 ms
DSK | sde | busy 60% | read 11457 | write 4398 | KiB/r 46 | KiB/w 4 | MBr/s 51.97 | MBw/s 1.83 | avq 1.47 | avio 0.38 ms
DSK | sdd | busy 60% | read  9673 | write 4538 | KiB/r 61 | KiB/w 4 | MBr/s 58.24 | MBw/s 1.88 | avq 1.47 | avio 0.42 ms
DSK | sdj | busy 59% | read  9576 | write 4177 | KiB/r 63 | KiB/w 4 | MBr/s 59.30 | MBw/s 1.75 | avq 1.48 | avio 0.43 ms
DSK | sdh | busy 59% | read  9615 | write 4305 | KiB/r 63 | KiB/w 4 | MBr/s 59.23 | MBw/s 1.80 | avq 1.48 | avio 0.42 ms
DSK | sdf | busy 59% | read  9483 | write 4404 | KiB/r 63 | KiB/w 4 | MBr/s 59.11 | MBw/s 1.83 | avq 1.47 | avio 0.42 ms
DSK | sdi | busy 59% | read 11273 | write 4173 | KiB/r 46 | KiB/w 4 | MBr/s 51.50 | MBw/s 1.75 | avq 1.43 | avio 0.38 ms
DSK | sdg | busy 59% | read 11406 | write 4297 | KiB/r 46 | KiB/w 4 | MBr/s 51.66 | MBw/s 1.80 | avq 1.46 | avio 0.37 ms

It doesn't seem to be hitting an upper I/O limit, though.

Normally the disks are busy at 20-45%:

DSK | sde | busy 29% | read 6524 | write 14426 | KiB/r 26 | KiB/w 5 | MBr/s 17.08 | MBw/s 7.78 | avq 10.46 | avio 0.14 ms
DSK | sdi | busy 29% | read 6590 | write 14391 | KiB/r 26 | KiB/w 5 | MBr/s 17.19 | MBw/s 7.76 | avq  8.75 | avio 0.14 ms
DSK | sdg | busy 29% | read 6547 | write 14401 | KiB/r 26 | KiB/w 5 | MBr/s 16.94 | MBw/s 7.60 | avq  7.28 | avio 0.14 ms
DSK | sdc | busy 29% | read 6835 | write 14283 | KiB/r 27 | KiB/w 5 | MBr/s 18.08 | MBw/s 7.74 | avq  8.77 | avio 0.14 ms
DSK | sdf | busy 23% | read 3808 | write 14391 | KiB/r 36 | KiB/w 5 | MBr/s 13.49 | MBw/s 7.78 | avq 12.88 | avio 0.13 ms
DSK | sdd | busy 23% | read 3747 | write 14229 | KiB/r 33 | KiB/w 5 | MBr/s 12.32 | MBw/s 7.74 | avq 10.07 | avio 0.13 ms
DSK | sdj | busy 23% | read 3737 | write 14336 | KiB/r 36 | KiB/w 5 | MBr/s 13.16 | MBw/s 7.76 | avq 10.48 | avio 0.13 ms
DSK | sdh | busy 23% | read 3793 | write 14362 | KiB/r 35 | KiB/w 5 | MBr/s 13.29 | MBw/s 7.60 | avq  8.61 | avio 0.13 ms

Also, during the issue perf shows [k] copy_user_generic_string in the
top positions:
 14.09%  postmaster  postgres           [.] 0x00000000001b4569
 10.25%  postmaster  [kernel.kallsyms]  [k] copy_user_generic_string
  4.15%  postmaster  postgres           [.] hash_search_with_hash_value
  2.08%  postmaster  postgres           [.] SearchCatCache
  1.79%  postmaster  postgres           [.] LWLockAcquire
  1.18%  postmaster  libc-2.12.so       [.] memcpy
  1.12%  postmaster  postgres           [.] mdnblocks
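
(For context, profiles like the one above can be captured with perf; a
minimal sketch, assuming the recovery process can be found by its
"startup process" title in ps:)

  # live system-wide view of the hottest symbols
  perf top

  # or: sample the startup (recovery) process for 30s, then inspect
  perf record -g -p $(pgrep -f 'startup process') -- sleep 30
  perf report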

The issue starts at 19:43:
Mon Mar 16 19:43:04 MSK 2015 Stream: MASTER-rdb04d:70837172337784
SLAVE:70837172314864 Replay:70837172316512 :: REPLAY 21 KBytes
(00:00:00.006225 seconds)
Mon Mar 16 19:43:09 MSK 2015 Stream: MASTER-rdb04d:70837177455624
SLAVE:70837177390968 Replay:70837176794376 :: REPLAY 646 KBytes
(00:00:00.367305 seconds)
Mon Mar 16 19:43:14 MSK 2015 Stream: MASTER-rdb04d:70837185005120
SLAVE:70837184961280 Replay:70837183253896 :: REPLAY 1710 KBytes
(00:00:00.827881 seconds)
Mon Mar 16 19:43:19 MSK 2015 Stream: MASTER-rdb04d:70837190417984
SLAVE:70837190230232 Replay:70837183253896 :: REPLAY 6996 KBytes
(00:00:05.873169 seconds)
Mon Mar 16 19:43:24 MSK 2015 Stream: MASTER-rdb04d:70837198538232
SLAVE:70837198485000 Replay:70837183253896 :: REPLAY 14926 KBytes
(00:00:11.025561 seconds)
Mon Mar 16 19:43:29 MSK 2015 Stream: MASTER-rdb04d:70837209961192
SLAVE:70837209869384 Replay:70837183253896 :: REPLAY 26081 KBytes
(00:00:16.068014 seconds)

We see [k] copy_user_generic_string climbing:

 12.90%  postmaster  [kernel.kallsyms]  [k] copy_user_generic_string
 11.49%  postmaster  postgres           [.] 0x00000000001f40c1
  4.74%  postmaster  postgres           [.] hash_search_with_hash_value
  1.86%  postmaster  postgres           [.] mdnblocks
  1.73%  postmaster  postgres           [.] LWLockAcquire
  1.67%  postmaster  postgres           [.] SearchCatCache

 25.71%  postmaster  [kernel.kallsyms]  [k] copy_user_generic_string
  7.89%  postmaster  postgres           [.] hash_search_with_hash_value
  4.66%  postmaster  postgres           [.] 0x00000000002108da
  4.51%  postmaster  postgres           [.] mdnblocks
  3.36%  postmaster  [kernel.kallsyms]  [k] put_page

The issue stops at 19:51:39:
Mon Mar 16 19:51:24 MSK 2015 Stream: MASTER-rdb04d:70838904179344
SLAVE:70838903934392 Replay:70837183253896 :: REPLAY 1680591 KBytes
(00:08:10.384679 seconds)
Mon Mar 16 19:51:29 MSK 2015 Stream: MASTER-rdb04d:70838929994336
SLAVE:70838929873624 Replay:70837183253896 :: REPLAY 1705801 KBytes
(00:08:15.428773 seconds)
Mon Mar 16 19:51:34 MSK 2015 Stream: MASTER-rdb04d:70838951993624
SLAVE:70838951899768 Replay:70837183253896 :: REPLAY 1727285 KBytes
(00:08:20.472567 seconds)
Mon Mar 16 19:51:39 MSK 2015 Stream: MASTER-rdb04d:70838975297912
SLAVE:70838975180384 Replay:70837208050872 :: REPLAY 1725827 KBytes
(00:08:10.256935 seconds)
Mon Mar 16 19:51:44 MSK 2015 Stream: MASTER-rdb04d:70839001502160
SLAVE:70839001412616 Replay:70837260116984 :: REPLAY 1700572 KBytes
(00:07:49.849511 seconds)
Mon Mar 16 19:51:49 MSK 2015 Stream: MASTER-rdb04d:70839022866760
SLAVE:70839022751184 Replay:70837276732880 :: REPLAY 1705209 KBytes
(00:07:42.307364 seconds)

And copy_user_generic_string drops back down:
+ 13.43%  postmaster  postgres           [.] 0x000000000023dc9a
+  3.71%  postmaster  [kernel.kallsyms]  [k] copy_user_generic_string
+  2.46%  init        [kernel.kallsyms]  [k] intel_idle
+  2.30%  postmaster  postgres           [.] hash_search_with_hash_value
+  2.01%  postmaster  postgres           [.] SearchCatCache

Could you clarify what type of trace you meant? GDB?
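
(If a stack trace is what you meant, a minimal sketch, assuming gdb is
installed and <pid> is the startup/recovery process, would be:)

  # grab a quick backtrace without keeping the process stopped;
  # repeat a few times to see where it is stuck
  gdb -p <pid> -batch -ex 'bt'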

To calculate the slave and apply lag I use the following script (the
first three queries go to the replica, the last one to the master):

slave_lag=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive" $p_db)
replay_lag=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay" $p_db)
replay_timediff=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT NOW() - pg_last_xact_replay_timestamp() AS replication_delay" $p_db)
master_lag=$($psql -U monitor -h$p_host -p$p_port -A -t -c "SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset" $p_db)

echo "$(date) Stream: MASTER-$p_host:$master_lag SLAVE:$slave_lag Replay:$replay_lag :: REPLAY $(bc <<< $master_lag/1024-$replay_lag/1024) KBytes (${replay_timediff} seconds)"
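
(As a side note: the receive-vs-replay part of this can also be had in
a single replica-side query, a sketch using the same 9.x functions; it
does not cover the master's own position:)

  -- on the replica: bytes received but not yet replayed, plus time lag
  SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                               pg_last_xlog_replay_location()) AS replay_lag_bytes,
         now() - pg_last_xact_replay_timestamp()               AS replay_time_lag;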

-
Best regards,
Sergey Shchukin
