Re: Postgres slave not catching up (on 9.2)

From: Ruben Domingo Gaspar Aparicio <Ruben(dot)Gaspar(dot)Aparicio(at)cern(dot)ch>
To: Robert Klemme <shortcutter(at)googlemail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "dbondemand-admin (DB On Demand administrators)" <dbondemand-admin(at)cern(dot)ch>
Subject: Re: Postgres slave not catching up (on 9.2)
Date: 2014-12-15 14:18:13
Message-ID: 6FCBD86BDC789C43AC696969EC340228010942123C@CERNXCHG44.cern.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


> > The slave (I don't have control on the master) is using 2 NFS file
> > systems, one for WALs and another one for the data, on Netapp controllers:
> >
> > dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs
> >
> (rw,remount,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,timeo=60
> > 0)
> >
> > dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs
> >
> (rw,remount,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,timeo=60
> > 0)
>
> You should use noatime to avoid unnecessary IO.
>

Just to mention that changing the mount points from:

dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs (rw, actimeo=0,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)
dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs (rw, actimeo=0,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)

to

dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs (rw,noatime,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)
dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs (rw,noatime,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)

it did have a big impact. The profile of the recovery process on terms of calls changed quite a lot:

From:

[postgres(at)itrac1202 tmp]$ strace -p 9596 -c
Process 9596 attached - interrupt to quit
Process 9596 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
78.73 0.217824 0 456855 381376 read
17.87 0.049453 0 515320 lseek
2.89 0.007989 12 669 669 poll
0.33 0.000912 25 36 open
0.07 0.000206 0 994 994 stat
0.05 0.000151 0 995 787 rt_sigreturn
0.05 0.000133 0 673 write
0.00 0.000000 0 36 close
0.00 0.000000 0 52 kill
------ ----------- ----------- --------- --------- ----------------
100.00 0.276668 975630 383826 total

To:

[postgres(at)itrac1202 tmp]$ strace -p 9596 -c
Process 9596 attached - interrupt to quit
Process 9596 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
78.73 0.217824 0 456855 381376 read
17.87 0.049453 0 515320 lseek
2.89 0.007989 12 669 669 poll
0.33 0.000912 25 36 open
0.07 0.000206 0 994 994 stat
0.05 0.000151 0 995 787 rt_sigreturn
0.05 0.000133 0 673 write
0.00 0.000000 0 36 close
0.00 0.000000 0 52 kill
------ ----------- ----------- --------- --------- ----------------
100.00 0.276668 975630 383826 total

We did also increased the shared_buffers from 12 to 24GB.

The lag has decreased most of the time:

*/10 * * * * /usr/local/pgsql/postgresql-9.2.9/bin/psql -U postgres -h /var/lib/pgsql/ -p 6600 -d puppetdb -c "SELECT now(), now() - pg_last_xact_replay_timestamp() AS time_lag" | perl -ne 'if (/\|\s+(\d{2}):(\d{2}):(\d{2})\.\d+/) {$hour=$1;$min=$2;$sec=$3; print $_;}' >> /tmp/lag929morememmount.log

...
2014-12-14 14:10:01.688947+01 | 00:00:00.096524
2014-12-14 14:20:01.798223+01 | 00:00:00.024083
2014-12-14 14:30:01.884448+01 | 00:00:00.420791
2014-12-14 14:40:01.960623+01 | 00:00:00.168318
2014-12-14 14:50:01.191487+01 | 00:00:00.163832
2014-12-14 15:00:02.146436+01 | 00:00:00.026934
2014-12-14 15:10:01.277963+01 | 00:00:00.332185
2014-12-14 15:20:01.353979+01 | 00:00:00.020616
2014-12-14 15:30:01.417092+01 | 00:00:00.584768
2014-12-14 15:40:01.575347+01 | 00:00:00.151685
2014-12-14 15:50:01.205507+01 | 00:00:00.102073
2014-12-14 16:00:01.321511+01 | 00:00:00.590677
2014-12-14 16:10:01.570474+01 | 00:00:00.182683
2014-12-14 16:20:01.640095+01 | 00:00:00.420185
2014-12-14 16:30:01.767033+01 | 00:00:00.015989
2014-12-14 16:40:01.849532+01 | 00:00:00.106296
2014-12-14 16:50:01.920876+01 | 00:00:00.258851
2014-12-14 17:00:02.000278+01 | 00:00:00.119841
2014-12-14 17:10:01.894227+01 | 00:00:00.091599
2014-12-14 17:20:01.61729+01 | 00:00:00.367367
2014-12-14 17:30:01.683326+01 | 00:00:00.103884
2014-12-14 17:40:01.755904+01 | 00:00:00.051262
2014-12-14 17:50:01.833825+01 | 00:00:00.06901
2014-12-14 18:00:01.901236+01 | 00:00:00.17467
2014-12-14 18:10:01.186283+01 | 00:00:00.214941
2014-12-14 18:20:01.145413+01 | 00:00:00.03517
2014-12-14 18:30:01.241746+01 | 00:00:00.207842
2014-12-14 18:40:01.299413+01 | 00:00:00.147878
2014-12-14 18:50:01.368541+01 | 00:00:00.393893
2014-12-14 19:00:01.430736+01 | 00:00:00.031226
2014-12-14 19:10:01.672117+01 | 00:05:03.512832
2014-12-14 19:20:01.9195+01 | 00:06:39.08761
2014-12-14 19:30:02.184486+01 | 00:00:00.307668
2014-12-14 19:40:01.227278+01 | 00:00:00.054831
2014-12-14 19:50:01.305485+01 | 00:00:00.425595
2014-12-14 20:00:01.410501+01 | 00:00:00.394526
2014-12-14 20:10:01.984196+01 | 00:00:00.388844
2014-12-14 20:20:01.031042+01 | 00:00:00.503092
2014-12-14 20:30:01.225871+01 | 00:00:00.241493
2014-12-14 20:40:01.305696+01 | 00:00:00.280656
2014-12-14 20:50:01.379617+01 | 00:00:00.151103
2014-12-14 21:00:01.468849+01 | 00:00:00.014412
2014-12-14 21:10:01.724514+01 | 00:00:00.147476
2014-12-14 21:20:01.799292+01 | 00:00:00.08696
2014-12-14 21:30:01.866336+01 | 00:00:00.035226
2014-12-14 21:40:01.942882+01 | 00:00:00.111701
2014-12-14 21:50:02.010419+01 | 00:00:00.215121
2014-12-14 22:00:01.110033+01 | 00:00:16.460612
2014-12-14 22:10:01.568286+01 | 00:00:00.077897
2014-12-14 22:20:01.682714+01 | 00:00:00.104112
2014-12-14 22:30:01.758958+01 | 00:00:00.061474
2014-12-14 22:40:01.970545+01 | 00:00:00.108613
2014-12-14 22:50:01.038908+01 | 00:00:00.039637
2014-12-14 23:00:01.120295+01 | 00:00:00.338731
2014-12-14 23:10:01.365371+01 | 00:00:00.680065
2014-12-14 23:20:01.423365+01 | 00:00:00.154614
2014-12-14 23:30:01.48998+01 | 00:00:00.014643
2014-12-14 23:40:01.569452+01 | 00:00:00.126961
2014-12-14 23:50:01.63047+01 | 00:00:00.303156
2014-12-15 00:00:01.278047+01 | 00:00:00.351391
2014-12-15 00:10:01.382566+01 | 00:00:00.012265
2014-12-15 00:20:01.444746+01 | 00:07:39.002651
2014-12-15 00:30:01.510413+01 | 00:16:13.476753
2014-12-15 00:40:01.97735+01 | 00:00:00.105011
2014-12-15 00:50:01.082528+01 | 00:01:10.313796
2014-12-15 01:00:01.124843+01 | 00:00:01.508016
2014-12-15 01:10:01.818415+01 | 00:00:00.082441
2014-12-15 01:20:01.961064+01 | 00:00:00.048221
2014-12-15 01:30:01.355472+01 | 00:00:00.37941
2014-12-15 01:40:01.42728+01 | 00:00:00.013836
2014-12-15 01:50:01.486446+01 | 00:00:00.110321
2014-12-15 02:00:01.566731+01 | 00:00:00.290281
2014-12-15 02:10:01.236574+01 | 00:01:15.954532
2014-12-15 02:20:01.440259+01 | 00:00:00.471677
2014-12-15 02:30:01.5733+01 | 00:00:00.208574
2014-12-15 02:40:01.662425+01 | 00:00:00.591091
2014-12-15 02:50:01.263385+01 | 00:00:00.050648
2014-12-15 03:00:01.340777+01 | 00:00:00.289115
2014-12-15 03:10:01.993079+01 | 00:00:00.790201
2014-12-15 03:20:01.061826+01 | 00:00:00.043176
2014-12-15 03:30:01.125639+01 | 00:00:00.172924
2014-12-15 03:40:01.252033+01 | 00:03:05.113579
2014-12-15 03:50:01.362396+01 | 00:00:00.254974
2014-12-15 04:00:01.370922+01 | 00:00:00.208254
2014-12-15 04:10:01.472816+01 | 00:00:00.077214
2014-12-15 04:20:01.553443+01 | 00:00:00.135887
2014-12-15 04:30:01.63607+01 | 00:00:00.027272
2014-12-15 04:40:01.696442+01 | 00:00:00.130954
2014-12-15 04:50:01.786961+01 | 00:00:00.572573
2014-12-15 05:00:01.790753+01 | 00:00:00.491799
2014-12-15 05:10:01.078332+01 | 00:07:58.438202 **** likely autovacuum
2014-12-15 05:20:01.139541+01 | 00:00:00.057486
2014-12-15 05:30:01.251079+01 | 00:00:00.053462
2014-12-15 05:40:01.322349+01 | 00:00:00.084701
2014-12-15 05:50:01.607937+01 | 00:00:00.205241
2014-12-15 06:00:01.699406+01 | 00:00:00.121415
2014-12-15 06:10:01.756047+01 | 00:00:00.20769
2014-12-15 06:20:01.854222+01 | 00:00:00.03397
2014-12-15 06:30:02.041054+01 | 00:03:07.271295
2014-12-15 06:40:01.891882+01 | 00:00:00.263748
2014-12-15 06:50:01.987809+01 | 00:00:00.155619
2014-12-15 07:00:01.068556+01 | 00:00:00.119866
2014-12-15 07:10:01.318478+01 | 00:00:00.092856
2014-12-15 07:20:01.704899+01 | 00:00:00.106533
2014-12-15 07:30:01.773268+01 | 00:00:00.135743
2014-12-15 07:40:01.730152+01 | 00:00:00.06358
2014-12-15 07:50:01.798179+01 | 00:00:00.529685
2014-12-15 08:00:01.868205+01 | 00:00:00.194482
2014-12-15 08:10:01.219339+01 | 00:00:00.063553
2014-12-15 08:20:01.309426+01 | 00:00:00.056698
2014-12-15 08:30:01.120431+01 | 00:00:00.425596
2014-12-15 08:40:01.201882+01 | 00:00:00.00909
2014-12-15 08:50:01.272526+01 | 00:00:00.019492
2014-12-15 09:00:01.361022+01 | 00:00:00.423997
2014-12-15 09:10:01.603702+01 | 00:00:00.066705
2014-12-15 09:20:01.682277+01 | 00:00:09.251202
2014-12-15 09:30:01.934477+01 | 00:00:00.311553
2014-12-15 09:40:02.03221+01 | 00:00:00.125678
2014-12-15 09:50:01.105372+01 | 00:00:00.294006
2014-12-15 10:00:01.201109+01 | 00:00:00.014641
2014-12-15 10:10:01.164478+01 | 00:01:51.375378
2014-12-15 10:20:01.264589+01 | 00:09:54.476361 **** likely autovacuum
2014-12-15 10:30:01.351103+01 | 00:00:00.213636
2014-12-15 10:40:01.623903+01 | 00:00:00.488103
2014-12-15 10:50:01.768132+01 | 00:00:00.080799
2014-12-15 11:00:01.880247+01 | 00:00:20.401738
2014-12-15 11:10:01.215509+01 | 00:00:00.036288
2014-12-15 11:20:01.265607+01 | 00:00:00.057142
2014-12-15 11:30:01.343731+01 | 00:00:00.036609
2014-12-15 11:40:01.41248+01 | 00:00:00.218139
2014-12-15 11:50:01.48113+01 | 00:00:00.242754
2014-12-15 12:00:01.685114+01 | 00:00:00.82528
2014-12-15 12:10:01.995243+01 | 00:02:29.971448
2014-12-15 12:20:01.962833+01 | 00:00:00.118112
2014-12-15 12:30:01.100587+01 | 00:00:00.214437
2014-12-15 12:40:01.226111+01 | 00:00:00.052599
2014-12-15 12:50:01.300061+01 | 00:00:00.162205
2014-12-15 13:00:01.4007+01 | 00:00:00.707891
2014-12-15 13:10:02.005526+01 | 00:00:00.162238
2014-12-15 13:20:01.072375+01 | 00:00:00.214978
2014-12-15 13:30:01.446005+01 | 00:00:00.121816
2014-12-15 13:40:01.483524+01 | 00:00:00.650178
2014-12-15 13:50:01.796143+01 | 00:00:00.065482
2014-12-15 14:00:01.886071+01 | 00:00:00.237577
2014-12-15 14:10:01.134148+01 | 00:00:00.193941
2014-12-15 14:20:01.199047+01 | 00:00:00.068058
2014-12-15 14:30:01.27777+01 | 00:00:00.022991
2014-12-15 14:40:01.361959+01 | 00:00:00.439753
2014-12-15 14:50:01.421515+01 | 00:00:00.037749
2014-12-15 15:00:01.500559+01 | 00:00:00.174448
2014-12-15 15:10:01.811804+01 | 00:06:09.196648 **** likely autovacuum
..

It goes up till a maximum of 25 minutes (for the last two weeks), it looks correlated with an autovacuum at the master in one of the big tables of the schema. It happens at about 5hours interval. Is there a way to avoid this ? Should I ask to the master db dba to try to have a more active autovacuum policy?

Thank you,
Ruben

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Ruben Domingo Gaspar Aparicio 2014-12-15 14:32:02 Re: Postgres slave not catching up (on 9.2)
Previous Message Graeme B. Bell 2014-12-15 13:36:45 Re: Tuning the configuration