Re: Problems rebuilding slave using pg_basebackup

From: Payal Singh <payal(at)omniti(dot)com>
To: Douglas Reed <douglas(at)fsbtech(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Problems rebuilding slave using pg_basebackup
Date: 2017-11-08 11:57:18
Message-ID: CANUg7LC8oxJLEYi2BTOvqe0iM40Grt7i5GWmWxYy_OSSZcaXPA@mail.gmail.com
Lists: pgsql-admin

On Nov 8, 2017 5:59 AM, "Douglas Reed" <douglas(at)fsbtech(dot)com> wrote:

Hi

Sorry if this email was already received. I sent it originally from my
own email address but received no response from the moderator, so I
assume that it may have got caught in the filter.

We are having a number of problems when we attempt to rebuild our slave
from its master.

We have made about three attempts without success (using a proven set of
notes).

It's been rebuilt several times over the last few months, although the
time between pg_basebackup being keyed and it actually copying data can
be up to six minutes.

Try setting the checkpoint mode to fast in the pg_basebackup command
(-c fast) so that it requests an immediate checkpoint instead of waiting
for a spread checkpoint to complete before the base backup begins.
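
As a rough sketch, the command from your message with a fast checkpoint
requested might look like the following. The host name and the DRY_RUN
switch are placeholders of mine, not anything from your setup; with
DRY_RUN=1 (the default here) the script only prints the command.

```shell
#!/bin/sh
# Sketch only: PRIMARY_HOST and DRY_RUN are illustrative placeholders.
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.com}"
DATA_DIR="${DATA_DIR:-/var/lib/pgsql/9.5/data}"
DRY_RUN="${DRY_RUN:-1}"

basebackup_fast() {
    # -c fast asks the primary for an immediate checkpoint rather than
    # a spread one, so the copy can start sooner.
    set -- pg_basebackup -h "$PRIMARY_HOST" -D "$DATA_DIR" -P -U postgres \
        --xlog-method=stream -c fast
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # dry run: show the command only
    else
        "$@"           # real run: execute pg_basebackup
    fi
}

basebackup_fast
```

Note that a fast checkpoint causes a burst of I/O on the primary, so it
is a trade-off on a busy system.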

And after completion, the time taken from database startup to psql
availability can also be several minutes while it processes any
remaining logs.

Depending on how busy your primary is, this is expected. Approximately
what is the WAL generation rate for your database?
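
One way to estimate it is to sample pg_current_xlog_location() on the
primary twice, some seconds apart, and divide the byte difference by the
interval (pg_xlog_location_diff() can do the subtraction server-side).
Here is a sketch of the arithmetic in shell; the two sample LSNs and the
60 second interval are made-up values, not taken from your system.

```shell
#!/bin/sh
# Convert an LSN such as "A/10000000" (high/low 32-bit hex halves)
# into an absolute byte position.
lsn_to_bytes() (
    hi=${1%/*}
    lo=${1#*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
)

# Average WAL bytes per second between two LSN samples.
# Arguments: start LSN, end LSN, seconds between the samples.
wal_rate() (
    start=$(lsn_to_bytes "$1")
    end=$(lsn_to_bytes "$2")
    echo $(( (end - start) / $3 ))
)

# Made-up samples taken 60 seconds apart
wal_rate A/10000000 A/20000000 60
```

Comparing the result against wal_segment_size (16MB) tells you roughly
how many segments per minute the standby has to catch up on.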

Both machines are virtual and are hosted with a leading cloud provider.

Have you checked performance metrics like I/O, CPU load, etc.? Usually
you will be able to view some basic metrics out of the box.

OS Linux Centos6 (6.8 Final)

pg version 9.5.4

Quite a few pg_basebackup bugs were fixed in the later minor versions,
especially 9.5.6:

Fix pg_basebackup's rate limiting in the presence of slow I/O (Antonin
Houska)

Fix possible pg_basebackup failure on standby server when including WAL
files (Amit Kapila, Robert Haas)

https://www.postgresql.org/docs/9.5/static/release-9-5-6.html

I always recommend keeping the minor version up to date (9.5.9 is the
latest), since it just needs a quick restart of the database. I won't be
surprised if this alone fixes your issue.

pg WAL settings on the master database

max_wal_senders = 5
max_wal_size = 4GB
min_wal_size = 256MB
wal_block_size = 8192
wal_buffers = 1MB
wal_compression = off
wal_keep_segments = 32
wal_level = hot_standby
wal_log_hints = off
wal_receiver_status_interval = 10s
wal_receiver_timeout = 1min
wal_retrieve_retry_interval = 5s
wal_segment_size = 16MB
wal_sender_timeout = 1min
wal_sync_method = fdatasync
wal_writer_delay = 200ms

Message from pg_basebackup

[postgres(at)xxxxxxxxxx]$ pg_basebackup -h -IP_HIDDEN- -D /var/lib/pgsql/9.5/data -P -U postgres --xlog-method=stream
pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
269061959/269164935 kB (99%), 1/1 tablespace
pg_basebackup: child process exited with error 1

Relevant error messages from master's log

Nov 7 11:52:32 o8-data1 postgres[28558]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=41498
Nov 7 11:52:32 o8-data1 postgres[28558]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: replication connection authorized: user=postgres
Nov 7 13:51:44 o8-data1 postgres[28558]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: could not send data to client: Broken pipe
Nov 7 13:51:44 o8-data1 postgres[28558]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR: base backup could not send data, aborting backup
Nov 7 13:51:44 o8-data1 postgres[28558]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL: connection to client lost
Nov 7 13:51:44 o8-data1 postgres[28558]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: disconnection: session time: 1:59:11.943 user=postgres database= host=-IP_HIDDEN- port=41498

Nov 7 13:54:48 o8-data1 postgres[35445]: [6-1] user=[unknown],db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: connection received: host=-IP_HIDDEN- port=44040
Nov 7 13:54:48 o8-data1 postgres[35445]: [7-1] user=postgres,db=[unknown],app=[unknown]client=-IP_HIDDEN- LOG: replication connection authorized: user=postgres
Nov 7 15:09:20 o8-data1 postgres[35445]: [8-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: could not send data to client: Broken pipe
Nov 7 15:09:20 o8-data1 postgres[35445]: [9-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- ERROR: base backup could not send data, aborting backup
Nov 7 15:09:20 o8-data1 postgres[35445]: [10-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- FATAL: connection to client lost
Nov 7 15:09:20 o8-data1 postgres[35445]: [11-1] user=postgres,db=[unknown],app=pg_basebackupclient=-IP_HIDDEN- LOG: disconnection: session time: 1:14:31.925 user=postgres database= host=-IP_HIDDEN- port=44040

Many thanks in advance

--
Douglas Reed
DBA
FSB Technology

What are your archive_command and full_page_writes set to? Also, what is
the value of checkpoint_timeout? (checkpoint_segments no longer exists in
9.5; max_wal_size, which you have at 4GB, replaced it.)

Try increasing wal_sender_timeout before running pg_basebackup.
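
A minimal sketch of doing that on the master without a restart, assuming
superuser access through psql. The 10min value and the DRY_RUN switch are
my own placeholders; pick a timeout that suits your network.

```shell
#!/bin/sh
# Sketch only: NEW_TIMEOUT and DRY_RUN are illustrative assumptions.
DRY_RUN="${DRY_RUN:-1}"
NEW_TIMEOUT="${NEW_TIMEOUT:-10min}"

run_sql() {
    # Print the psql invocation in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "psql -U postgres -c \"$1\""
    else
        psql -U postgres -c "$1"
    fi
}

# wal_sender_timeout takes effect on a configuration reload.
# Two separate calls are used because psql 9.5 accepts only one -c.
run_sql "ALTER SYSTEM SET wal_sender_timeout = '$NEW_TIMEOUT'"
run_sql "SELECT pg_reload_conf()"
```

Once the rebuild has finished, ALTER SYSTEM RESET wal_sender_timeout
followed by another reload puts the previous setting back.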

Also, if you are sending/storing WAL files anywhere besides the master,
then once your pg_basebackup command fails, try copying the missing files
manually to the path given in the restore_command parameter in the
secondary's recovery.conf.

A --slot option was added to pg_basebackup in 9.6 so that the command
using -X stream can connect to the replication slot used by the secondary
on the master, to make sure no WAL files go missing.
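
For reference, a sketch of what that could look like after an upgrade to
9.6 or later (it does not apply to 9.5.4). The slot name, host, data
directory and DRY_RUN switch are placeholders of mine, and the slot is
assumed to exist already on the master, for instance created with
pg_create_physical_replication_slot().

```shell
#!/bin/sh
# Sketch for 9.6 or later only. SLOT_NAME, PRIMARY_HOST, DATA_DIR and
# DRY_RUN are illustrative placeholders, not values from this thread.
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.com}"
DATA_DIR="${DATA_DIR:-/path/to/standby/data}"
SLOT_NAME="${SLOT_NAME:-standby1_slot}"
DRY_RUN="${DRY_RUN:-1}"

basebackup_with_slot() {
    # -S is only valid together with -X stream; the slot makes the
    # master retain WAL until the streaming connection has received it.
    set -- pg_basebackup -h "$PRIMARY_HOST" -D "$DATA_DIR" -P -U postgres \
        -X stream -S "$SLOT_NAME"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # dry run: show the command only
    else
        "$@"           # real run: execute pg_basebackup
    fi
}

basebackup_with_slot
```

Keep in mind that an unused slot makes the master retain WAL
indefinitely, so drop it if the secondary is ever retired.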
