Re: problems using pg_start_backup/pg_stop_backup and pg_basebackup at same time

From: Alex Malek <magicagent(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: problems using pg_start_backup/pg_stop_backup and pg_basebackup at same time
Date: 2016-06-20 14:15:22
Message-ID: CAGH8ccdTr_0sVpHstiR26s-v-Ns+3t9R==m9Km5mawG8JH-KWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-hackers

Thanks for the response Magnus.
I've replied inline below.

On Mon, Jun 20, 2016 at 4:17 AM, Magnus Hagander <magnus(at)hagander(dot)net>
wrote:

> On Mon, Jun 13, 2016 at 6:59 PM, Alex Malek <magicagent(at)gmail(dot)com> wrote:
>
>>
>> I am experiencing two problems
>> 1) pg_stop_backup hangs until pg_basebackup finishes
>> 2) WAL contains references to invalid pages
>>
>> (I suspect when 1. is fixed I won't experience 2. )
>>
>> Full description below:
>>
>>
>> postgresql: 9.3.13
>>
>> My situation is I create a writable report database every day using
>> pg_start_backup() / rsync / pg_stop_backup()
>> (whole process takes 1-2 hours, the rsync actually copies the data dir of
>> a slave/warm spare DB)
>>
>
> Taking what is called an exclusive backup, which is what pg_start_backup()
> does, is not supported off a standby node. Only pg_basebackup is supported
> against a standby node.
>
> Can you reproduce your errors if you make this backup from the master?
>

pg_start_backup() is run on the master.
The rsync copies the data dir of the standby b/c it is local.
I have tried rsyncing from the master w/ the same results.

>
>
>
>> Also once a week I create a database for backup/archive purposes using
>> pg_basebackup
>> (whole process takes about 13 hours)
>>
>>
>> These two processes used to be able to coincide until recently.
>> Recent changes include a major debian upgrade and a minor version of
>> postgres upgrade from 9.3.10.
>>
>> Now when the report sync occurs during the pg_basebackup (w/
>> --xlog-method=stream option) the pg_stop_backup() hangs until
>> the the pg_basebackup has completed (4 hours later).
>>
>
> Do you also have an archive_command, and what is it? Normally,
> pg_stop_backup() blocks on the archive command - so perhaps you have ended
> up with a dependency between that one and the base backup command somehow?
>
>
The archive command is:

archive_command = 'rsync -a %p postgres(at)archivedb
:/storage/postgres/9.3/main-archive/pg_xlog/%f'

I thought the archive command was working properly while pg_stop_backup was
hanging based on the timestamps of the files, but upon closer
inspection using stat I see that the files actually were not copied over
until pg_stop_backup stopped hanging.

Actually it appears that the archive_command is not executed the entire
time pg_basebackup is running.
pg_basebackup is started about 5 hours before pg_start_backup is executed.
Based on stat it looks like the wal files get written locally but do not
get rsynced until pg_basebackup is done.

>
>
>> A labeled WAL backup file is created when the pg_stop_backup() is first
>> executed and another
>> is created when the pg_basebackup completes.
>>
>>
>>
>> While the pg_stop_backup() hangs the following appears in the logs:
>>
>> 2016-06-11 07:50:45 EDT: WARNING: pg_stop_backup still waiting for all
>> required WAL segments to be archived (7680 seconds elapsed)
>> 2016-06-11 07:50:45 EDT: HINT: Check that your archive_command is
>> executing properly. pg_stop_backup can be canceled safely, but the
>> database backup will not be usable without all the WAL segments.
>>
>>
>> until eventually:
>>
>> 2016-06-11 12:59:07 EDT:LOG: duration: 26190082.497 ms statement:
>> SELECT pg_stop_backup()
>>
>> The archive command appears to work as WAL files are being archived
>> successfully with timestamps that occur while pg_stop_backup is hanging.
>>
>> -rw------- 1 postgres postgres 314 Jun 11 12:56
>> 000000030000109200000066.00000028.backup
>> -rw------- 1 postgres postgres 16M Jun 11 12:56 000000030000109300000029
>> -rw------- 1 postgres postgres 16M Jun 11 12:54 000000030000109300000028
>> -rw------- 1 postgres postgres 16M Jun 11 12:52 000000030000109300000027
>> <snip>
>> -rw------- 1 postgres postgres 16M Jun 11 06:22 0000000300001092000000F1
>> -rw------- 1 postgres postgres 335 Jun 11 05:42
>> 0000000300001092000000D4.00000028.backup
>> -rw------- 1 postgres postgres 16M Jun 11 05:42 0000000300001092000000F0
>>
>> Content of 0000000300001092000000D4.00000028.backup:
>>
>> START WAL LOCATION: 1092/D4000028 (file 0000000300001092000000D4)
>> STOP WAL LOCATION: 1092/F008EA10 (file 0000000300001092000000F0)
>> CHECKPOINT LOCATION: 1092/D4004E08
>> BACKUP METHOD: pg_start_backup
>> BACKUP FROM: master
>> START TIME: 2016-06-11 04:51:04 EDT
>> LABEL: reporting 2016-06-11 04:51:03.892804-04
>> STOP TIME: 2016-06-11 05:42:37 EDT
>>
>
> This file shows that the backup took about an hour, doesn't it? So it
> doesn't look like it was waiting?
>
> Or are you saying the wait happens *after* the .backup file has been
> dropped in the archives?
>

The .backup file gets written locally in a timely manner but doesn't get
dropped into the archive until after the
pg_stop_backup() stops hanging.

So it appears that pg_basebackup ( perhaps --xlog-method=stream ? ) is
causing the archive_command to pause which in turn
causes pg_stop_backup() to hang?

Thanks.

Alex

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Artem Tomyuk 2016-06-21 08:23:53 reuse tablespace from crashed PSQL instance
Previous Message Bert 2016-06-20 13:52:44 Re: WAL retention

Browse pgsql-hackers by date

  From Date Subject
Next Message Vik Fearing 2016-06-20 14:26:37 Re: primary_conninfo missing from pg_stat_wal_receiver
Previous Message Albe Laurenz 2016-06-20 14:09:57 Re: Bug in to_timestamp().