Hot Backup with rsync fails at pg_clog if under load

From: Linas Virbalas <linas(dot)virbalas(at)continuent(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: "daniel(at)heroku(dot)com" <daniel(at)heroku(dot)com>
Subject: Hot Backup with rsync fails at pg_clog if under load
Date: 2011-09-21 14:44:30
Message-ID: CA9FD2FE.1D8D2%linas.virbalas@continuent.com
Lists: pgsql-hackers

Hello,

* Context *

I'm observing problems with provisioning a standby from the master by
following a basic and documented "Making a Base Backup" [1] procedure with
rsync if, in the meantime, heavy load is applied to the master.

After searching the archives, the only similar issue I found that was
discussed at any length was reported by Daniel Farina in the thread "hot
backups: am I doing it wrong, or do we have a problem with pg_clog?" [2].
It seems that the issue was dismissed because of the non-standard backup
procedure Daniel used.
However, I'm observing the same error with a simple procedure, hence this
message.

* Details *

Procedure:

1. Start load generator on the master (WAL archiving enabled).
2. Prepare a Streaming Replication standby (accepting WAL files too):
2.1. pg_switch_xlog() on the master;
2.2. pg_start_backup('backup_under_load') on the master (this will take a
while as master is loaded up);
2.3. rsync data/global/pg_control to the standby;
2.4. rsync all other data/ (without pg_xlog) to the standby;
2.5. pg_stop_backup() on the master;
2.6. Wait to receive all WAL files, generated during the backup, on the
standby;
2.7. Start the standby PG instance.
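
Steps 2.1 to 2.5 can be sketched as a command list. This is only an
illustration under assumptions: the data directory path and the standby
host name are hypothetical, psql is assumed to connect to the master with
default settings, and the rsync options are one plausible choice rather
than the exact ones used here. The function only builds and prints the
commands; it runs nothing.

```python
import shlex


def base_backup_commands(data_dir, standby_host, label="backup_under_load"):
    """Return the commands for steps 2.1-2.5 in order, without running them."""
    psql = ["psql", "-At", "-c"]           # unaligned, tuples-only output
    dest = f"{standby_host}:{data_dir}/"   # same path on the standby (assumption)
    return [
        # 2.1: force a switch to a new WAL segment
        psql + ["SELECT pg_switch_xlog();"],
        # 2.2: mark the start of the base backup
        psql + [f"SELECT pg_start_backup('{label}');"],
        # 2.3: copy pg_control first
        ["rsync", "-a", f"{data_dir}/global/pg_control", f"{dest}global/"],
        # 2.4: copy everything else, skipping the contents of pg_xlog
        ["rsync", "-a", "--exclude=/pg_xlog/*", f"{data_dir}/", dest],
        # 2.5: mark the end of the base backup
        psql + ["SELECT pg_stop_backup();"],
    ]


if __name__ == "__main__":
    # Hypothetical path and host, for illustration only.
    for cmd in base_backup_commands("/opt/PostgreSQL/9.1/data", "standby-host"):
        print(shlex.join(cmd))
```

Steps 2.6 and 2.7 (waiting for the archived WAL files and starting the
standby) depend on the local archiving setup and are left out.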

Starting the standby (the last step) usually fails with an error similar to
this:

2011-09-21 13:41:05 CEST LOG: database system was interrupted; last known
up at 2011-09-21 13:40:50 CEST
Restoring 00000014.history
mv: cannot stat `/opt/PostgreSQL/9.1/archive/00000014.history': No such file
or directory
Restoring 00000013.history
2011-09-21 13:41:05 CEST LOG: restored log file "00000013.history" from
archive
2011-09-21 13:41:05 CEST LOG: entering standby mode
Restoring 0000001300000006000000DC
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DC"
from archive
Restoring 0000001300000006000000DB
2011-09-21 13:41:05 CEST LOG: restored log file "0000001300000006000000DB"
from archive
2011-09-21 13:41:05 CEST FATAL: could not access status of transaction
1188673
2011-09-21 13:41:05 CEST DETAIL: Could not read from file "pg_clog/0001" at
offset 32768: Success.
2011-09-21 13:41:05 CEST LOG: startup process (PID 13819) exited with exit
code 1
2011-09-21 13:41:05 CEST LOG: aborting startup due to startup process
failure
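
As a cross-check of the FATAL message, the transaction ID can be mapped to
the pg_clog file and offset it names. The sketch below assumes the default
8 kB block size and the layout constants from the PostgreSQL source (2
status bits per transaction, so 4 transactions per byte, and 32 pages per
segment file). The function name is made up for this example.

```python
CLOG_XACTS_PER_BYTE = 4        # 2 status bits per transaction
BLCKSZ = 8192                  # default block size (assumption)
SLRU_PAGES_PER_SEGMENT = 32    # pages per pg_clog segment file


def clog_location(xid):
    """Return (segment file, byte offset of the page) holding xid's status."""
    xacts_per_page = BLCKSZ * CLOG_XACTS_PER_BYTE      # 32768 per page
    page = xid // xacts_per_page
    segment = page // SLRU_PAGES_PER_SEGMENT
    offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return f"pg_clog/{segment:04X}", offset


print(clog_location(1188673))  # ('pg_clog/0001', 32768)
```

Transaction 1188673 falls on page 36, which is page 4 of segment 0001, so
the file and offset in the log are consistent with the transaction ID. The
trailing "Success" most likely means the read came back short without an
errno being set, which would fit a copied pg_clog/0001 that is shorter than
the page being requested.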

The procedure works very reliably if there is little or no load on the
master, but fails very often with the pg_clog error when the load generator
(a few thousand SELECTs, ~60 INSERTs, ~60 DELETEs and ~60 UPDATEs per
second) is running.

I assumed that a file system backup taken between pg_start_backup and
pg_stop_backup is guaranteed to be consistent, and that any missing pieces
will be recovered from the WAL files generated and shipped during the
backup. Is that really the case?

Is this procedure missing some steps? Or is this maybe a known issue?

Thank you,
Linas

[1] http://www.postgresql.org/docs/current/static/continuous-archiving.html
[2] http://archives.postgresql.org/pgsql-hackers/2011-04/msg01132.php
