Re: Hot Backup with rsync fails at pg_clog if under load

From: Chris Redekop <chris(at)replicon(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Backup with rsync fails at pg_clog if under load
Date: 2011-10-25 21:06:14
Message-ID: CAC2SuRLssGbdMBVwCwoQTJ9jt8L9oEvVPai++dR3d6EuTznw7g@mail.gmail.com
Lists: pgsql-hackers

> Chris, can you rearrange the backup so you copy the pg_control file as
> the first act after the pg_start_backup?

I tried this and it doesn't seem to make any difference. I also tried the
patch, and I can no longer reproduce the subtrans error; however, now it
starts up but never gets to the point where it'll accept connections. If I
try to do anything I always get "FATAL: the database system is starting
up"...and even if the load is removed from the primary, the standby still
never finishes "starting up". Attached below is a log of one of these
startup attempts. In my testing with the patch applied, approximately 3 in
10 attempts start up successfully and 7 in 10 go into the "db is starting
up" state. The pg_clog error is still there but seems much harder to
reproduce now; I've seen it only once since applying the patch (out of
probably 50 or 60 under-load startup attempts). It does seem to be "moody"
like that, though...it will be very difficult to reproduce for a while, and
then it will happen damn-near every time for a while...weirdness
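
A loop like the following is enough to tell the two outcomes apart when
running repeated under-load startup attempts -- psycopg2 and the DSN here
are stand-ins for illustration, not my actual test harness:

import time
import psycopg2

DSN = "host=standby dbname=postgres user=postgres"  # illustrative

def wait_for_standby(timeout_s=300, interval_s=1.0):
    """Poll until the standby accepts connections, or give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            psycopg2.connect(DSN).close()
            return True   # recovery finished; connections accepted
        except psycopg2.OperationalError:
            # while recovery is incomplete every attempt fails with
            # 'FATAL: the database system is starting up'
            time.sleep(interval_s)
    return False          # still "starting up" after the whole timeout

print("ready" if wait_for_standby() else "never left startup")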

On a bit of a side note, I've been thinking of changing my scripts so that
they perform an initial rsync prior to doing the
startbackup-rsync-stopbackup just so that the second rsync will be
faster....so that the backup is in progress for a shorter period of time, as
while it is running it will stop other standbys from starting up....this
shouldn't cause any issues eh?
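
Something along these lines -- host names, paths, rsync flags, and the use
of psycopg2 are illustrative only, not my actual script -- folding in
Simon's suggestion of copying pg_control first:

import subprocess
import psycopg2

SRC = "postgres@primary:/var/lib/postgresql/9.1/main/"   # illustrative
DST = "/var/lib/postgresql/9.1/main/"                    # illustrative

def rsync(*paths):
    # flags are stand-ins for whatever the real script uses; pg_xlog
    # contents are excluded as usual for base backups
    subprocess.check_call(["rsync", "-a", "--delete",
                           "--exclude", "pg_xlog/*", *paths])

conn = psycopg2.connect("host=primary dbname=postgres user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Pass 1: warm the destination while the primary is NOT in backup mode,
# so the in-backup window (which blocks other standbys) stays short.
rsync(SRC, DST)

cur.execute("SELECT pg_start_backup('hot_backup')")
try:
    # Copy pg_control first, immediately after pg_start_backup ...
    rsync(SRC + "global/pg_control", DST + "global/pg_control")
    # ... then the (now much smaller) second pass over everything else.
    rsync(SRC, DST)
finally:
    cur.execute("SELECT pg_stop_backup()")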

2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG: database system was
interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG: creating missing WAL
directory "pg_xlog/archive_status"
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG: entering standby mode
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 2/CF000000
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG: streaming replication
successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG: checkpoint record is at
2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG: redo record is at
2/CF000020; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG: next transaction ID:
0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG: next MultiXactId: 839;
next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG: oldest unfrozen
transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG: transaction ID wrap limit
is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG: resetting unlogged
relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG: initializing for hot
standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG: consistent recovery state
reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG: redo starts at 2/CF000020
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG: consistent state delayed
because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT: xlog redo running xacts: nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130 xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344
4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354
4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364
4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517
4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527
4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537
4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388
4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398
4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408
4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418
4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586
4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596
4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606
4634607;
subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL: the database system is
starting up
DEBUG: standby "sync_rep_test" has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL: the database system is
starting up
and so on...

On Tue, Oct 25, 2011 at 6:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>
> > What I don't understand is how this affects the CLOG. How does
> oldestActiveXID
> > factor into CLOG initialization?
>
> It is an entirely different error.
>
> Chris' clog error was caused by a file read error. The file was
> opened, we did a seek within the file and then the call to read()
> failed to return a complete page from the file.
>
> The xid shown is 22811359, which is the nextxid in the control file.
>
> So StartupClog() must have failed trying to read the clog page from disk.
>
> That isn't a Hot Standby problem or a recovery problem, nor is it
> certain it's a PostgreSQL problem.
>
> OTOH SlruPhysicalReadPage() does cope gracefully with missing clog
> files during recovery, so maybe we can think of a way to make recovery
> cope with a SLRU_READ_FAILED error gracefully also. Any ideas?
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
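
Regarding coping with SLRU_READ_FAILED: as I understand it, the existing
graceful path for a *missing* clog file during recovery is to log the
problem and substitute a zeroed page, trusting WAL redo to rebuild the
contents. A sketch of extending the same idea to short reads follows; this
is an illustration in Python rather than the actual slru.c code, and
read_slru_page and its arguments are made up for the example:

BLCKSZ = 8192  # PostgreSQL page size

def read_slru_page(path, pageno, in_recovery):
    """Return one SLRU page; during recovery, tolerate missing/short data."""
    try:
        with open(path, "rb") as f:
            f.seek(pageno * BLCKSZ)
            data = f.read(BLCKSZ)
    except FileNotFoundError:
        if not in_recovery:
            raise
        data = b""  # missing file: zero-fill below, as recovery already does
    if len(data) == BLCKSZ:
        return data
    if in_recovery:
        # Short read -- the SLRU_READ_FAILED case. Log and zero-fill,
        # trusting redo to rebuild the page: the behavior being proposed.
        print("LOG: short read of %s page %d, zero-filling" % (path, pageno))
        return data.ljust(BLCKSZ, b"\x00")
    raise IOError("could not read page %d from %s" % (pageno, path))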
