Re: During promotion, new master tries to archive same segment twice

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Phil Endecott <spam_from_pgsql_lists(at)chezphil(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: During promotion, new master tries to archive same segment twice
Date: 2018-08-15 20:55:05
Message-ID: 5e8344fd-e736-9439-9a10-eb8838e51c03@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 08/15/2018 01:25 PM, Phil Endecott wrote:
> Dear Experts,
>
> Here is my latest issue with replication:
>
> I have 3 systems, X, Y and Z. Initially X is replicated to Y
> and Y is replicated to Z; in each case the replication involves
> log-shipping using archive_command and restore_command (via a
> 4th system called "backup") and then streaming.
>
> The plan is to shut down X and to promote Y to be the new master.
>
> I shut down X, and as expected Y complains that it con no longer
> connect to it:
>
> 2018-08-15 15:10:58.785 UTC [617] LOG: replication terminated by primary server
> 2018-08-15 15:10:58.785 UTC [617] DETAIL: End of WAL reached on timeline 1 at 7/E9000098.
> 2018-08-15 15:10:58.785 UTC [617] FATAL: could not send end-of-streaming message to primary: no COPY in progress
> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
> 2018-08-15 15:10:59.928 UTC [354] LOG: invalid resource manager ID 22 at 7/E9000098
> 2018-08-15 15:11:00.092 UTC [5856] FATAL: could not connect to the primary server: FATAL: the database system is shutting down
>
> Note that in the middle of that, after streaming has failed it
> tries to fetch the next WAL segment, E9, from the backup; this
> fails because this segment doesn't exist.
>
> Then I pg_ctl promote on Y:
>
> 2018-08-15 15:12:11.888 UTC [354] LOG: received promote request
> 2018-08-15 15:12:11.888 UTC [354] LOG: redo done at 7/E9000028
> 2018-08-15 15:12:11.888 UTC [354] LOG: last completed transaction was at log time 2018-08-15 14:45:05.961153+00
> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory
> scp: backup/postgresql/archivedir/00000002.history: No such file or directory
> 2018-08-15 15:12:13.316 UTC [354] LOG: selected new timeline ID: 2
> 2018-08-15 15:12:13.368 UTC [354] FATAL: could not open file "recovery.conf": Permission denied
>
> So in the process of exiting recovery mode it tries to rename
> recovery.conf to recovery.done, but it fails to do so because
> my recovery.conf was owned by root, not user postgres. Oooops.
> Perhaps it would be appropriate to check for that when it
> initially reads recovery.conf? Anyway, I fix the permissions
> and restart it. It starts in recovery mode and cannot connect
> to the master, as expected.
>
> 2018-08-15 15:16:58.061 UTC [6036] LOG: restored log file "0000000100000007000000E8" from archive
> 2018-08-15 15:16:58.108 UTC [6036] LOG: redo starts at 7/E8678300
> 2018-08-15 15:16:58.731 UTC [6087] LOG: archive command failed with exit code 1
> 2018-08-15 15:16:58.731 UTC [6087] DETAIL: The failed archive command was: ssh backup test ! -f backup/postgresql/Y/0000000100000007000000E8 && scp pg_xlog/0000000100000007000000E8 backup:backup/postgresql/Y/0000000100000007000000E8
> scp: backup/postgresql/archivedir/0000000100000007000000E9: No such file or directory

The above is not clear to me. My best guess:

1) The ssh backup ... is archiving WAL's from Y to Y specific archive on
'backup'?

2) The scp: backup ... is the restore_command fetching archived WAL's
that originated on X and where sent to backup/postgresql/archivedir/ ?

3) If 2) is true why is it showing up in the error for the archive command?

Can you show us your actual archive and restore commands?

More below.

> 2018-08-15 15:16:58.823 UTC [6036] LOG: consistent recovery state reached at 7/E9000098
> 2018-08-15 15:16:58.823 UTC [6036] LOG: invalid resource manager ID 22 at 7/E9000098
> 2018-08-15 15:16:58.824 UTC [6035] LOG: database system is ready to accept read only connections
> 2018-08-15 15:16:58.855 UTC [6094] FATAL: could not connect to the primary server: could not connect to server: Connection refused
>

>
> I can now connect to the db and do read/write operations, but it
> continues to try and fail to re-write segment E8 to its archive.
>
> Eventually I manually delete E8 from the archive; it then
> succeeds at writing it with the same content as before and seems to
> continue to function normally.

Which archive?

>
> Questions:
>
> - Can all of this be blamed on the failure of the first attempt
> to promote due to the wrong permissions on recovery.conf?
>
> - Should my archive_command detect the case where it is asked to
> write the same file again with the same contents, and report
> success in that case?
>
> - Is this a bug?
>
>
> Thanks,
>
> Phil.
>
>
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2018-08-15 20:57:58 Re: How to revoke privileged from PostgreSQL's superuser
Previous Message Evan Rempel 2018-08-15 20:52:43 Re: How to revoke privileged from PostgreSQL's superuser