Re: Duplicate history file?

From: Tatsuro Yamada <tatsuro(dot)yamada(dot)tf(at)nttcom(dot)co(dot)jp>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Duplicate history file?
Date: 2021-06-01 04:03:22
Message-ID: 9bd1cc76-5fb8-6954-dce2-ab8ca56642ef@nttcom.co.jp_1
Lists: pgsql-hackers

Hi Horiguchi-san,

On 2021/05/31 16:58, Kyotaro Horiguchi wrote:
> So, I started a thread for this topic diverged from the following
> thread.
>
> https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1
>
>> So, what should we do for the user? I think we should put some notes
>> in postgresql.conf or in the documentation. For example, something
>> like this:
>
> I'm not sure about the exact configuration you have in mind, but that
> would happen on the cascaded standby in the case where the upstream
> promotes. In this case, the history file for the new timeline is
> archived twice. walreceiver triggers archiving of the new history
> file at the time of the promotion, then startup does the same when it
> restores the file from archive. Is it what you complained about?

Thank you for creating a new thread and explaining this.
We are not using cascading replication in our environment, but I think
the situation is similar. In short, when I promote a server,
archive_command fails because of the history file.

I've created a reproduction script that also sets up replication,
and I'm attaching it. (I used Robert's test.sh as a reference
when writing it. Thanks.)

The scenario (sr_test_historyfile.sh) is as follows.

#1 Start pgprimary as the primary
#2 Create standby
#3 Start pgstandby as a standby
#4 Execute archive_command
#5 Shutdown pgprimary
#6 Start pgprimary as a standby
#7 Promote pgprimary
#8 Execute archive_command again, which fails because a duplicate
history file already exists (see pgstandby.log)

Note that this may not be an appropriate procedure if you view it as a
way to recover a replication configuration. However, I'm sharing it as is
because it seems to be the procedure used in the customer's environment (PG-REX).
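
To illustrate why step #8 fails, here is a minimal sketch. It assumes the
archive_command follows the "test ! -f ... && cp" pattern shown in the
documentation. The function name archive_no_overwrite, the temporary
directories, and the history file content are made up for this example;
it is not taken from sr_test_historyfile.sh.

```shell
#!/bin/sh
# Hypothetical stand-in for:
#   archive_command = 'test ! -f ARCHIVEDIR/%f && cp %p ARCHIVEDIR/%f'
# $1 = path of the file to archive (%p), $2 = archive directory.
archive_no_overwrite() {
    src=$1
    dest=$2/$(basename "$1")
    # Refuse to archive when a file of the same name already exists.
    test ! -f "$dest" && cp "$src" "$dest"
}

workdir=$(mktemp -d)
mkdir "$workdir/pg_wal" "$workdir/arc"
# Made-up history file content, only for the demonstration.
printf '1\t0/3000000\tno recovery target specified\n' \
    > "$workdir/pg_wal/00000002.history"

# The first attempt archives the file; the second finds it already there.
first=0
archive_no_overwrite "$workdir/pg_wal/00000002.history" "$workdir/arc" || first=$?
second=0
archive_no_overwrite "$workdir/pg_wal/00000002.history" "$workdir/arc" || second=$?

echo "first=$first second=$second"
rm -rf "$workdir"
```

The second call returns non-zero even though the archived copy is
identical, and the server would treat that as an archiving failure.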


> The same workaround using the alternative archive script works for the
> case.
>
> We could check pg_wal before fetching archive, however, archiving is
> not controlled so strictly that duplicate archiving never happens and
> I think we choose possible duplicate archiving than having holes in
> archive. (so we suggest the "test ! -f" script)
>
>> ====
>> Note: If you use archive_mode=always, the archive_command on the
>> standby side should not be used "test ! -f".
>> ====
>
> It could be one workaround. However, I would suggest not to overwrite
> existing files (with a file with different content) to protect archive
> from corruption.
>
> We might need to write that in the documentation...

I think you're right. Replacing it with an alternative archive script
that includes the cmp command will resolve the error, because I confirmed
with the diff command that the two history files are identical.

=====
$ diff -s pgprimary/arc/00000002.history pgstandby/arc/00000002.history
Files pgprimary/arc/00000002.history and pgstandby/arc/00000002.history are identical
=====
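
For reference, an alternative archive script using cmp could look roughly
like the sketch below. This is only my assumption of what such a script
might be, not the one discussed upthread. The function name
archive_with_cmp and the ".tmp" suffix are hypothetical.

```shell
#!/bin/sh
# Hypothetical alternative to 'test ! -f ... && cp ...'.
# $1 = file to archive (%p), $2 = destination path in the archive.
archive_with_cmp() {
    src=$1
    dest=$2
    if [ ! -f "$dest" ]; then
        # Not archived yet: copy under a temporary name and rename, so a
        # partially written file never appears under the final name.
        cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
    else
        # Already archived: report success only when the contents are
        # identical, and never overwrite a file with different content.
        cmp -s "$src" "$dest"
    fi
}
```

With this, archiving an identical 00000002.history a second time returns
success, while a file with the same name but different content is still
rejected and the existing archive file is left untouched.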

Regarding "test ! -f",
I am wondering how many people are using the test command for
archive_command. If I remember correctly, the guide provided by
NTT OSS Center that we are using does not recommend using the test command.

Regards,
Tatsuro Yamada

Attachment               Content-Type   Size
pgprimary.log            text/plain     3.0 KB
pgstandby.log            text/plain     7.0 KB
sr_test_historyfile.sh   text/plain     2.5 KB
