Re: Strange replication problem - segment restored from archive but still requested from master

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Piotr Gasidło <quaker(at)barbara(dot)eu(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Strange replication problem - segment restored from archive but still requested from master
Date: 2015-05-22 04:55:40
Message-ID: CAHGQGwE5dq7Ui9P6_ftnncdyWm5SWmxgxZuae5ZnNRk6hmFuoA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Fri, May 22, 2015 at 5:52 AM, Piotr Gasidło <quaker(at)barbara(dot)eu(dot)org> wrote:
> Got strange problem. Unable to repeat, but got logs.
>
> Simple master-slave using streaming replication.
> Master is running. Slave is down.
> Segment 0000000400004C4D00000090 was successfully archived and send
> from master to slave.
>
> Now I've started slave, and:
>
> ay 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST
> @ 50645 LOG: database system was shut down in recovery at
> 2015-05-21 21:22:03 CEST
> May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST
> @ 50645 LOG: entering standby mode
> May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST
> @ 50645 LOG: restored log file "0000000400004C4D00000088" from
> archive
> May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST
> @ 50645 LOG: redo starts at 4C4D/88493B50
> May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST
> @ 50645 LOG: restored log file "0000000400004C4D00000089" from
> archive
> May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST
> @ 50645 LOG: restored log file "0000000400004C4D0000008A" from
> archive
> May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST
> @ 50645 LOG: restored log file "0000000400004C4D0000008B" from
> archive
> May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376
> CEST @ 50645 LOG: restored log file "0000000400004C4D0000008C" from
> archive
> May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690
> CEST @ 50645 LOG: restored log file "0000000400004C4D0000008D" from
> archive
> May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855
> CEST @ 50645 LOG: restored log file "0000000400004C4D0000008E" from
> archive
> May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275
> CEST @ 50645 LOG: restored log file "0000000400004C4D0000008F" from
> archive
> May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654
> CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from
> archive
> May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222
> CEST @ 50645 LOG: consistent recovery state reached at
> 4C4D/90FFF9C8
> May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST
> @ 50644 LOG: database system is ready to accept read only
> connections
> May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223
> CEST @ 50645 LOG: unexpected pageaddr 4C46/E000000 in log segment
> 0000000400004C4D00000091, offset 0
> May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST
> @ 50699 LOG: started streaming WAL from primary at 4C4D/90000000 on
> timeline 4
> May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST
> @ 50699 FATAL: could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000400004C4D00000090 has already been
> removed
> May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255
> CEST @ 50645 LOG: restored log file "0000000400004C4D00000090" from
> archive
> May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST
> @ 50703 LOG: started streaming WAL from primary at 4C4D/90000000 on
> timeline 4
> May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST
> @ 50703 FATAL: could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000400004C4D00000090 has already been
> removed
> May 21 21:23:40 d8 postgres[50703]: [5-2]
> ...
> (and so on)
>
> So, as I understand:
> - slave was started and entered restore,
> - slave restored 0000000400004C4D00000090 from archive, reached
> consistent recovery state
> - now, it connected to master and noticed, that, it has
> 0000000400004C4D00000091 segment uncomplete
> - and then, started yelling about missing segment (master deleted it
> already after archiving) on master
>
> Why?

Thanks for the report! This seems to be a bug.

This problem happens when WAL record is stored in separate two WAL files and
there is no valid latter WAL file in the standby. In your case, the former file
is 0000000400004C4D00000090 and the latter is 0000000400004C4D00000091.

In this case, the first half of WAL record can be read from the former WAL file,
but the remaining half not because no valid latter file exists in the standby.
Then the standby tries to retrieve the latter WAL file via replication.
The problem here is that the standby tries to start the replication from the
starting point of WAL record, i.e., that's the location of the former WAL file.
So the already-read WAL file is requested via replication.

To address this problem, maybe we can just use the location where the remaining
half of WAL record starts from as the replication starting location. We would
need to look at carefully the side effect of that change, though.

Regards,

--
Fujii Masao

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Piotr Gasidło 2015-05-22 08:08:35 Re: Strange replication problem - segment restored from archive but still requested from master
Previous Message Adrian Klaver 2015-05-22 04:09:05 Re: date with month and year

Browse pgsql-hackers by date

  From Date Subject
Next Message Christian Ullrich 2015-05-22 06:44:10 Re: hstore_plpython regression test does not work on Python 3
Previous Message Matthew Kelly 2015-05-22 03:53:32 Add a hint when postgresql.conf hot_standby = 'off' and recovery.conf standby = 'on'