Re: A failure of standby to follow timeline switch

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: A failure of standby to follow timeline switch
Date: 2020-12-24 06:33:04
Message-ID: 697adab0-a3fe-e1cb-436b-3a8eaa9a2266@oss.nttdata.com
Lists: pgsql-hackers

On 2020/12/09 17:43, Kyotaro Horiguchi wrote:
> Hello.
>
> We found a behavioral change (which seems to be a bug) in recovery in
> PG13.
>
> The following steps might seem somewhat strange, but the replication
> code deliberately copes with this case. This is a sequence seen while
> operating an HA cluster using Pacemaker.
>
> - Run initdb to create a primary.
> - Set archive_mode=on on the primary.
> - Start the primary.
>
> - Create a standby using pg_basebackup from the primary.
> - Stop the standby.
> - Stop the primary.
>
> - Put standby.signal on the primary, then start it.
> - Promote the primary.
>
> - Start the standby.
>
>
> Until PG12, the primary signals end-of-timeline to the standby and
> switches to the next timeline. Since PG13, that doesn't happen and
> the standby continues to request the segment of the older
> timeline, which no longer exists.
>
> FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000003 has already been removed
>
> This is because WalSndSegmentOpen() can fail to detect a timeline switch
> on a historic timeline, due to the use of a wrong variable for that
> check. It uses state->seg.ws_segno, which seems to be a thinko from when
> the surrounding code was refactored in 709d003fbd.
>
> The first patch detects the wrong behavior. The second small patch
> fixes it.
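
For readers following along, the check described above can be sketched in isolation. This is only an illustration, not the PostgreSQL source: ReaderState, SendContext, choose_tli_buggy(), choose_tli_fixed() and wal_file_name() are hypothetical names, the 16MB segment size is assumed, and it is an assumption here that the intended comparison is against the segment being opened rather than the one held in the reader state.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t SegNo;
typedef uint32_t TimeLineID;

/* Hypothetical stand-in for the reader state (state->seg.ws_segno). */
typedef struct
{
    SegNo       opened_segno;   /* segment opened before this call */
} ReaderState;

/* Hypothetical stand-in for the walsender's per-stream variables. */
typedef struct
{
    TimeLineID  send_tli;       /* timeline currently being streamed */
    TimeLineID  next_tli;       /* timeline that follows it */
    bool        is_historic;    /* true if send_tli has already ended */
    SegNo       end_segno;      /* segment containing the switch point */
} SendContext;

/* Buggy variant: compares the previously opened segment, not the new one. */
static TimeLineID
choose_tli_buggy(const ReaderState *state, const SendContext *ctx,
                 SegNo next_segno)
{
    (void) next_segno;          /* the requested segment is ignored */
    if (ctx->is_historic && state->opened_segno == ctx->end_segno)
        return ctx->next_tli;
    return ctx->send_tli;
}

/* Assumed fix: compare the segment that is about to be opened. */
static TimeLineID
choose_tli_fixed(const ReaderState *state, const SendContext *ctx,
                 SegNo next_segno)
{
    (void) state;
    if (ctx->is_historic && next_segno == ctx->end_segno)
        return ctx->next_tli;
    return ctx->send_tli;
}

/* Build a 24-character WAL file name, assuming 16MB segments. */
static void
wal_file_name(char *buf, size_t len, TimeLineID tli, SegNo segno)
{
    const SegNo segs_per_xlogid = 256;  /* 4GB / 16MB */

    snprintf(buf, len, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}
```

With opened_segno = 2, end_segno = 3, send_tli = 1, next_tli = 2 and a request for segment 3, the buggy variant stays on timeline 1 and names 000000010000000000000003, the file from the error above, while the assumed fix names the same segment on timeline 2.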

Thanks for reporting this! This looks like a bug.

When I applied the two patches to the master branch and
ran "make check-world", I got the following error.

============== creating database "contrib_regression" ==============
# Looks like you planned 37 tests but ran 36.
# Looks like your test exited with 255 just after 36.
t/001_stream_rep.pl ..................
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/37 subtests
...
Test Summary Report
-------------------
t/001_stream_rep.pl (Wstat: 65280 Tests: 36 Failed: 0)
Non-zero exit status: 255
Parse errors: Bad plan. You planned 37 tests but ran 36.
Files=21, Tests=239, 302 wallclock secs ( 0.10 usr 0.05 sys + 41.69 cusr 39.84 csys = 81.68 CPU)
Result: FAIL
make[2]: *** [check] Error 1
make[1]: *** [check-recovery-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
t/070_dropuser.pl ......... ok

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
