Re: Bug in Physical Replication Slots (at least 9.5)?

From: Jonathon Nelson <jdnelson(at)dyn(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-13 14:47:06
Message-ID: CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote:

> We think we have discovered a bug in the physical replication slots
> functionality in PostgreSQL 9.5.
> We've seen the behavior across Operating Systems (CentOS-7 and openSUSE
> LEAP 42.1), filesystems (ext4 and xfs), and versions (9.5.3 and 9.5.4). All
> were on x86_64.
>
> We notice that if we stop and then re-start the *standby*, upon restart it
> will - sometimes - request a WAL file that the master no longer has.
>

I hate to largely re-quote my entire wall-of-text email/bug report, but
there were no responses to this (to be fair, it was at the end of a month
of US holidays, etc...).

Is there more information I should provide? Can I get this added to some
sort of official bug list (it doesn't have a bug number)?

Any help or advice here would be appreciated.

> First, the postgresql configuration differs only minimally from the stock
> config:
>
> Assume wal_keep_segments = 0.
> Assume the use of physical replication slots.
> Assume one master, one standby.
>
> Lastly, we have observed the behavior "in the wild" at least twice and in
> the lab a dozen or so times.
>
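For concreteness, the three assumptions above amount to an otherwise stock
streaming-replication pair. A minimal sketch of that kind of setup (the
hostname, slot name, and sender/slot limits below are placeholders, not our
actual values) might look like:

    # master postgresql.conf (only the settings relevant here)
    wal_level = hot_standby
    max_wal_senders = 5
    max_replication_slots = 5
    wal_keep_segments = 0         # retention left entirely to the slot

    # standby recovery.conf (9.5-era), after creating the slot on the master
    # with: SELECT pg_create_physical_replication_slot('standby_slot');
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com user=replication'
    primary_slot_name = 'standby_slot'

With wal_keep_segments = 0, WAL retention on the master is governed entirely
by the slot's restart_lsn, which is why the behavior described below is fatal
for the standby.
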
> EXAMPLE #1 (logs are from the replica):
>
> user=,db=,app=,client= DEBUG: creating and filling new WAL file
> user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000 apply 5/748425A0
> user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000 apply 5/74843020
> <control-c here>
> user=,db=,app=,client= DEBUG: postmaster received signal 2
> user=,db=,app=,client= LOG: received fast shutdown request
> user=,db=,app=,client= LOG: aborting any active transactions
>
> And, upon restart:
>
> user=,db=,app=,client= LOG: restartpoint starting: xlog
> user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on timeline 1
> user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on timeline 1
> user=,db=,app=,client= CONTEXT: writing block 589 of relation base/13294/16501
> user=,db=,app=,client= LOG: invalid magic number 0000 in log segment 00000001000000060000008B, offset 0
> user=,db=,app=,client= DEBUG: switched WAL source from archive to stream after failure
> user=,db=,app=,client= LOG: started streaming WAL from primary at 6/8A000000 on timeline 1
> user=,db=,app=,client= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000060000008A has already been removed
>
> A physical analysis shows that the WAL file 00000001000000060000008B is
> 100% zeroes (ASCII NUL).
>
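One way to reproduce that physical check, as a hypothetical standalone helper
rather than anything shipped with PostgreSQL, is a short scan of the segment
for any non-zero byte:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <wal-segment-file>\n", argv[0]);
            return 2;
        }

        FILE *f = fopen(argv[1], "rb");
        if (f == NULL)
        {
            perror("fopen");
            return 2;
        }

        long offset = 0;
        int  c;
        while ((c = fgetc(f)) != EOF)
        {
            if (c != 0)
            {
                /* report the first non-zero byte and stop */
                printf("non-zero byte 0x%02x at offset %ld\n", c, offset);
                fclose(f);
                return 1;
            }
            offset++;
        }

        printf("file is entirely zero bytes (%ld bytes read)\n", offset);
        fclose(f);
        return 0;
    }
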
> The results of querying pg_replication_slots show a restart_lsn that
> matches ….6/8B.
>
> Pg_controldata shows values like:
> Minimum recovery ending location: 6/8Axxxxxx
>
> How can the master show a position that is greater than the minimum
> recovery ending location?
>
> EXAMPLE #2:
>
> Minimum recovery ending location: 19DD/73FFFFE0
> Log segment 00000001000019DD00000073 was not available.
> The restart LSN was 19DD/74000000.
> The last few lines from pg_xlogdump 00000001000019DD00000073:
>
>
> rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16403/150017028 blk 1832
> rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0: rel 1663/16403/150017028 blk 11709
>
> If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the
> offset of the first byte of the last record in this file, and the record
> length is 64 bytes, which places the first byte of the next record at
> 16777184 (0xffffe0) (logical position 0x73ffffe0; this jibes with
> pg_controldata).
>
> However, there are only 32 bytes of file left:
> 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> 16777216 - 16777184 -=> 32
>
> This means that the next record is in the WAL file
> 00000001000019DD00000074.
>
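To make the arithmetic above easier to check, here is a small standalone C
sketch (not PostgreSQL code) that redoes it, assuming the default 16 MB WAL
segment size:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        const uint64_t seg_size  = 16 * 1024 * 1024;   /* default XLogSegSize (16 MB) */
        const uint64_t seg_start = 0x73000000;         /* low half of segment ...0073 */
        const uint64_t last_rec  = 0x73FFFFA0;         /* last record shown by pg_xlogdump */
        const uint64_t rec_len   = 64;                 /* its rec/tot length */

        uint64_t next_rec   = last_rec + rec_len;      /* 0x73FFFFE0 */
        uint64_t offset     = next_rec - seg_start;    /* 16777184 */
        uint64_t bytes_left = seg_size - offset;       /* 32 */

        printf("next record at 0x%" PRIX64 " (offset %" PRIu64 "), %" PRIu64 " bytes left\n",
               next_rec, offset, bytes_left);
        return 0;
    }

Only 32 bytes remain after 0x73ffffe0, which is why, as noted above, the
record that logically follows lives in segment 00000001000019DD00000074.
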
> A possible theory:
>
> Let us assume PG has applied 100% of the data in a given WAL file, and
> let’s assume (as in this case) that the WAL file is
> 00000001000019DD00000073. When it starts up again, it uses the control
> data to determine where to start and says “the next record is at
> 19DD/0x73ffffe0”, which it truncates to 0x73000000. However, PG has
> *also* already told the master that it has fully received, written, and
> flushed all of the data for that WAL file, so the master has 0x74000000
> as the start position (and has consequently removed the WAL file for
> 0x73). The relationship between pg_controldata and
> pg_replication_slots.restart_lsn seems to be very slightly (but
> importantly) at odds.
>
> Could it be this part of the code?
>
> From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
>
> 235     /*
> 236      * We always start at the beginning of the segment. That prevents a broken
> 237      * segment (i.e., with no records in the first half of a segment) from
> 238      * being created by XLOG streaming, which might cause trouble later on if
> 239      * the segment is e.g archived.
> 240      */
> 241     if (recptr % XLogSegSize != 0)
> 242         recptr -= recptr % XLogSegSize;
> 243
>
> We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> that segment for any more records, so logically we'd have to go to
> 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
> 0x73000000, the start of a segment which the master has already removed
> (and which - technically - we don't actually need?).
>
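As a sanity check of that theory, here is a rough standalone sketch (again
not PostgreSQL source, and again assuming the default 16 MB XLogSegSize) of
what the quoted truncation does to the restart position in this example:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        const uint64_t XLogSegSize = 16 * 1024 * 1024;             /* default segment size */
        uint64_t recptr = ((uint64_t) 0x19DD << 32) | 0x73FFFFE0;  /* 19DD/73FFFFE0 */

        /* the same adjustment as the quoted RequestXLogStreaming code */
        if (recptr % XLogSegSize != 0)
            recptr -= recptr % XLogSegSize;

        printf("streaming requested from %" PRIX64 "/%08" PRIX64 "\n",
               recptr >> 32, recptr & 0xFFFFFFFF);
        return 0;
    }

That prints 19DD/73000000: the standby asks the master for segment ...0073
even though it has already received and flushed everything in it, while the
master, whose slot restart_lsn has advanced to 19DD/74000000, has already
removed that file.
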
>
> --
> Jon Nelson
> Dyn / Principal Software Engineer
>

--
Jon Nelson
Dyn / Principal Software Engineer
