Re: Bug in Physical Replication Slots (at least 9.5)?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: jdnelson(at)dyn(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-17 10:36:45
Message-ID: 20170117.193645.160386781.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hello. I added pgsql-hackers.

This occurs also on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw(at)mail(dot)gmail(dot)com>
> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000
> > apply 5/74843020
> > <control-c here>
> > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > user=,db=,app=,client= LOG: received fast shutdown request
> > user=,db=,app=,client= LOG: aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT: writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment
> > 00000001000000060000008B, offset 0
> > user=,db=,app=,client= DEBUG: switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG: started streaming WAL from primary at
> > 6/8A000000 on timeline 1
> > user=,db=,app=,client= FATAL: could not receive data from WAL stream:
> > ERROR: requested WAL segment 00000001000000060000008A has already been
> > removed

I managed to reproduce this. A little tweak as the first patch
lets the standby to suicide as soon as walreceiver sees a
contrecord at the beginning of a segment.

- M(aster): createdb as a master with wal_keep_segments = 0
(default), min_log_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
> LOG: #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

> FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed

This problem occurs when only the earlier parts in a continued
record is replicated then the segment is removed on the
master. In other words, the first half is only on standby, and
the second half is only on the master.

I believe that a continuation record cannot be span over three or
more *segments* (is it right?), so kepping one spare segment
would be enough. The attached second patch does this.

Other possible measures might be,

- Allowing switching wal source while reading a continuation
record. Currently ReadRecord assumes that a continuation record
can be read from single source. But this needs refactoring
involving xlog.c, xlogreader.c and relatives.

- Delaying recycing a segment until the last partial record on it
completes. This seems doable in page-wise (coarse resolution)
but would cost additional reading of past xlog files (page
header of past pages is required).

- Delaying write/flush feedback until the current record is
completed. walreceiver is not conscious of a WAL record and
this might break synchronous replication.

Any thoughts?

=========================================
> > A physical analysis shows that the WAL file 00000001000000060000008B is
> > 100% zeroes (ASCII NUL).

I suppose it is on the standby so the segment file is the one
where the next transferred record will be written onto.

> > The results of querying pg_replication_slots shows a restart_lsn that
> > matches ….6/8B.

It is the beginning of the next record to be replicatd as
documentation. In other words, just after the last transferred
record (containing padding).

> > Pg_controldata shows values like:
> > Minimum recovery ending location: 6/8Axxxxxx

It is the beginning of the last applied record.

> > How can the master show a position that is greater than the minimum
> > recovery ending location?

So it is natural that the former is larger than the latter.

> > EXAMPLE #2:
> >
> > Minimum recovery ending location: 19DD/73FFFFE0
> > Log segment 00000001000019DD00000073 was not available.
> > The restart LSN was 19DD/74000000.
> > The last few lines from pg_xlogdump 00000001000019DD00000073:
> >
> >
> > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> > 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0:
> > rel 1663/16403/150017028 blk 1832
> > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn:
> > 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0:
> > rel 1663/16403/150017028 blk 11709
> >
> > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the first
> > byte of the last record in this file, and the record length is 64 bytes
> > which places the first byte of the next record at: 16777184 (0xffffe0)
> > (logical position 0x73ffffe0: this jives with pg_controldata).

Maybe right. pg_xlogdump skips partial records.

> > However, there are only 32 bytes of file left:
> > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > 16777216 - 16777184 -=> 32
> >
> > Which means that the next record is in the WAL file
> > 00000001000019DD00000074.

Maybe right.

> > A possible theory:
> >
> > Let us assume PG has applied 100% of the data in a given WAL file, and
> > let’s assume (as in this case) that the WAL file is
> > 00000001000019DD00000073. When it starts up again, it uses the control
> > data to start and say “The next record is at 19DD/0x73ffffe0" which it
> > truncates to 0x73000000. However, PG has *also* already told the master
> > that is has fully received, written, and flushed all of the data for that
> > WAL file, so the master has 0x74000000 as the start position (and has
> > consequently removed the WAL file for 0x73). The relationship between
> > pg_controldata and pg_replication_slots.restart_lsn seem to be very
> > slightly (but importantly) at odds.
> >
> > Could it be this part of the code?

No. the code does the right thing. The problem is that a
continuation record is assumed to be on the same wal source, that
is, archive/wal and streaming. But a continueation record is
distributed to two sources.

> > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> >
> > 235 /*
> > 236 * We always start at the beginning of the segment. That prevents
> > a broken
> > 237 * segment (i.e., with no records in the first half of a segment)
> > from
> > 238 * being created by XLOG streaming, which might cause trouble
> > later on if
> > 239 * the segment is e.g archived.
> > 240 */
> > 241 if (recptr % XLogSegSize != 0)
> > 242 recptr -= recptr % XLogSegSize;
> > 243
> >
> > We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> > that segment for any more records, so logically we'd have to go to
> > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
> > 0x73000000, which the master has already removed (and - technically - we
> > don't actually need?).

0x73ffffe0 and 0x73000000 are on the same segment. Current
recovery mechanism requires reading the record starts from
0x73ffffe0 and it is on the standby and it is read.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
repl_border_bug_test.patch text/x-patch 1.0 KB
keep_extra_one_seg_for_replslot.patch text/x-patch 601 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2017-01-17 11:11:54 Re: Support for pg_receivexlog --format=plain|tar
Previous Message Tomas Vondra 2017-01-17 10:28:13 Re: PoC: Grouped base relation

Browse pgsql-bugs by date

  From Date Subject
Next Message vodevsh 2017-01-17 11:09:18 BUG #14500: error message displays "schema dot relation" in double quotes all together
Previous Message Tom Lane 2017-01-16 14:43:32 Re: BUG #14495: Cost of comparator is not taken into account in sorting