Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, jdnelson(at)dyn(dot)com
Cc: "<pgsql-bugs(at)postgresql(dot)org>" <pgsql-bugs(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-01-17 11:17:01
Message-ID: CAM103DsCyXsfwZou9xQcKW8ap7UcTFNP_GM0E1orRAHmeJgZoA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Ouch! That is wrong.

The patch should decrement segno, not keep.
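
Concretely, I mean something like the following in KeepLogSeg() in
xlog.c. This is only a sketch of the corrected idea, using the variable
names from the current code; the attached second patch is the
authoritative version.

    /* check whether replication slots limit segment removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo   slotSegNo;

        XLByteToSeg(keep, slotSegNo);

        /*
         * Retain one extra segment so that a continuation record whose
         * first half lies just before restart_lsn is still available:
         * decrement the segment number, not the keep LSN.
         */
        if (slotSegNo > 1)
            slotSegNo--;

        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }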

On Tue, Jan 17, 2017 at 19:37 Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello. I added pgsql-hackers.
>
> This also occurs on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw(at)mail(dot)gmail(dot)com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson(at)dyn(dot)com> wrote:
> > > First, the postgresql configuration differs only minimally from the stock config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> > > in the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000 apply 5/748425A0
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000 apply 5/74843020
> > > <control-c here>
> > > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > > user=,db=,app=,client= LOG: received fast shutdown request
> > > user=,db=,app=,client= LOG: aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on timeline 1
> > > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on timeline 1
> > > user=,db=,app=,client= CONTEXT: writing block 589 of relation base/13294/16501
> > > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment 00000001000000060000008B, offset 0
> > > user=,db=,app=,client= DEBUG: switched WAL source from archive to stream after failure
> > > user=,db=,app=,client= LOG: started streaming WAL from primary at 6/8A000000 on timeline 1
> > > user=,db=,app=,client= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000060000008A has already been removed
>
> I managed to reproduce this. A little tweak, attached as the first
> patch, makes the standby kill itself as soon as walreceiver sees a
> continuation record at the beginning of a segment.
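>
> (Purely as an illustration, not the attached patch itself: the tweak
> amounts to something like the following in walreceiver.c's
> XLogWalRcvWrite(), using its buf/nbytes/recptr arguments and a trigger
> file /tmp/hoge.)
>
>     /*
>      * Illustration only: stop the server when a newly started segment
>      * begins with the latter half of a continuation record and the
>      * trigger file exists (needs <sys/stat.h>).
>      */
>     if (recptr % XLogSegSize == 0 && nbytes >= SizeOfXLogShortPHD)
>     {
>         XLogPageHeader hdr = (XLogPageHeader) buf;
>         struct stat st;
>
>         if ((hdr->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
>             stat("/tmp/hoge", &st) == 0)
>         {
>             elog(LOG, "#################### STOP THE SERVER");
>             exit(1);
>         }
>     }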
>
> - M(aster): Create a master with wal_keep_segments = 0 (default) and
> log_min_messages = debug2.
> - M: Create a physical replication slot.
> - S(tandby): Set up a standby database.
> - S: Edit recovery.conf to use the replication slot above, then
> start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
> > LOG: #################### STOP THE SERVER
>
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
> > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed
>
>
> This problem occurs when only the earlier part of a continuation
> record has been replicated and the segment is then removed on the
> master. In other words, the first half exists only on the standby,
> and the second half only on the master.
>
> I believe that a continuation record cannot span three or more
> *segments* (is that right?), so keeping one spare segment would be
> enough. The attached second patch does this.
>
>
> Other possible measures might be:
>
> - Allowing the WAL source to be switched while reading a continuation
> record. Currently ReadRecord assumes that a continuation record can
> be read from a single source, so this needs refactoring involving
> xlog.c, xlogreader.c and related code.
>
> - Delaying recycling of a segment until the last partial record on it
> is complete. This seems doable page-wise (coarse resolution) but
> would cost additional reads of past xlog files (the page headers of
> past pages are required).
>
> - Delaying write/flush feedback until the current record is complete.
> walreceiver is not aware of WAL record boundaries, and this might
> break synchronous replication.
>
> Any thoughts?
>
>
> =========================================
> > > A physical analysis shows that the WAL file 00000001000000060000008B is
> > > 100% zeroes (ASCII NUL).
>
> I suppose that is on the standby, so the segment file is the one onto
> which the next transferred record will be written.
>
> > > The results of querying pg_replication_slots show a restart_lsn that
> > > matches ….6/8B.
>
> It is the beginning of the next record to be replicated, as
> documented. In other words, it is just after the last transferred
> record (including padding).
>
> > > Pg_controldata shows values like:
> > > Minimum recovery ending location: 6/8Axxxxxx
>
> It is the beginning of the last applied record.
>
> > > How can the master show a position that is greater than the minimum
> > > recovery ending location?
>
> So it is natural for the former (the restart_lsn shown by the master)
> to be larger than the latter (the minimum recovery ending location).
>
> > > EXAMPLE #2:
> > >
> > > Minimum recovery ending location: 19DD/73FFFFE0
> > > Log segment 00000001000019DD00000073 was not available.
> > > The restart LSN was 19DD/74000000.
> > > The last few lines from pg_xlogdump 00000001000019DD00000073:
> > >
> > >
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16403/150017028 blk 1832
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0: rel 1663/16403/150017028 blk 11709
> > >
> > > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the
> > > first byte of the last record in this file, and the record length is
> > > 64 bytes, which places the first byte of the next record at: 16777184
> > > (0xffffe0) (logical position 0x73ffffe0: this jives with pg_controldata).
>
> Maybe right. pg_xlogdump skips partial records.
>
> > > However, there are only 32 bytes of file left:
> > > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > > 16777216 - 16777184 -=> 32
> > >
> > > Which means that the next record is in the WAL file
> > > 00000001000019DD00000074.
>
> Maybe right.
>
> > > A possible theory:
> > >
> > > Let us assume PG has applied 100% of the data in a given WAL file,
> > > and let's assume (as in this case) that the WAL file is
> > > 00000001000019DD00000073. When it starts up again, it uses the
> > > control data to start and says "The next record is at
> > > 19DD/0x73ffffe0", which it truncates to 0x73000000. However, PG has
> > > *also* already told the master that it has fully received, written,
> > > and flushed all of the data for that WAL file, so the master has
> > > 0x74000000 as the start position (and has consequently removed the
> > > WAL file for 0x73). The relationship between pg_controldata and
> > > pg_replication_slots.restart_lsn seems to be very slightly (but
> > > importantly) at odds.
> > >
> > > Could it be this part of the code?
>
> No, the code does the right thing. The problem is that a continuation
> record is assumed to come entirely from the same WAL source, that is,
> archive/pg_xlog or streaming. But here the continuation record is
> distributed across two sources.
>
> > > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> > >
> > >     /*
> > >      * We always start at the beginning of the segment. That prevents a
> > >      * broken segment (i.e., with no records in the first half of a
> > >      * segment) from being created by XLOG streaming, which might cause
> > >      * trouble later on if the segment is e.g archived.
> > >      */
> > >     if (recptr % XLogSegSize != 0)
> > >         recptr -= recptr % XLogSegSize;
> > >
> > > We start up with 19DD/0x73ffffe0 (but there would not be enough room
> > > in that segment for any more records, so logically we'd have to go to
> > > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0
> > > to 0x73000000, which the master has already removed (and - technically
> > > - we don't actually need?).
>
> 0x73ffffe0 and 0x73000000 are in the same segment. The current
> recovery mechanism requires reading the record that starts at
> 0x73ffffe0; that record is on the standby, so it can be read there.
>
>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
