Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: nag1010(at)gmail(dot)com
Cc: jdnelson(at)dyn(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-17 07:48:27
Message-ID: 20170317.164827.46663014.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-bugs pgsql-hackers

Hello,

At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)gmail(dot)com>
> On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): createdb as a master with wal_keep_segments = 0
> > (default), min_log_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> > start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> > > LOG: #################### STOP THE SERVER
> >
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> > > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> > segment 00000001000000000000002B has already been removed
> >
> >
> I have been testing / reviewing the latest patch
> "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> need some more clarification on this.
>
> Before applying the patch, I tried re-producing the above error -
>
> - I had master->standby in streaming replication
> - Took the backup of master
> - with a low max_wal_size and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
> - Configured the replication slot on standby and started the standby

I suppose "configured" here means setting primary_slot_name in
recovery.conf.

> - I got the below error
>
> >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> 0/F2000140: wanted 24, got 0
> >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> primary at 0/F2000000 on timeline 1
> >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2 has
> already been removed

Maybe you created the master slot in non-reserve (default) mode
and paused for some minutes after making the backup and before
starting the standby. In that case the master slot doesn't retain
WAL segments until the standby connects, so a couple of
checkpoints can remove the first segment required by the
standby. This is quite reasonable behavior. The following steps
make it reproduce more reliably.

> - Took the backup of master
> - with a low max_wal_size = 2 and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
+ - SELECT pg_switch_wal(); on master twice.
+ - checkpoint; on master twice.
> - Configured the replication slot on standby and started the standby

Creating the slot with the following command avoids the problem,
since the second argument makes the slot reserve WAL immediately.

=# select pg_create_physical_replication_slot('s1', true);
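
As a rough illustration of the difference, here is a toy model in
Python. It is my own simplification, not the server code: it assumes
16MB segments and wal_keep_segments = 0, ignores recycling, and the
name removable_segments is made up for this sketch. A slot that has
not reserved WAL yet is modeled as slot_restart_lsn=None.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16MB)


def removable_segments(existing_segnos, redo_lsn, slot_restart_lsn=None):
    """Toy model of which segments a checkpoint may remove.

    existing_segnos  -- segment numbers currently present in pg_wal/pg_xlog
    redo_lsn         -- redo pointer used by the checkpoint (as an integer)
    slot_restart_lsn -- restart_lsn of the slot, or None if the slot has
                        not reserved any WAL yet (non-reserve mode, standby
                        not connected)
    """
    keep_from = redo_lsn // WAL_SEGMENT_SIZE
    if slot_restart_lsn is not None:
        # A slot holding a restart_lsn pulls the cutoff back to its segment.
        keep_from = min(keep_from, slot_restart_lsn // WAL_SEGMENT_SIZE)
    return [s for s in existing_segnos if s < keep_from]


segs = list(range(0xF0, 0xF5))       # segments F0 .. F4 exist
redo = 0xF4 * WAL_SEGMENT_SIZE       # checkpoint redo pointer is in F4

# Slot without a reservation: F2 is among the removable segments.
unreserved = removable_segments(segs, redo)

# Slot that reserved WAL inside F2: F2 and later segments are kept.
reserved = removable_segments(segs, redo, 0xF2 * WAL_SEGMENT_SIZE + 0x140)
```

In this model the unreserved slot lets the checkpoint drop segment F2,
which corresponds to the "has already been removed" error above, while
the reserved slot keeps it.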

> and i could notice that the file "0000000100000000000000F2" was removed
> from the master. This can be easily re-produced and this occurs
> irrespective of configuring replication slots.
>
> As long as the file "0000000100000000000000F2" is available on the master,
> standby continues to stream WALs without any issues.
...
> If the scenario i created to reproduce the error is correct, then, applying
> the patch is not making a difference.

Yes, the patch is not meant to save this case. The patch saves the
case where the segment just before the first segment required by
the standby has been removed, although it contains the first part
of a record that continues into the first required segment. In
this case, on the other hand, the segment at the standby's start
point itself is simply removed.
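
To make the two cases concrete, here is a small sketch of the segment
arithmetic in Python. It assumes the default 16MB segment size and
treats a record as a plain byte range (page headers are ignored); the
helper names and the record position 0/2BFFFFE0 are made up for
illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024            # assumed default (16MB)
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(text):
    """Convert an LSN printed as 'hi/lo' (hex) into an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, lsn):
    """Name of the WAL segment file that contains the given LSN."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def segments_for_record(timeline, start_lsn, length):
    """Segment files touched by `length` bytes of WAL starting at start_lsn."""
    first = start_lsn // WAL_SEGMENT_SIZE
    last = (start_lsn + length - 1) // WAL_SEGMENT_SIZE
    return [wal_file_name(timeline, s * WAL_SEGMENT_SIZE)
            for s in range(first, last + 1)]


# Your case: the standby's start point 0/F2000140 lies in segment ...F2,
# and that segment itself is gone.
start_segment = wal_file_name(1, parse_lsn("0/F2000140"))

# The case the patch targets: a (hypothetical) record that begins near the
# end of one segment and continues into the next, so both are needed.
spanning = segments_for_record(1, parse_lsn("0/2BFFFFE0"), 64)
```

Here start_segment is the file named in your error message, while
spanning lists two files: the standby needs the earlier one to re-read
the beginning of the record even though streaming restarts in the
later one.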

> I think, i need help in building a specific test case which will re-produce
> the specific BUG related to physical replication slots as reported ?
>
> Will continue to review the patch, once i have any comments on this.

Thanks a lot!

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
