Re: Bug in Physical Replication Slots (at least 9.5)?

From: Venkata B Nagothi <nag1010(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jonathon Nelson <jdnelson(at)dyn(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug in Physical Replication Slots (at least 9.5)?
Date: 2017-03-31 02:40:00
Message-ID: CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> Hello,
>
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010(at)gmail(dot)com>
> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg(at)mail(dot)
> gmail.com>
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > (default), min_log_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > > start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > > > LOG: #################### STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > > > FATAL: could not receive data from WAL stream: ERROR: requested
> WAL
> > > segment 00000001000000000000002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> >
> > Before applying the patch, I tried re-producing the above error -
> >
> > - I had master->standby in streaming replication
> > - Took the backup of master
> > - with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
>
> I suppose the "configure" means primary_slot_name in recovery.conf.
>
> > - I got the below error
> >
> > >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> > 0/F2000140: wanted 24, got 0
> > >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> > primary at 0/F2000000 on timeline 1
> > >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> > from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2
> has
> > already been removed
>
> Maybe you created the master slot with non-reserve (default) mode
> and put a some-minites pause after making the backup and before
> starting the standby. For the case the master slot doesn't keep
> WAL segments unless the standby connects so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> makes this more sure.
>
> > - Took the backup of master
> > - with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
>
> Creating the slot with the following command will save it.
>
> =# select pg_create_physical_replication_slot('s1', true);
>

I did a test again, by applying the patch and I am not sure if the patch is
doing the right thing ?

Here is test case -

- I ran pgbench
- I took the backup of the master first

- Below are the WALs on master after the stop backup -

postgres=# select pg_stop_backup();

NOTICE: WAL archiving is not enabled; you must ensure that all required
WAL segments are copied through other means to complete the backup
pg_stop_backup
----------------
0/8C000130
(1 row)

postgres=# \q
[dba(at)buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
total 65M
drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
-rw-------. 1 dba dba 16M Mar 31 11:09 00000001000000000000008E
-rw-------. 1 dba dba 16M Mar 31 11:17 00000001000000000000008F
-rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008C
-rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008D

- After the backup, i created the physical replication slot

postgres=# select pg_create_physical_replication_slot('repslot',true);

pg_create_physical_replication_slot
-------------------------------------
(repslot,0/8D000028)
(1 row)

postgres=# select pg_walfile_name('0/8D000028');

pg_walfile_name
---------------------------------------
00000001000000000000008D
(1 row)

Here, When you start the standby, it would ask for the file
00000001000000000000008C, which is the first file needed for the standby
and since i applied your patch, i am assuming that, the file
00000001000000000000008C should also be retained without being removed -
correct ?

- I started the standby and the below error occurs

>> 2017-03-31 11:26:01.288 AEDT [17475] LOG: invalid record length at
0/8C000108: wanted 24, got 0
>> 2017-03-31 11:26:01.291 AEDT [17486] LOG: started streaming WAL from
primary at 0/8C000000 on timeline 1
>> 2017-03-31 11:26:01.291 AEDT [17486] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 00000001000000000000008C has
already been removed

> and i could notice that the file "0000000100000000000000F2" was removed
> > from the master. This can be easily re-produced and this occurs
> > irrespective of configuring replication slots.
> >
> > As long as the file "0000000100000000000000F2" is available on the
> master,
> > standby continues to stream WALs without any issues.
> ...
> > If the scenario i created to reproduce the error is correct, then,
> applying
> > the patch is not making a difference.
>
> Yes, the patch is not for saving this case. The patch saves the
> case where the previous segment to the first required segment by
> standby was removed and it contains the first part of a record
> continues to the first required segment. On the other hand this
> case is that the segment at the start point of standby is just
> removed.
>

Which means, the file 00000001000000000000008C must be retained as it is
the first file standby is looking for - correct ?

Regards,
Venkata Balaji N

Database Consultant

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2017-03-31 03:58:50 Re: BUG #14606: Memory usage continue to increase
Previous Message Sam Dai 2017-03-30 16:30:31 Re: BUG #14606: Memory usage continue to increase

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-03-31 02:40:11 Re: [POC] A better way to expand hash indexes.
Previous Message Vitaly Burovoy 2017-03-31 02:39:30 Re: sequence data type