Re: BUG #14538: streaming replication same wal missing

From: Vladimir Svedov <vodevsh(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14538: streaming replication same wal missing
Date: 2017-02-10 09:10:19
Message-ID: CADqDLE_qKLcOrrwD4kduVd915-axQedDp6qNn7Q9WNP9x-HcBg@mail.gmail.com
Lists: pgsql-bugs

Thank you Michael,
Sorry guys for bothering you!
It is not a bug - it is nothing more than my lack of knowledge of the
ERRORs produced, plus panic.
I mistook the error for one produced by the slave failing to find a WAL
segment on the master, which would instead read "FATAL: could not receive
data from WAL stream".
As Michael said, the error was produced by one of the cascaded slaves I had
forgotten to take down (a quick check for this is sketched below).
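
For anyone who hits the same confusion: a quick way to check whether a WAL
sender on the slave is still serving a downstream (cascading) standby is to
query pg_stat_replication on the slave itself. A minimal sketch, assuming a
9.3-era server and a superuser named postgres:

# run on the *slave*: any row here is a client streaming from this
# server's WAL sender, i.e. a cascading standby when run on a standby
psql -U postgres -c \
  "SELECT pid, client_addr, state, sent_location FROM pg_stat_replication;"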
I did not know how to withdraw my bug report except by replying once
somebody answered it, so I am doing so only now.
Thank you, and my apologies again.

2017-02-10 1:48 GMT+00:00 Michael Paquier <michael(dot)paquier(at)gmail(dot)com>:

> On Thu, Feb 9, 2017 at 10:58 PM, <vodevsh(at)gmail(dot)com> wrote:
> > Bug reference: 14538
> > Logged by: Vladimir Svedov
> > Email address: vodevsh(at)gmail(dot)com
> > PostgreSQL version: 9.3.11
>
> You should really update to 9.3.16, the latest minor release in the
> 9.3 series. You are missing more than a year's worth of bug fixes here.
>
> > Several days ago I saw this in the master log:
> >
> > replica 2017-02-06 16:10:55 UTC ERROR: requested WAL segment
> > 000000030000096D00000052 has already been removed
> > This message kept repeating until I stopped the slave.
>
> That happens :"(
>
> > Checking pg_stat_activity on prod showed an autovacuum task on a ~250GB
> > table, and I decided that it had produced a huge amount of WAL. Checking
> > the segments confirmed my assumption - all wal_keep_segments=500 of them
> > were from the last minute. So I thought that network bandwidth was not
> > sufficient to send WALs to the slave fast enough for it to replay them.
> > Master has archive_command = 'cd .', so no way back.
> >
> > I did not panic; I recreated the replication. I:
> >
> > delete all data files on slave
> > pg_start_backup() on prod
> > rsync -zahv data
> > pg_start slave
>
> Does this line mean that you are starting the slave? You may want to
> start it after issuing pg_stop_backup(), so that a segment switch is
> done and the standby is able to fetch all the data it needs once it
> has been properly archived...
>
> > pg_stop_backup() on prod and check logs.
> > Imagine my desperation to see that the slave has this:
> > -bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
> > LOG: could not open usermap file "/pg/data93/pg_ident.conf": No such
> > file or directory
> > LOG: database system was shut down in recovery at 2017-02-09 12:15:07 UTC
> > LOG: entering standby mode
> > LOG: redo starts at 982/39074640
> > FATAL: the database system is starting up
> > FATAL: the database system is starting up
> > FATAL: the database system is starting up
> > LOG: consistent recovery state reached at 982/3AEC60F8
> > LOG: unexpected pageaddr 97C/83EC8000 in log segment
> > 00000003000009820000003A, offset 15499264
> > LOG: database system is ready to accept read only connections
> > LOG: started streaming WAL from primary at 982/3A000000 on timeline 3
> > ERROR: requested WAL segment 000000030000096D00000052 has already been
> > removed
>
> This error message is generated by a WAL sender running on your new
> *slave*, meaning that there is a cascading standby connected to this
> slave server and trying to fetch this WAL segment. But the slave no
> longer has it, as it was already recycled during recovery.
>
> If the slave you just created were complaining about a missing segment
> from its primary, it would throw this instead:
> FATAL: could not receive data from WAL stream: ERROR: requested WAL
> segment 000000010000000000000007 has already been removed
>
> In short, your new slave is fine, but you do need to rebuild the
> cascading standby behind it.
> --
> Michael
>
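
For reference, here is a minimal sketch of the rebuild with the ordering
Michael suggests, i.e. pg_stop_backup() issued before the slave is started.
The 'rebuild' label, the slavehost name, and the rsync options are
illustrative assumptions; the data directory path is taken from the log
above:

# on the master: enter exclusive backup mode (9.3 syntax); the second
# argument requests an immediate checkpoint
psql -U postgres -c "SELECT pg_start_backup('rebuild', true);"

# copy the data directory to the slave (after clearing the old one
# there, as in the steps above); postmaster.pid must not be carried over
rsync -zahv --exclude=postmaster.pid /pg/data93/ slavehost:/pg/data93/

# on the master: leave backup mode; this also forces a WAL segment
# switch, so the standby can fetch everything up to the backup end
psql -U postgres -c "SELECT pg_stop_backup();"

# only now, with recovery.conf in place on the slave, start it there
pg_ctl -D /pg/data93 start

And if the standby keeps outrunning wal_keep_segments during large
autovacuums, raising that setting (or using a real archive_command instead
of 'cd .') avoids having to rebuild at all.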
