Re: Logical decoding from promoted standby with same replication slot

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical decoding from promoted standby with same replication slot
Date: 2018-07-16 15:24:08
Message-ID: CAMa1XUgJBo5qaP5BhAobwqutx9NWX2VAc56w_mdZOmMWgPE38Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 13, 2018 at 2:30 PM, Jeremy Finzel <finzelj(at)gmail(dot)com> wrote:

> Hello -
>
> We are working on several DR scenarios with logical decoding. Although we
> are using pglogical the question we have I think is generally applicable to
> logical replication.
>
> Say we have need to drop a logical replication slot for some emergency
> reason on the master, but we don't want to lose the data permanently. We
> can make a point-in-time-recovery snapshot of the master to use in order to
> recover the lost data in the slot we are about to drop. Then we drop the
> slot on master.
>
> We can then point our logical subscription to pull from the snapshot to
> get the lost data, once we promote it.
>
> The question is that after promotion, logical decoding is looking for a
> timeline 2 file whereas the file is still at timeline 1.
>
> The WAL file is 00000001000008FD0000003C, for example. After promotion,
> it is still 00000001000008FD0000003C in pg_wal. But logical decoding says
> ERROR: segment 00000002000008FD0000003C has already been removed (it is
> looking for a timeline 2 WAL file). Simply renaming the file actually
> allows us to stream from the replication slot accurately and recover the
> data.
>
> But all of this begs the question of an easier way to do this - why
> doesn't logical decoding know to look for a timeline 1 file? It is really
> helpful to have this ability to easily recover logical replicated data from
> a snapshot of a replication slot, in case of disaster.
>
> All thoughts very welcome!
>
> Thanks,
> Jeremy
>

I'd like to bump this thread and elaborate on my original question: is it
possible to do a *controlled* failover reliably with logical decoding,
assuming there are unconsumed changes in the replication slot that the
client still needs?

It is rather easy to do a controlled failover if we can verify there are no
unconsumed changes in the slot before failover. Then, we just recreate the
slot on the promoted standby while clients are locked out, and we have not
missed any data changes.
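
To make "verify there are no unconsumed changes" concrete, here is a minimal
Python sketch of the comparison I have in mind. It assumes the two LSNs have
already been read as text, for example confirmed_flush_lsn from
pg_replication_slots and the result of pg_current_wal_lsn(), while writes
are locked out. The helper names parse_lsn and slot_is_caught_up are made up
for illustration, and the check is only an approximation: WAL can still
advance from background activity, so a small gap does not necessarily mean
there is data the client needs.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '8FD/3C000028' to a 64-bit integer.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_is_caught_up(confirmed_flush_lsn: str, current_wal_lsn: str) -> bool:
    """Return True if the slot has confirmed everything up to the current LSN.

    Hypothetical helper: both arguments are LSN strings fetched beforehand.
    """
    return parse_lsn(confirmed_flush_lsn) >= parse_lsn(current_wal_lsn)


if __name__ == "__main__":
    # Example values only; real values would come from the server.
    print(slot_is_caught_up("8FD/3C000028", "8FD/3C000028"))
    print(slot_is_caught_up("8FD/3B000000", "8FD/3C000028"))
```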

I am trying to figure out whether the problem of following timelines, as
described for example on this wiki page:
https://wiki.postgresql.org/wiki/Failover_slots, can be worked around in a
controlled scenario. One additional wrinkle is that after failover I have
two WAL files with the same segment name but on different timelines, and
the promoted standby is only going to decode from the one on the later
timeline. Does that mean I am likely to lose data?

Part of the reason I ask is that in testing I have NOT lost data when
doing a controlled failover as described above (i.e. with unconsumed
changes in the slot that I need to replay on the promoted standby). I am
trying to figure out whether I've gotten lucky or whether this method is
actually reliable. The method is to rename the WAL files to bump the
timeline: these files are identical to the ones that were replayed on the
master, and thus ought to yield the same logical decoding output to be
consumed.
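
For clarity, this is the name mapping I mean by "bumping the timeline",
as a small Python sketch. It relies only on the segment file name layout
(24 hex digits, the first 8 being the timeline ID) and computes the new
name; it does not touch pg_wal. The function names are hypothetical, and
this illustrates the workaround I tested, not something I am claiming is
supported or safe.

```python
import re

# A WAL segment file name is 24 uppercase hex digits:
# 8 for the timeline ID, followed by 16 identifying the segment.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def split_wal_name(name: str) -> tuple[int, str]:
    """Split a segment file name into (timeline ID, 16-digit segment part)."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    return int(name[:8], 16), name[8:]


def with_timeline(name: str, new_timeline: int) -> str:
    """Return the same segment's file name on a different timeline."""
    _, segment = split_wal_name(name)
    return f"{new_timeline:08X}{segment}"


if __name__ == "__main__":
    # The file from my example, mapped from timeline 1 to timeline 2.
    print(with_timeline("00000001000008FD0000003C", 2))
```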

Thank you!
Jeremy
