Re: Synchronizing slots from primary to standby

From: "Hsu, John" <hsuchen(at)amazon(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2022-01-21 23:02:50
Message-ID: BF248F5F-013D-49B8-810D-14F620819869@amazon.com
Lists: pgsql-hackers

> I might be missing something but isn’t it okay even if the new primary
> server is behind the subscribers? IOW, even if the slot's two LSNs (i.e.,
> restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
> LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
> transactions that were committed after the remote_lsn. So the
> subscriber can resume logical replication with the new primary without
> any data loss.

Maybe I'm misreading, but I thought the purpose of this was to make
sure that the logical subscriber does not have data that has not been
replicated to the new primary. The use-case I can think of is a
failover while synchronous_commit is enabled. If we didn't have this
setting, isn't it possible that the logical subscriber has extra
commits that aren't present on the newly promoted primary?
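The hazard I'm worried about can be made concrete with a toy sketch
(plain Python; the LSN values and variable names are all invented for
illustration, not taken from the patch):

```python
# Toy model of the failover hazard: LSNs are plain integers.
primary_insert_lsn = 300        # WAL written on the old primary
standby_flush_lsn = 200         # physical standby has only flushed up to here
subscriber_applied_lsn = 300    # logical subscriber already applied everything

# Without standby_slot_names holding back logical decoding, the logical
# subscriber can get ahead of the physical standby:
assert subscriber_applied_lsn > standby_flush_lsn

# After promoting the standby, the subscriber holds commits in
# (standby_flush_lsn, subscriber_applied_lsn] that the new primary never had.
lost_alignment = subscriber_applied_lsn - standby_flush_lsn
assert lost_alignment == 100

# With the proposed setting, logical decoding would wait until the listed
# standbys confirmed each LSN, capping the subscriber at the standby's flush:
held_back_subscriber_lsn = min(subscriber_applied_lsn, standby_flush_lsn)
assert held_back_subscriber_lsn <= standby_flush_lsn
```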

And sorry I accidentally started a new thread in my last reply.
Re-pasting some of my previous questions/comments:

wait_for_standby_confirmation does not re-read standby_slot_names once it
has entered its wait loop, so a SIGHUP reload has no effect there.
Similarly, synchronize_slot_names isn't refreshed once the worker is
launched.

If a logical slot was dropped on the primary, should the worker drop
logical slots that it was previously synchronizing but are no longer
present? Or should we leave that to the user to manage? I can't think of
a reason users would want logical slots synced to a standby but kept
around there after the original slot is gone.

Is there a reason we're deciding to use one-worker syncing per database
instead of one general worker that syncs across all the databases?
I imagine I'm missing something obvious here.

As for how standby_slot_names should be configured, I'd prefer
flexibility similar to what we have for synchronous_standby_names, since
that seems the most analogous setting. It would provide flexibility for
failovers, which I imagine is the most common use-case.
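For concreteness, something like the following is what I have in mind
(the standby_slot_names syntax here is hypothetical, simply borrowing
the quorum forms that synchronous_standby_names already accepts):

```
# postgresql.conf on the primary -- hypothetical syntax for this patch:
# hold back logical decoding until any 1 of these standbys has confirmed.
standby_slot_names = 'ANY 1 (standby_a, standby_b)'

# For comparison, the existing quorum syntax this would mirror:
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
```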

On 1/20/22, 9:34 PM, "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com> wrote:

On Wed, Dec 15, 2021 at 7:13 AM Peter Eisentraut
<peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:
>
> On 31.10.21 11:08, Peter Eisentraut wrote:
> > I want to reactivate $subject. I took Petr Jelinek's patch from [0],
> > rebased it, added a bit of testing. It basically works, but as
> > mentioned in [0], there are various issues to work out.
> >
> > The idea is that the standby runs a background worker to periodically
> > fetch replication slot information from the primary. On failover, a
> > logical subscriber would then ideally find up-to-date replication slots
> > on the new publisher and can just continue normally.
>
> > So, again, this isn't anywhere near ready, but there is already a lot
> > here to gather feedback about how it works, how it should work, how to
> > configure it, and how it fits into an overall replication and HA
> > architecture.
>
> The second,
> standby_slot_names, is set on the primary. It holds back logical
> replication until the listed physical standbys have caught up. That
> way, when failover is necessary, the promoted standby is not behind the
> logical replication consumers.

I might be missing something but isn’t it okay even if the new primary
server is behind the subscribers? IOW, even if the slot's two LSNs (i.e.,
restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
transactions that were committed after the remote_lsn. So the
subscriber can resume logical replication with the new primary without
any data loss.

What must not happen is the new primary being ahead of the subscribers,
because in that case the primary would forward the logical replication
start LSN to the slot’s confirm_flush_lsn and the subscriber would skip
changes. But that cannot happen, since the remote LSN of the
subscriber’s origin is always updated first, then the confirm_flush_lsn
of the slot on the primary, and finally the synchronized
confirm_flush_lsn of the slot on the standby.
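This ordering argument can be sketched with a toy simulation (plain
Python; LSN values invented, and the standby sync deliberately lags one
round to mimic periodic synchronization):

```python
# Toy model of the update ordering described above.
remote_lsn = 0             # pg_replication_origin on the subscriber
primary_confirm_flush = 0  # slot on the primary
standby_confirm_flush = 0  # synchronized slot on the standby

def step(new_lsn):
    global remote_lsn, primary_confirm_flush, standby_confirm_flush
    # 3. (from the previous round) the standby syncs the primary's slot,
    #    lagging behind on purpose to model periodic synchronization
    standby_confirm_flush = primary_confirm_flush
    # 1. the subscriber advances its origin's remote LSN first
    remote_lsn = new_lsn
    # 2. then the confirm_flush_lsn of the slot on the primary is updated
    primary_confirm_flush = new_lsn

for lsn in (100, 250, 400):
    step(lsn)
    # Invariant: the synchronized slot never gets ahead of the subscriber,
    # so after failover the new primary never skips past applied changes.
    assert standby_confirm_flush <= primary_confirm_flush <= remote_lsn
```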

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
