Re: Synchronizing slots from primary to standby

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, kato-sho(at)fujitsu(dot)com
Subject: Re: Synchronizing slots from primary to standby
Date: 2022-02-28 05:04:25
Message-ID: CALj2ACWfG0qUW1m7HffpvjjxoqZX-a9EipYMUOn+pTbA2pCHYw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 24, 2022 at 12:46 AM James Coleman <jtc331(at)gmail(dot)com> wrote:
> I've been working on adding test coverage to prove this out, but I've
> encountered the problem reported in [1].
>
> My assumption, but Andres please correct me if I'm wrong, that we
> should see issues with the following steps (given the primary,
> physical replica, and logical subscriber already created in the test):
>
> 1. Ensure both logical subscriber and physical replica are caught up
> 2. Disable logical subscription
> 3. Make a catalog change on the primary (currently renaming the
> primary key column)
> 4. Vacuum pg_class
> 5. Ensure physical replication is caught up
> 6. Stop primary and promote the replica
> 7. Write to the changed table
> 8. Update subscription to point to promoted replica
> 9. Re-enable logical subscription
>
> I'm attaching my test as an additional patch in the series for
> reference. Currently I have steps 3 and 4 commented out to show that
> the issues in [1] occur without any attempt to trigger the catalog
> xmin problem.
>
> Given this error seems pretty significant in terms of indicating
> fundamental lack of test coverage (the primary stated benefit of the
> patch is physical failover), and it currently is a blocker to testing
> more deeply.
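
For reference, the nine quoted steps could be written down as an ordered plan along the following lines. This is only a rough sketch, not the attached test: the table, column, and subscription names and the connection string are hypothetical placeholders, and the "wait" and "ctl" entries are plain descriptions rather than real commands. Step 7 is assumed to run on the promoted replica, since the primary has been stopped by then.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    node: str    # server the action targets: "primary", "standby", or "subscriber"
    kind: str    # "sql" for a statement, "wait" or "ctl" for a described action
    action: str  # SQL text, or a short description for non-SQL actions


def failover_test_plan(
    table: str = "tbl",            # hypothetical table name
    old_col: str = "id",           # hypothetical primary key column
    new_col: str = "id2",          # hypothetical new column name
    sub: str = "sub",              # hypothetical subscription name
    standby_conninfo: str = "host=standby dbname=postgres",  # hypothetical
) -> List[Step]:
    """Return the nine quoted steps, in order, as a list of Step entries."""
    return [
        Step("primary", "wait", "logical subscriber and physical replica caught up"),
        Step("subscriber", "sql", f"ALTER SUBSCRIPTION {sub} DISABLE"),
        Step("primary", "sql",
             f"ALTER TABLE {table} RENAME COLUMN {old_col} TO {new_col}"),
        Step("primary", "sql", "VACUUM pg_class"),
        Step("primary", "wait", "physical replica caught up"),
        Step("standby", "ctl", "stop the primary, then promote the standby"),
        # Assumption: the write goes to the promoted standby.
        Step("standby", "sql", f"INSERT INTO {table} ({new_col}) VALUES (1)"),
        Step("subscriber", "sql",
             f"ALTER SUBSCRIPTION {sub} CONNECTION '{standby_conninfo}'"),
        Step("subscriber", "sql", f"ALTER SUBSCRIPTION {sub} ENABLE"),
    ]


if __name__ == "__main__":
    # Print the plan as a numbered list, mirroring the quoted steps.
    for number, step in enumerate(failover_test_plan(), start=1):
        print(f"{number}. [{step.node}/{step.kind}] {step.action}")
```

Dropping the third and fourth entries from the returned list would correspond to the current state of the test, where steps 3 and 4 are commented out.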

A few of my initial concerns, raised in [1], are these:

1) Instead of a new LIST_SLOT command, can't we use
READ_REPLICATION_SLOT (slight modifications would be needed to make
it support logical replication slots and to get more information from
the subscriber)?

2) How frequently is the new bg worker going to sync the slot info?
How can it ensure that the latest information exists, say, when the
subscriber is down/crashed before it picks up the latest slot
information?

4) IIUC, the proposal works only for logical replication slots, but do
you also see the need for supporting some kind of synchronization of
physical replication slots as well? IMO, we need a better and
consistent way for both types of replication slots. If the walsender
can somehow push the slot info from the primary (for physical
replication slots)/publisher (for logical replication slots) to the
standby/subscribers, this will be a more consistent and simpler
design. However, I'm not sure if this design is doable at all.

Can anyone help clarify these?

[1] https://www.postgresql.org/message-id/CALj2ACUGNGfWRtwwZwT-Y6feEP8EtOMhVTE87rdeY14mBpsRUA%40mail.gmail.com

Regards,
Bharath Rupireddy.
