RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2023-11-13 13:57:28
Message-ID: OS0PR01MB5716CE0729CEB3B5994A954194B3A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
>
> >>> Also, the current coding doesn't ensure we will always give WARNING.
> >>> If we see the below code that deals with this WARNING,
> >>>
> >>> +  /* User created slot with the same name exists, emit WARNING. */
> >>> +  else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
> >>> +  {
> >>> +    ereport(WARNING,
> >>> +        errmsg("not synchronizing slot %s; it is a user created slot",
> >>> +           remote_slot->name));
> >>> +  }
> >>> +  /* Otherwise create the slot first. */
> >>> +  else
> >>> +  {
> >>> +    TransactionId xmin_horizon = InvalidTransactionId;
> >>> +    ReplicationSlot *slot;
> >>> +
> >>> +    ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> >>> +               remote_slot->two_phase, false);
> >>>
> >>> I think this is not a solid check to ensure that the slot existed
> >>> before. Because it could be created as soon as the slot sync worker
> >>> invokes ReplicationSlotCreate() here.
> >>
> >> Agree.
> >>
> >
> > So, having a concrete check to give WARNING would require some more
> > logic which I don't think is a good idea to handle this boundary case.
> >
>
> Yeah good point, agree to just error out in all the case then (if we discard the
> sync_ reserved wording proposal, which seems to be the case as probably not
> worth the extra work).

Thanks for the discussion!

Here is the V33 patch set which includes the following changes:

1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver.

2) Raise error if user slot with same name already exists on standby.

3) Hold spinlock when updating the porperty of the replication slot or when
reading the slot's info without acuqiring it.

4) Fixed the bug:
- if slot is invalidated on standby but standby is restarted immediately after
that, then it fails to recreate that slot and instead end up syncing it
again. It is fixed by checking the invalidated flag after acquiring the slot
and skip syncing for invalidated slots.

5) Fixed the bugs reported by Ajin[1][2].

6) Removed some unused variables.

Thanks Shveta for working on 1) 2) 4) and 5).
Thanks Ajin for testing 5).

---
TODO
There are few pending comments and bugs have not been addressed, I will work on
them in next version:

1) Comments posted by me[3]:
2) Shutdown the slotsync worker on promotion and probably let slotsync do necessary cleanups[4]
3) Consider documenting the hazard that create slot on standby may cause ERRORs
in the log and consider making it visible in the view.
4) One bug found internally: If we give non-existing dbname in primary_conninfo
on standby, it should be handled gracefully, launcher should skip
slots-synchronization.
5) One bug found internally: when wait_for_primary_slot_catchup is doing
WaitLatch and I stop the standby using: ./pg_ctl -D ../../standbydb/ -l
standby.log stop it does not come out of wait and shutdown hangs.

[1] https://www.postgresql.org/message-id/CAFPTHDZNV_HFAXULkaJOv_MMtLukCzDEgTaixxBwjEO_0Jg-kg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFPTHDa5C_vHQbeqemToyucWySB0kEFbdS2WOA0PB%2BGSei2v7A%40mail.gmail.com
[3] https://www.postgresql.org/message-id/OS0PR01MB571652CCD42F1D08D5BD69D494B3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[4] https://www.postgresql.org/message-id/64056e35-1916-461c-a816-26e40ffde3a0%40gmail.com

Best Regards,
Hou zj

Attachment Content-Type Size
v33-0003-Allow-slot-sync-workers-to-wait-for-the-cascadin.patch application/octet-stream 7.9 KB
v33-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch application/octet-stream 125.7 KB
v33-0002-Add-logical-slot-sync-capability-to-the-physical.patch application/octet-stream 121.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-11-13 14:17:40 Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)
Previous Message Daniel Gustafsson 2023-11-13 13:39:15 Re: proposal: possibility to read dumped table's name from file