Re: Synchronizing slots from primary to standby

From: "Hsu, John" <hsuchen(at)amazon(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2021-12-16 02:15:42
Message-ID: 2415E2B4-F79E-4C24-A28E-78D40721D08F@amazon.com
Lists: pgsql-hackers

Hello,

I took a brief look at the v2 patch, and it does appear to work for the basic case. The logical slot is synchronized to the standby, and after promoting the standby I can connect to it and stream changes.

It's not clear to me what the correct behavior is when a logical slot has been synced to the replica and is then dropped on the writer. Would we expect the drop to be propagated, or do we leave it up to the end user to manage?

> +    rawname = pstrdup(standby_slot_names);
> +    SplitIdentifierString(rawname, ',', &namelist);
> +
> +    while (true)
> +    {
> +        int         wait_slots_remaining;
> +        XLogRecPtr  oldest_flush_pos = InvalidXLogRecPtr;
> +        int         rc;
> +
> +        wait_slots_remaining = list_length(namelist);
> +
> +        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +        for (int i = 0; i < max_replication_slots; i++)
> +        {

Even though standby_slot_names is PGC_SIGHUP, we never reload or re-process its value inside this loop. If there is a wrong entry in it, the backend stays stuck until the logical connection is re-established. Including "postmaster/interrupt.h" and handling ConfigReloadPending with ProcessConfigFile does seem to work.
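
To illustrate the idea, here is a minimal standalone sketch of a wait loop that re-reads its setting when a reload is pending. It is not the patch code and does not use PostgreSQL headers: reload_pending, process_config_file(), split_names() and slot_has_caught_up() are hypothetical stand-ins for ConfigReloadPending, ProcessConfigFile(), SplitIdentifierString() and the flush-position check, and the SIGHUP is simulated by an iteration number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 8
#define MAX_NAME_LEN 64

/* Stand-ins (assumptions): the GUC value, the edited config file, the SIGHUP flag. */
static char slot_names_setting[256] = "standby1,typo_slot";
static char config_file_value[256] = "standby1";
static volatile bool reload_pending = false;

/* Stand-in for ProcessConfigFile(): adopt the value from the "config file". */
void process_config_file(void)
{
    strncpy(slot_names_setting, config_file_value, sizeof(slot_names_setting) - 1);
    slot_names_setting[sizeof(slot_names_setting) - 1] = '\0';
}

/* Simplified stand-in for SplitIdentifierString(): commas only, no quoting or trimming. */
int split_names(const char *raw, char names[][MAX_NAME_LEN], int max_names)
{
    int count = 0;
    const char *p = raw;

    while (*p != '\0' && count < max_names)
    {
        size_t len = strcspn(p, ",");

        if (len > 0 && len < MAX_NAME_LEN)
        {
            memcpy(names[count], p, len);
            names[count][len] = '\0';
            count++;
        }
        p += len;
        if (*p == ',')
            p++;
    }
    return count;
}

/* Stand-in for the flush-position check; only "standby1" ever catches up. */
bool slot_has_caught_up(const char *name)
{
    return strcmp(name, "standby1") == 0;
}

/*
 * Wait until every listed slot has caught up. signal_at simulates a SIGHUP
 * arriving at the start of that iteration (0 means never). Returns the
 * iteration on which the wait finished, or -1 after max_iters attempts.
 */
int wait_for_standby_slots(int max_iters, int signal_at)
{
    char names[MAX_NAMES][MAX_NAME_LEN];
    int nnames;

    assert(max_iters > 0);
    nnames = split_names(slot_names_setting, names, MAX_NAMES);

    for (int iter = 1; iter <= max_iters; iter++)
    {
        int remaining = 0;

        if (iter == signal_at)
            reload_pending = true;      /* simulated SIGHUP */

        /* Key step: on reload, re-read the setting and rebuild the list. */
        if (reload_pending)
        {
            reload_pending = false;
            process_config_file();
            nnames = split_names(slot_names_setting, names, MAX_NAMES);
        }

        for (int i = 0; i < nnames; i++)
            if (!slot_has_caught_up(names[i]))
                remaining++;

        if (remaining == 0)
            return iter;
        /* A real backend would sleep on its latch here before retrying. */
    }
    return -1;
}
```

With the initial setting "standby1,typo_slot", wait_for_standby_slots(5, 0) never finishes and returns -1, while wait_for_standby_slots(5, 3) picks up the corrected value on the third iteration and returns 3.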

Another thing I noticed is that once the backend starts waiting in this block, Ctrl+C on the client doesn't seem to terminate it:

pg_recvlogical -d postgres -p 5432 --slot regression_slot --start -f -
..
^Cpg_recvlogical: error: unexpected termination of replication stream:

The logical backend connection is still present:

ps aux | grep 51263
hsuchen 51263 80.7 0.0 320180 14304 ? Rs 01:11 3:04 postgres: walsender hsuchen [local] START_REPLICATION

pstack 51263
#0 0x00007ffee99e79a5 in clock_gettime ()
#1 0x00007f8705e88246 in clock_gettime () from /lib64/libc.so.6
#2 0x000000000075f141 in WaitEventSetWait ()
#3 0x000000000075f565 in WaitLatch ()
#4 0x0000000000720aea in ReorderBufferProcessTXN ()
#5 0x00000000007142a6 in DecodeXactOp ()
#6 0x000000000071460f in LogicalDecodingProcessRecord ()

It can be terminated with a pg_terminate_backend though.

If a physical slot named foo exists on the standby and a logical slot with the same slot_name is then created on the writer, the sync worker errors out on the replica. This also prevents other slots from being synchronized, which is probably fine.

2021-12-16 02:10:29.709 UTC [73788] LOG: replication slot synchronization worker for database "postgres" has started
2021-12-16 02:10:29.713 UTC [73788] ERROR: cannot use physical replication slot for logical decoding
2021-12-16 02:10:29.714 UTC [73037] DEBUG: unregistering background worker "replication slot synchronization worker"

On 12/14/21, 2:26 PM, "Peter Eisentraut" <peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:

On 28.11.21 07:52, Bharath Rupireddy wrote:
> 1) Instead of a new LIST_SLOT command, can't we use
> READ_REPLICATION_SLOT (slight modifications needs to be done to make
> it support logical replication slots and to get more information from
> the subscriber).

I looked at that but didn't see an obvious way to consolidate them.
This is something we could look at again later.

> 2) How frequently the new bg worker is going to sync the slot info?
> How can it ensure that the latest information exists say when the
> subscriber is down/crashed before it picks up the latest slot
> information?

The interval is currently hardcoded, but could be a configuration
setting. In the v2 patch, there is a new setting that orders physical
replication before logical so that the logical subscribers cannot get
ahead of the physical standby.

> 3) Instead of the subscriber pulling the slot info, why can't the
> publisher (via the walsender or a new bg worker maybe?) push the
> latest slot info? I'm not sure we want to add more functionality to
> the walsender, if yes, isn't it going to be much simpler?

This sounds like the failover slot feature, which was rejected.
