Exit walsender before confirming remote flush in logical replication

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Subject: Exit walsender before confirming remote flush in logical replication
Date: 2022-12-22 05:46:11
Message-ID: TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear hackers,
(I added Amit to CC because we discussed this in another thread.)

This thread is forked from the time-delayed logical replication thread [1].
While discussing it, we thought that we could extend the condition for walsender shutdown [2][3].

Currently, a walsender delays acting on a shutdown request until it has
confirmed that all sent data has been flushed on the remote side. This condition
was added in 985bd7[4] to support clean switchover. Suppose there is a
primary-secondary physical replication system and we perform the following
steps. If any changes arrive during step 2 and the walsender has not confirmed
the remote flush, the reboot in step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to be the new primary.
3. Reboot the (old) primary as the new secondary.
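
As a rough illustration of the shutdown condition described above, here is a
minimal, self-contained C sketch. It is a toy model, not PostgreSQL source: the
type SenderState, the function may_exit_on_shutdown(), and all field names are
hypothetical, and the fallback to the write position when no flush position has
been reported is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position; 0 means "not reported". */
typedef uint64_t WalPos;

/* Hypothetical snapshot of sender state; not PostgreSQL's actual struct. */
typedef struct SenderState
{
    WalPos sent;          /* last position sent to the receiver */
    WalPos remote_write;  /* last position the receiver reported as written */
    WalPos remote_flush;  /* last position reported as flushed, 0 if none */
    bool   caught_up;     /* nothing left to send */
    bool   send_pending;  /* bytes still queued in the output buffer */
} SenderState;

/* Returns true when the sender may exit after a shutdown request. */
bool
may_exit_on_shutdown(const SenderState *s)
{
    /* Assumption: prefer the flush position, else fall back to write. */
    WalPos replicated = (s->remote_flush != 0) ? s->remote_flush
                                               : s->remote_write;

    /* The receiver cannot confirm more than was sent. */
    assert(replicated <= s->sent);

    /* Exit only once everything sent is confirmed by the remote side. */
    return s->caught_up && s->sent == replicated && !s->send_pending;
}
```

In this model, a sender whose receiver has flushed only part of the sent data
keeps waiting, which is the behavior that makes step 3 safe.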

In the case of logical replication, however, we cannot support the use case of
switching the roles publisher <-> subscriber. Suppose, in the same scenario as
above, that additional transactions are committed during step 2. To catch up
with those changes the subscriber must receive the WAL for those transactions,
but this is not possible because a subscriber cannot request WAL from a specific
position. In that case, we must first truncate all data on the new subscriber
and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore this condition when shutting down the
walsender in logical replication.

This change may be useful for time-delayed logical replication. The walsender
delays the shutdown until all changes are applied on the subscriber, even if the
apply is delayed. As a result, the publisher cannot be stopped if a large delay
time is specified.

PSA the minimal patch for that. I'm not sure whether WalSndCaughtUp should also
be omitted. It seems that the change may affect other parts such as
WalSndWaitForWal(), but we can investigate that further.
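
To make the idea concrete, the following C sketch shows one possible reading of
the relaxed condition. It is a toy model and not the attached patch: ReplKind,
ReplSender, and may_exit_relaxed() are hypothetical names. The sketch keeps the
caught-up check for both kinds of replication, which is an assumption, since
whether that check should also be dropped is exactly the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t WalPos;                 /* hypothetical WAL position */

typedef enum { KIND_PHYSICAL, KIND_LOGICAL } ReplKind;   /* hypothetical */

/* Hypothetical sender state; not PostgreSQL's actual struct. */
typedef struct ReplSender
{
    ReplKind kind;          /* physical or logical replication */
    WalPos   sent;          /* last position sent */
    WalPos   remote_flush;  /* last position confirmed flushed remotely */
    bool     caught_up;     /* nothing left to send */
    bool     send_pending;  /* bytes still queued in the output buffer */
} ReplSender;

/* Returns true when the sender may exit after a shutdown request. */
bool
may_exit_relaxed(const ReplSender *s)
{
    bool flush_ok;

    /* The receiver cannot confirm more than was sent. */
    assert(s->remote_flush <= s->sent);

    /* Physical: still wait for remote flush. Logical: skip that wait. */
    flush_ok = (s->kind == KIND_LOGICAL) || (s->sent == s->remote_flush);

    /* Assumption: the caught-up check is kept for both kinds. */
    return s->caught_up && flush_ok && !s->send_pending;
}
```

With this rule, a logical sender whose subscriber lags behind (for example
because of a configured apply delay) can exit immediately, while a physical
sender still waits for the standby to confirm the flush.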

[1]: https://commitfest.postgresql.org/41/3581/
[2]: https://www.postgresql.org/message-id/TYAPR01MB58661BA3BF38E9798E59AE14F5E19%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/CAA4eK1LyetktcphdRrufHac4t5DGyhsS2xG2DSOGb7OaOVcDVg%40mail.gmail.com
[4]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
0001-Exit-walsender-before-confirming-remote-flush-in-log.patch application/octet-stream 1.7 KB
