RE: Time delayed LR (WAS Re: logical replication restrictions)

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'vignesh C' <vignesh21(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: RE: Time delayed LR (WAS Re: logical replication restrictions)
Date: 2022-12-09 05:19:37
Message-ID: TYAPR01MB5866F6BE7399E6343A96E016F51C9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hi Vignesh,

> In the case of physical replication by setting
> recovery_min_apply_delay, I noticed that both primary and standby
> nodes were getting stopped successfully immediately after the stop
> server command. In case of logical replication, stop server fails:
> pg_ctl -D publisher -l publisher.log stop -c
> waiting for server to shut
> down...............................................................
> failed
> pg_ctl: server does not shut down
>
> In case of logical replication, the server does not get stopped
> because the walsender process is not able to exit:
> ps ux | grep walsender
> vignesh 1950789 75.3 0.0 8695216 22284 ? Rs 11:51 1:08
> postgres: walsender vignesh [local] START_REPLICATION

Thanks for reporting the issue. I have analyzed it.

This issue has occurred because the apply worker cannot reply during the delay.
I think we may have to modify the mechanism that delays applying transactions.

When a walsender process is requested to shut down, it can exit only after
all the WAL it has sent has been replicated on the subscriber. This check is
done in WalSndDone(), and the replicated position is updated when the walsender
handles reply messages from the subscriber, in ProcessStandbyReplyMessage().
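
To make the exit condition above concrete, here is a toy Python model. It is
only a sketch and not PostgreSQL code: the class, its fields, and the helper
function are hypothetical names, and the real check in WalSndDone() has further
conditions that are left out here.

```python
from dataclasses import dataclass


@dataclass
class WalSenderModel:
    """Toy model of the positions a walsender tracks (not PostgreSQL code)."""
    sent_lsn: int = 0        # last WAL position sent downstream
    replicated_lsn: int = 0  # last position confirmed by a reply message

    def send(self, upto_lsn: int) -> None:
        # Sending WAL only advances the sent position.
        self.sent_lsn = max(self.sent_lsn, upto_lsn)

    def process_reply(self, flush_lsn: int) -> None:
        # Stand-in for the part of ProcessStandbyReplyMessage() that
        # records the position reported by the downstream node.
        self.replicated_lsn = flush_lsn

    def can_exit(self) -> bool:
        # Simplified stand-in for the check in WalSndDone(): everything
        # that was sent must have been confirmed.
        return self.sent_lsn == self.replicated_lsn


def shutdown_succeeds(replies_during_delay: bool) -> bool:
    """Return True if the modelled walsender could exit on shutdown."""
    ws = WalSenderModel()
    ws.send(100)
    if replies_during_delay:
        ws.process_reply(100)
    return ws.can_exit()
```

In this model, shutdown_succeeds(False) corresponds to a downstream node that
stays silent during the delay: the confirmed position never catches up with the
sent position, so the exit check keeps failing.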

In the case of physical replication, the walreceiver can receive WAL and reply
even while application is delayed. This means that the replicated position is
reported to the upstream side immediately, so the walsender can exit.

In the case of logical replication, however, with the current patch the worker
cannot reply to the walsender while it is delaying the transaction. As a
result, the replicated position is never reported upstream and the walsender
cannot exit.

Based on the above analysis, we can conclude that the worker must update the
flushpos and reply to the walsender while delaying the transaction if we want
to solve the issue. This cannot be done with the current approach. A newer
proposal [1] may be able to solve it, although it is still under discussion.
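
As a rough illustration of "reply while delaying", the following Python sketch
splits one long wait into short waits and sends feedback before each of them.
It is not the approach from [1] and not PostgreSQL code: delay_apply(),
FakeClock, and the send_feedback callback are hypothetical, and which position
the feedback should report is deliberately left open.

```python
from typing import Callable


class FakeClock:
    """Deterministic clock for illustration; sleep() just advances time."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


def delay_apply(delay_s: float,
                feedback_interval_s: float,
                send_feedback: Callable[[], None],
                now: Callable[[], float],
                sleep: Callable[[float], None]) -> int:
    """Wait for delay_s seconds, sending feedback before each short wait.

    Returns the number of feedback messages sent during the delay.
    """
    deadline = now() + delay_s
    sent = 0
    while True:
        remaining = deadline - now()
        if remaining <= 0:
            break
        # Hypothetical callback: report the current position upstream.
        send_feedback()
        sent += 1
        # Never sleep past the deadline.
        sleep(min(feedback_interval_s, remaining))
    return sent
```

The clock and sleep functions are passed in only so that the loop can be
exercised without real waiting; a real worker would also have to handle
incoming messages and interrupts inside the loop.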

Note that a similar issue can be reproduced with physical replication.
When wal_sender_timeout is set to 0 and the network between the primary and
the standby breaks after the primary has sent WAL to the standby, the primary
node cannot be stopped.

[1]: https://www.postgresql.org/message-id/TYCPR01MB8373FA10EB2DB2BF8E458604ED1B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
