RE: Exit walsender before confirming remote flush in logical replication

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, "ashutosh(dot)bapat(dot)oss(at)gmail(dot)com" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Subject: RE: Exit walsender before confirming remote flush in logical replication
Date: 2022-12-23 12:54:15
Message-ID: TYAPR01MB5866CCD2C21790FEBE944034F5E99@TYAPR01MB5866.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Horiguchi-san,

> Thus how about before entering an apply_delay, logrep worker sending a
> kind of crafted feedback, which reports commit_data.end_lsn as
> flushpos? A little tweak is needed in send_feedback() but seems to
> work..

Thanks for replying! I tested your suggestion, but it did not work well...

I made a PoC for the non-streaming case, based on the latest time-delayed patches [1].
In the PoC, an apply worker that is delaying application sends begin_data.final_lsn as both recvpos and flushpos in send_feedback().

The following is the feedback message I got; we can see that recv and flush were overwritten.

```
DEBUG: sending feedback (force 1) to recv 0/1553638, write 0/1553550, flush 0/1553638
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 730, finished at 0/1553638
```

On the walsender side, however, sentPtr was still slightly larger than the flush position reported by the subscriber.

```
(gdb) p MyWalSnd->sentPtr
$2 = 22361760
(gdb) p MyWalSnd->flush
$3 = 22361656
(gdb) p *MyWalSnd
$4 = {pid = 28807, state = WALSNDSTATE_STREAMING, sentPtr = 22361760, needreload = false, write = 22361656,
flush = 22361656, apply = 22361424, writeLag = 20020343, flushLag = 20020343, applyLag = 20020343,
sync_standby_priority = 0, mutex = 0 '\000', latch = 0x7ff0350cbb94, replyTime = 725113263592095}
```
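
For readers comparing the two outputs: gdb prints the positions as decimal integers, while the feedback message uses the usual LSN notation. A minimal sketch of the conversion follows; the helper name `format_lsn` is hypothetical and not part of the PoC, and the numbers are copied from the gdb session above. It shows that flush equals 0/1553638, the value in the feedback message, and that sentPtr is 104 bytes ahead of it.

```python
def format_lsn(ptr: int) -> str:
    """Format a 64-bit WAL position as high/low 32-bit halves in hex."""
    return f"{ptr >> 32:X}/{ptr & 0xFFFFFFFF:X}"


# Values copied from the gdb output of *MyWalSnd above.
sent_ptr = 22361760
flush = 22361656
apply = 22361424

print(format_lsn(sent_ptr))  # 0/15536A0
print(format_lsn(flush))     # 0/1553638
print(format_lsn(apply))     # 0/1553550
print(sent_ptr - flush)      # 104
```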

Therefore I could not shut down the publisher node while the apply worker was delaying application.
Do you have any opinions about this?

```
$ pg_ctl stop -D data_pub/
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```
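
The failed shutdown is consistent with the exit check that the walsender performs at shutdown. Below is a simplified model of that check, assuming WalSndDone() in walsender.c exits only when the sender has caught up, sentPtr equals the replicated position (flush if valid, otherwise write), and no output is pending. The function and parameter names are hypothetical, and `caught_up=True` and `send_pending=False` are assumptions, since the gdb session does not show them.

```python
INVALID_LSN = 0  # stands in for InvalidXLogRecPtr


def walsender_can_exit(sent_ptr: int, write: int, flush: int,
                       caught_up: bool = True,
                       send_pending: bool = False) -> bool:
    """Simplified model of the walsender exit condition at shutdown."""
    # Use the flush position if the receiver reports one, else the write position.
    replicated_ptr = write if flush == INVALID_LSN else flush
    return caught_up and sent_ptr == replicated_ptr and not send_pending


# Values from the gdb session: flush lags sentPtr, so the walsender keeps waiting.
print(walsender_can_exit(sent_ptr=22361760, write=22361656, flush=22361656))  # False

# If the subscriber reported flush == sentPtr, the check would pass.
print(walsender_can_exit(sent_ptr=22361760, write=22361760, flush=22361760))  # True
```

Under this model, overwriting flushpos with begin_data.final_lsn is not enough, because sentPtr has already advanced past that position.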

[1]: https://www.postgresql.org/message-id/TYCPR01MB83730A3E21E921335F6EFA38EDE89@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
