Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com>, "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-08-02 09:42:07
Message-ID: CAGPVpCTuwTwAh8V8EcaKyea+RTk32CWUVX5Der13jrgk8wB5_Q@mail.gmail.com
Lists: pgsql-hackers

Hi,

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote on Wed, 2 Aug 2023 at 12:01:

> I think we are getting the error (ERROR: could not find logical
> decoding starting point) because we wouldn't have waited for WAL to
> become available before reading it. It could happen due to the
> following code:
> WalSndWaitForWal()
> {
>     ...
>     if (streamingDoneReceiving && streamingDoneSending &&
>         !pq_is_send_pending())
>         break;
>     ...
> }
>
> Now, it seems that in the 0003 patch, instead of resetting the flags
> streamingDoneSending and streamingDoneReceiving before starting
> replication, we should reset them before creating logical slots, because
> we need to read the WAL during that time as well to find the consistent
> point.
>

Thanks for the suggestion Amit. I've been looking into this recently and
couldn't figure out the cause until now.
I quickly made the fix in 0003. Seems like it resolved the "could not find
logical decoding starting point" errors.
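
To illustrate the ordering issue discussed above, here is a minimal toy
model, not the actual walsender code. Every toy_* name and the ToyWalSender
struct are hypothetical; only the early-exit condition mirrors the snippet
Amit quoted. It assumes that slot creation has to read WAL through the same
wait routine, so stale "done" flags left over from a previous stream make
it return without any WAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-connection walsender state. */
typedef struct ToyWalSender
{
    bool streaming_done_sending;
    bool streaming_done_receiving;
    bool send_pending;
    int  wal_available;     /* records that waiting would make readable */
} ToyWalSender;

/*
 * Mimics the early exit quoted from WalSndWaitForWal(): if the previous
 * stream was marked done on both sides and nothing is pending, return
 * immediately with no WAL instead of waiting for it.
 */
static int
toy_wait_for_wal(const ToyWalSender *ws)
{
    assert(ws != NULL);

    if (ws->streaming_done_receiving && ws->streaming_done_sending &&
        !ws->send_pending)
        return 0;

    return ws->wal_available;
}

/* Marks the current stream as finished on both sides. */
static void
toy_end_streaming(ToyWalSender *ws)
{
    ws->streaming_done_sending = true;
    ws->streaming_done_receiving = true;
}

/* Clears the flags so a later wait actually waits for WAL. */
static void
toy_reset_streaming_flags(ToyWalSender *ws)
{
    ws->streaming_done_sending = false;
    ws->streaming_done_receiving = false;
}

/*
 * Slot creation needs WAL to find a consistent point. Returns true when
 * a starting point could be found, false otherwise.
 */
static bool
toy_create_logical_slot(ToyWalSender *ws, bool reset_flags_first)
{
    if (reset_flags_first)
        toy_reset_streaming_flags(ws);

    return toy_wait_for_wal(ws) > 0;
}
```

In this sketch, creating a slot on a reused connection after
toy_end_streaming() fails unless the flags are reset first, which is the
behaviour the fix in 0003 is meant to avoid.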

vignesh C <vignesh21(at)gmail(dot)com> wrote on Tue, 1 Aug 2023 at 09:32:

> I agree that the "no copy in progress" issue has nothing to do with the
> 0001 patch. This issue is present with the 0002 patch.
> In the case when the tablesync worker has to apply the transactions
> after the table is synced, the tablesync worker sends the feedback of
> writepos, applypos and flushpos which results in "No copy in progress"
> error as the stream has ended already. Fixed it by exiting the
> streaming loop if the tablesync worker is done with the
> synchronization. The attached 0004 patch has the changes for the same.
> The rest of the v22 patches are the same patches that were posted by
> Melih in the earlier mail.

Thanks for the fix. I placed it into 0002 with a slight change as follows:

- send_feedback(last_received, false, false);
+ if (!MyLogicalRepWorker->relsync_completed)
+     send_feedback(last_received, false, false);

IMHO relsync_completed means essentially the same thing as streaming_done,
which is why I wanted to check that flag instead of adding a goto statement.
Does that make sense to you as well?
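
For anyone following along, the effect of the guard can be sketched with a
small toy model. This is not the patch code: ToyWorker, ToyStream and the
toy_* functions are hypothetical, and the only assumption taken from the
thread is that sending feedback after the stream has ended produces the
"no copy in progress" error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical worker state; mirrors only the relsync_completed flag. */
typedef struct ToyWorker
{
    bool relsync_completed;
} ToyWorker;

/* Hypothetical connection state with simple counters for observation. */
typedef struct ToyStream
{
    bool copy_in_progress;
    int  feedback_sent;
    int  errors;            /* counts "no copy in progress" failures */
} ToyStream;

/* Sending feedback only works while the copy stream is still open. */
static void
toy_send_feedback(ToyStream *s)
{
    assert(s != NULL);

    if (!s->copy_in_progress)
    {
        s->errors++;
        return;
    }
    s->feedback_sent++;
}

/*
 * Tail of one streaming-loop iteration. With guarded == true, feedback
 * is skipped once the worker has finished synchronizing its table.
 */
static void
toy_loop_tail(const ToyWorker *w, ToyStream *s, bool guarded)
{
    if (!guarded || !w->relsync_completed)
        toy_send_feedback(s);
}
```

With the guard, a worker whose sync is complete never touches the ended
stream, while a worker that is still syncing keeps sending feedback as
before.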

Thanks,
--
Melih Mutlu
Microsoft

Attachment                                                       Content-Type              Size
v23-0002-Reuse-Tablesync-Workers.patch                           application/octet-stream  10.3 KB
v23-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch     application/octet-stream  25.4 KB
v23-0003-Reuse-connection-when-tablesync-workers-change-t.patch  application/octet-stream  6.8 KB
