Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load

From: Zane Duffield <duffieldzane(at)gmail(dot)com>
To: Euler Taveira <euler(at)eulerto(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, shlok(dot)kyal(dot)oss(at)gmail(dot)com
Subject: Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
Date: 2025-04-23 03:13:42
Message-ID: CACMiCkUm5gwcoS2=jap1vkrS_n+FbFLWY5XQJ8ssFc8BUCxGCg@mail.gmail.com
Lists: pgsql-bugs

Hi Euler, thanks for your reply.

On Wed, Apr 23, 2025 at 11:58 AM Euler Taveira <euler(at)eulerto(dot)com> wrote:

> On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
>
> I'm in the process of converting our databases from pglogical logical
> replication to the native logical replication implementation on PostgreSQL
> 17. One of the bugs we encountered and had to work around with pglogical
> was the plugin dropping records while converting a streaming replica to
> logical via pglogical_create_subscriber (reported at
> https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to
> confirm that the native logical replication implementation did not have
> this problem, and I've found that it might have a different problem.
>
>
> pg_createsubscriber uses a different approach than pglogical. While
> pglogical uses a restore point, pg_createsubscriber uses the LSN from the
> latest replication slot as a replication start point. The restore point
> approach is usually suitable to physical replication but might not cover
> all scenarios for logical replication (such as when there are in progress
> transactions). Since creating a logical replication slot does find a
> consistent decoding start point, it is a natural choice to start the
> logical replication (that also needs to find a decoding start point).
>
> I should say that I've been operating under the assumption that
> pg_createsubscriber is designed for use on a replica for a *live* primary
> database, if this isn't correct then someone please let me know.
>
>
> pg_createsubscriber expects a physical replica that is preferably stopped
> before running it.
>

I think pg_createsubscriber actually gives you an error if the replica is
not stopped. I was talking about the primary.

> Your script is not waiting enough time until it applies the backlog.
> Unless you are seeing a different symptom, there is no bug.
>
> You should have used something similar to the wait_for_subscription_sync
> routine (Cluster.pm) before counting the rows. That's what is used in the
> pg_createsubscriber tests. It guarantees the subscriber has caught up.
>
>
It may be true that the script doesn't wait long enough for all systems,
but when I reproduced the issue on my machine(s) I confirmed that the
logical decoder process was properly stuck on a conflicting primary key,
rather than just catching up.

From the log file:

> 2025-04-16 09:17:16.090 AEST [3845786] port=5341 ERROR: duplicate key
> value violates unique constraint "test_table_pkey"
> 2025-04-16 09:17:16.090 AEST [3845786] port=5341 DETAIL: Key (f1)=(20700)
> already exists.
> 2025-04-16 09:17:16.090 AEST [3845786] port=5341 CONTEXT: processing
> remote data for replication origin "pg_24576" during message type "INSERT"
> for replication target relation "public.test_table" in transaction 1581,
> finished at 0/3720058
> 2025-04-16 09:17:16.091 AEST [3816845] port=5341 LOG: background worker
> "logical replication apply worker" (PID 3845786) exited with exit code 1

wait_for_subscription_sync sounds like a better solution than what I
have, but you might still be able to reproduce the problem if you increase
the sleep interval on line 198.
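
For anyone adapting the script, here is a rough Python sketch of the kind of
catch-up wait being discussed. It is not the Cluster.pm routine, only an
illustration of the idea: poll the subscriber's replayed LSN until it reaches
a target LSN, and give up after a timeout. The names parse_lsn,
wait_for_catchup and fetch_replay_lsn are hypothetical; fetch_replay_lsn is
assumed to be a callable you supply, for example one that reads replay_lsn
from pg_stat_replication on the publisher.

```python
import time
from typing import Callable, Optional


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/3720058' into a single integer.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_catchup(
    fetch_replay_lsn: Callable[[], Optional[str]],  # hypothetical, caller-supplied
    target_lsn: str,
    timeout: float = 180.0,
    interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the replayed LSN reaches target_lsn.

    Returns True once the subscriber has caught up and False if the
    timeout expires first. A None result from fetch_replay_lsn (no
    replication connection visible yet) counts as "not caught up".
    """
    target = parse_lsn(target_lsn)
    deadline = time.monotonic() + timeout
    while True:
        current = fetch_replay_lsn()
        if current is not None and parse_lsn(current) >= target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With a wait like this, a subscriber that is merely behind would eventually
return True, whereas an apply worker that keeps exiting on the duplicate key
error above would presumably never reach the target and the call would return
False at the timeout. That distinction is the one I was trying to draw.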

I wonder if Shlok could confirm whether they found the conflicting primary
key in their reproduction?

Thanks,
Zane
