| From: | Zane Duffield <duffieldzane(at)gmail(dot)com> |
|---|---|
| To: | Euler Taveira <euler(at)eulerto(dot)com> |
| Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org, shlok(dot)kyal(dot)oss(at)gmail(dot)com |
| Subject: | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
| Date: | 2025-04-23 03:30:47 |
| Message-ID: | CACMiCkX9PkKsV6qHkhChUtthFHq2+fjvUKgOp5=x2CqM-mqRbQ@mail.gmail.com |
| Lists: | pgsql-bugs |
I meant to say the logical *apply* worker was stuck, not the decoder
process.
On Wed, Apr 23, 2025 at 1:13 PM Zane Duffield <duffieldzane(at)gmail(dot)com>
wrote:
> Hi Euler, thanks for your reply.
>
> On Wed, Apr 23, 2025 at 11:58 AM Euler Taveira <euler(at)eulerto(dot)com> wrote:
>
>> On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
>>
>> I'm in the process of converting our databases from pglogical logical
>> replication to the native logical replication implementation on PostgreSQL
>> 17. One of the bugs we encountered and had to work around with pglogical
>> was the plugin dropping records while converting a streaming replica to
>> logical via pglogical_create_subscriber (reported at
>> https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to
>> confirm that the native logical replication implementation did not have
>> this problem, and I've found that it might have a different problem.
>>
>>
>> pg_createsubscriber uses a different approach than pglogical. While
>> pglogical uses a restore point, pg_createsubscriber uses the LSN from the
>> latest replication slot as a replication start point. The restore point
>> approach is usually suitable for physical replication but might not cover
>> all scenarios for logical replication (such as when there are in-progress
>> transactions). Since creating a logical replication slot does find a
>> consistent decoding start point, it is a natural choice to start the
>> logical replication (that also needs to find a decoding start point).
>>
>> I should say that I've been operating under the assumption that
>> pg_createsubscriber is designed for use on a replica for a *live* primary
>> database, if this isn't correct then someone please let me know.
>>
>>
>> pg_createsubscriber expects a physical replica that is preferably stopped
>> before running it.
>>
>
> I think pg_createsubscriber actually gives you an error if the replica is
> not stopped. I was talking about the primary.
>
>
>> Your script is not waiting long enough for the subscriber to apply the
>> backlog. Unless you are seeing a different symptom, there is no bug.
>>
>> You should have used something similar to the wait_for_subscription_sync
>> routine (Cluster.pm) before counting the rows. That's what is used in the
>> pg_createsubscriber tests. It guarantees the subscriber has caught up.
>>
>>
> It may be true that the script doesn't wait long enough for all systems,
> but when I reproduced the issue on my machine(s) I confirmed that the
> logical decoder process was properly stuck on a conflicting primary key,
> rather than just catching up.
>
> From the log file
>
>> 2025-04-16 09:17:16.090 AEST [3845786] port=5341 ERROR: duplicate key value violates unique constraint "test_table_pkey"
>> 2025-04-16 09:17:16.090 AEST [3845786] port=5341 DETAIL: Key (f1)=(20700) already exists.
>> 2025-04-16 09:17:16.090 AEST [3845786] port=5341 CONTEXT: processing remote data for replication origin "pg_24576" during message type "INSERT" for replication target relation "public.test_table" in transaction 1581, finished at 0/3720058
>> 2025-04-16 09:17:16.091 AEST [3816845] port=5341 LOG: background worker "logical replication apply worker" (PID 3845786) exited with exit code 1
>
>
> wait_for_subscription_sync sounds like a better solution than what I
> have, but you might still be able to reproduce the problem if you increase
> the sleep interval on line 198.
>
> I wonder if Shlok could confirm whether they found the conflicting primary
> key in their reproduction?
>
> Thanks,
> Zane
>
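
The quoted exchange contrasts a fixed sleep with polling until the subscriber has caught up. As a rough illustration only, the polling idea might look like the Python sketch below. It is not the Cluster.pm routine itself (that is Perl), and the names `wait_until_true`, `catchup_sql`, `SYNC_DONE_SQL` and the `run_query` callable are hypothetical. The SQL strings are assumptions modelled on the `pg_subscription_rel` catalog and the `pg_stat_replication` view, and the catch-up query assumes the apply worker connects with the subscription name as its `application_name`.

```python
import time
from typing import Callable, Optional

# Assumed check, run on the subscriber: no table is still in a pre-ready
# sync state ('r' = ready, 's' = synchronized).
SYNC_DONE_SQL = (
    "SELECT count(*) = 0 FROM pg_subscription_rel "
    "WHERE srsubstate NOT IN ('r', 's')"
)


def catchup_sql(subscription_name: str, target_lsn: str) -> str:
    """Build an assumed catch-up check to run on the publisher.

    Illustration only: the values are interpolated without escaping,
    so do not pass untrusted input.
    """
    return (
        f"SELECT '{target_lsn}'::pg_lsn <= replay_lsn AND state = 'streaming' "
        f"FROM pg_stat_replication "
        f"WHERE application_name = '{subscription_name}'"
    )


def wait_until_true(
    run_query: Callable[[str], Optional[bool]],
    sql: str,
    timeout_s: float = 180.0,
    interval_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `sql` until it yields True; return False once the timeout passes.

    `run_query` is a hypothetical caller-supplied function that executes
    the SQL and returns a single boolean, or None when no row comes back.
    """
    deadline = clock() + timeout_s
    while True:
        if run_query(sql) is True:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would first read `pg_current_wal_lsn()` on the publisher, then poll `catchup_sql(...)` with that value before counting rows. If the apply worker keeps exiting on a duplicate key, as in the log above, `replay_lsn` would presumably never reach the target and the poll would time out instead of succeeding, which separates "stuck" from "still catching up".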