From: | "Euler Taveira" <euler(at)eulerto(dot)com> |
---|---|
To: | duffieldzane(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date: | 2025-04-23 01:57:53 |
Message-ID: | 86e60caf-d994-4b97-b504-22db9c825c52@app.fastmail.com |
Lists: | pgsql-bugs |
On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
> I'm in the process of converting our databases from pglogical logical
> replication to the native logical replication implementation on PostgreSQL
> 17. One of the bugs we encountered and had to work around with pglogical was
> the plugin dropping records while converting to a streaming replica to
> logical via pglogical_create_subscriber (reported
> https://github.com/2ndQuadrant/pglogical/issues/349) I was trying to
> confirm that the native logical replication implementation did not have this
> problem, and I've found that it might have a different problem.
pg_createsubscriber uses a different approach than pglogical. While pglogical
uses a restore point, pg_createsubscriber uses the LSN from the latest
replication slot as the replication start point. The restore point approach is
usually suitable for physical replication but might not cover all scenarios for
logical replication (such as when there are in-progress transactions). Since
creating a logical replication slot does find a consistent decoding start
point, it is a natural choice for starting logical replication (which also needs
to find a decoding start point).
> I should say that I've been operating under the assumption that
> pg_createsubscriber is designed for use on a replica for a *live* primary
> database, if this isn't correct then someone please let me know.
pg_createsubscriber expects a physical replica that is preferably stopped
before running it.
> I have a script that I've been using to reproduce the issue (pasted at end
> of email, because this bug reporting page doesn't seem to support
> attachments). It basically performs a loop that sets up a primary and a
> physical replica, generates some load, converts the replica to logical,
> waits, and makes sure the row counts are the same.
If I run your tests, it reports
$ NUM_THREADS=40 INSERT_WIDTH=1000 /tmp/logical_stress_test.sh
.
.
*** Successfully started logical replica on port 5341.
*** ALL INSERT LOOPS FINISHED
SOURCE COUNT = 916000
DEST COUNT = 768000
ERROR: record count mismatch
but after some time
$ psql -X -p 5340 -c "SELECT count(f1) FROM test_table" -d test_db
count
--------
916000
(1 row)
$ psql -X -p 5341 -c "SELECT count(f1) FROM test_table" -d test_db
count
--------
916000
(1 row)
I also checked the data
$ pg_dump -t test_table -p 5340 -d test_db -f - | sort > /tmp/p.out
$ pg_dump -t test_table -p 5341 -d test_db -f - | sort > /tmp/r.out
$ diff -q /tmp/p.out /tmp/r.out
$ echo $?
0
Your script does not wait long enough for the subscriber to apply the backlog.
Unless you are seeing a different symptom, there is no bug.
You should have used something similar to the wait_for_subscription_sync routine
(Cluster.pm) before counting the rows. That's what is used in the
pg_createsubscriber tests. It guarantees the subscriber has caught up.
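For a shell script, a rough equivalent of that wait could look like the sketch below. It is not the Cluster.pm code: it records the publisher's current WAL position and polls pg_stat_replication until the walsender reports a replay_lsn at or past it. The function name wait_for_catchup, the timeout and the polling interval are illustrative assumptions, and it assumes the converted replica is the only walsender connected to the publisher.

```shell
# Sketch only, not the Cluster.pm code. Assumes the publisher listens on the
# given port and that the only walsender is the converted logical replica.
# wait_for_catchup, the timeout and the 1-second poll are illustrative choices.
wait_for_catchup() {
    wfc_port=$1
    wfc_db=$2
    wfc_timeout=${3:-180}
    wfc_waited=0

    # Remember where the publisher's WAL ends right now.
    wfc_target=$(psql -X -At -p "$wfc_port" -d "$wfc_db" \
        -c "SELECT pg_current_wal_lsn()") || return 2

    while [ "$wfc_waited" -lt "$wfc_timeout" ]; do
        # Yields 't' once a streaming walsender exists and every one of them
        # reports replay_lsn at or past the recorded target.
        wfc_done=$(psql -X -At -p "$wfc_port" -d "$wfc_db" -c "
            SELECT count(*) > 0
                   AND coalesce(bool_and(coalesce(replay_lsn >= '$wfc_target'::pg_lsn, false)), false)
            FROM pg_stat_replication
            WHERE state = 'streaming'")
        [ "$wfc_done" = "t" ] && return 0
        sleep 1
        wfc_waited=$((wfc_waited + 1))
    done
    return 1
}
```

In the test script this would be called after the insert loops finish and before the row counts are compared, for example `wait_for_catchup 5340 test_db || exit 1`, assuming 5340 is the publisher port as in the run above.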
--
Euler Taveira
EDB https://www.enterprisedb.com/