Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Logical replication is missing block of rows when sending initial sync?
Date: 2023-10-27 13:33:13
Message-ID: ZTu8GTDajCkZVjMs@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,
we have situation where we use native logical replication to copy data
from pg12 (on ubuntu bionic) to pg14 (on ubuntu focal).

Generally it mostly works.

The process is like this:

1. make all tables in destination, with *just* primary keys. no other
indexes, constraints
2. setup logical replication, adding tables in batches by 40, adding new
batch to publication only when all tables are in 'r' state
3. once all tables are replicating, add indexes, fkeys, trgiggers and
the like on recipient.

Every so often we got errors when adding fkeys. Usually I didn't have
much time to investigate, but today had a bit. Unfortunatley the replica
is gone (I had to re-set the replication). But when I had it there
I noticed that a block of rows from one of the tables was missing.

I am not entirely sure about the size of the block, but I check 20 rows
with consecutive ids, and they were all missing from focal side:

#v+
=# select ctid, xmin, id, created_at, updated_at from c33s16831.subs where id between 19733130 and 19733150 order by 1;
ctid │ xmin │ id │ created_at │ updated_at
─────────────┼────────────┼──────────┼────────────────────────────┼────────────────────────────
(486337,57) │ 2668461379 │ 19733130 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,58) │ 2668461379 │ 19733131 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,59) │ 2668461379 │ 19733132 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,60) │ 2668461379 │ 19733133 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,61) │ 2668461379 │ 19733134 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,62) │ 2668461379 │ 19733135 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,63) │ 2668461379 │ 19733136 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,64) │ 2668461379 │ 19733137 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,65) │ 2668461379 │ 19733138 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,66) │ 2668461379 │ 19733139 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,67) │ 2668461379 │ 19733140 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486340,38) │ 2668691763 │ 19733147 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:07.360778
(486340,58) │ 2668695225 │ 19733146 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:15.227267
(486340,61) │ 2668711141 │ 19733144 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:43.499749
(486340,62) │ 2668717802 │ 19733149 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:49.900798
(486340,63) │ 2668723876 │ 19733141 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:00.749445
(486340,64) │ 2668753420 │ 19733142 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:02.140332
(486340,66) │ 2668761678 │ 19733150 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:19.140231
(486341,9) │ 2668726256 │ 19733145 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:06.845051
(486341,17) │ 2668748364 │ 19733148 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:50.246899
(486341,18) │ 2668750613 │ 19733143 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:55.863125
(21 rows)
#v-

Last vacuum (from autovacuum) was ~ 3 weeks ago, so it isn't a factor.

I don't know when exactly this batch was added to publication, but I found that COPY that sent initial sync of the data was logged at:

#v+
2023-10-25 18:48:43.982 UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",2023-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms statement: COPY c33s16831.subs TO STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
#v-

Unfortunately I don't have a way to repeat the problem. Just noticed
that on some dbs that we're upgrading the problem is more common, and on
some - it generally doesn't happen.

So, I don't have high hopes re: finding/fixing the issue, but figured
I'll let you know that there is such situation, perhaps it will strike
someone as something relating to something that could be the problem...

Best regards,

depesz

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Richard Guo 2023-10-27 14:10:31 Re: BUG #18170: Unexpected error: no relation entry for relid 3
Previous Message PG Bug reporting form 2023-10-27 11:02:14 BUG #18172: High memory usage in tSRF function context