From: | Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION |
Date: | 2022-12-26 14:28:00 |
Message-ID: | CANtu0ohgHM_7+M+3i2kXkZuLCsN_Dbio2pFfphrH+oOfzLfUhw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello.
Just a small story about small data-loss on logical replication.
We were logically replicating a 4 TB database from
> PostgreSQL 12.12 (Ubuntu 12.12-201-yandex.49163.d86383ed5b) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
to
> PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
Database includes many tables, but there are A, B and C tables. Tables
A, B and are C changed in the same transaction (new tuples created in
B and C with corresponding update of A).
Table A was added to the PUBLICATION from the start. Once initial sync
was done, tables B and C were added to PUBLICATION with REFRESH on
subscription (to reduce the WAL collection on the source database).
So, we see in logs:
> -2022-12-13 13:19:55 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "A" has started
> -2022-12-13 14:41:49 UTC-63987bfb.2733-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "A" has finished
> -2022-12-14 08:08:34 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has started
> -2022-12-14 10:19:08 UTC-63998482.7d10-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished
> -2022-12-14 10:37:47 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has started
> -2022-12-14 10:48:46 UTC-6399a77b.1fc-LOG: logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "C" has finished
Also, we had to reboot subscription server twice. Moreover, we have
plenty of messages like:
> -2022-12-13 15:53:30 UTC-639872fa.1-LOG: background worker "logical replication worker" (PID 47960) exited with exit code 1
> -2022-12-13 21:04:31 UTC-6398e8df.4f7c-LOG: logical replication apply worker for subscription "cloud_production_main_sub_v4" has started
> -2022-12-14 10:19:22 UTC-6398e8df.4f7c-ERROR: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
Additionally, our HA replica of subscriber was broken and recovered by
support… And logs like this:
> psql-2022-12-14 10:24:18 UTC-63999d2c.2020-WARNING: user requested cancel while waiting for synchronous replication ack.
> The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.
Furthermore, we were adding\removing another table D from publication
few times. So, it was a little bit messy story.
After all, we got streaming working for the whole database.
But after some time we realized we have lost 203 records for table B
created from
2022-12-14 09:21:25.705 to
2022-12-14 09:49:20.664 (after synchronization start, but before finish).
And the most tricky part here - A, B and C are changed in the same
transaction (related tuples). But tables A and C - are fine, only few
records from B are lost.
We have compared all other tables record to record - only 203 records
from B are missing. We have restored the server from backup with
point-in-time-recovery (to exclude case with application or human
error) - the same results. Furthermore, we have tried different
indexes in search (to exclude issue with broken btree) - the same
results.
So, yes, we understand our replication story was not a classical happy
path even close. But the result feels a little bit scary.
Currently, I have access to database and logs - so, feel free to ask
for additional debugging information if you like.
Thanks a lot,
Michail.
From | Date | Subject | |
---|---|---|---|
Next Message | Michail Nikolaev | 2022-12-26 15:19:35 | Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION |
Previous Message | Dilip Kumar | 2022-12-26 14:07:00 | Re: Time delayed LR (WAS Re: logical replication restrictions) |