Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Date: 2023-01-02 12:49:39
Message-ID: CAA4eK1LYq+gJO6V34dVnnYy2adBxZDarvhhxTMFkxDr3Vh5OZg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 28, 2022 at 4:52 PM Michail Nikolaev
<michail(dot)nikolaev(at)gmail(dot)com> wrote:
>
> Hello.
>
> > None of these entries are from the point mentioned by you [1]
> > yesterday where you didn't find the corresponding data in the
> > subscriber. How did you identify that the entries corresponding to
> > that timing were missing?
>
> Some of them are before the interval, some after... But the source database
> was generating a lot of WAL during logical replication
> - some of these log entries from time AFTER completion of initial sync
> of B but (probably) BEFORE finishing B table catch up (entering
> streaming mode).
>
...
...
>
> So, in short, the story looks like:
>
> * initial sync of A (and other tables) started and completed, they are
> in streaming mode
> * B and C initial sync started (by altering PUBLICATION and SUBSCRIPTION)
> * B sync completed, but new changes are still applying to the tables
> to catch up primary
>

The point that is not completely clear from your description is the
timing of the missing records. In one of your previous emails, you
seemed to indicate that the data missing from Table B is from the time
when the initial sync for Table B was in progress, right? Also, from
your description, it seems that no error or restart happened during
the initial sync for Table B. Is that understanding correct?

> * logical replication apply worker is restarted because IO error on WAL receive
> * Postgres killed
> * Postgres restarted
> * C initial sync restarted
> * logical replication apply worker few times restarted because IO
> error on WAL receive
> * finally every table in streaming mode but with small gap in B
>

I am not able to see how these steps can lead to the problem. If the
problem is reproducible at your end, you might want to increase the
log verbosity to DEBUG1 and see whether the logs contain additional
information that can help. It would also be really good to have a
self-contained test that reproduces it.
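
A minimal sketch of the verbosity change suggested above, assuming it
is applied on the subscriber through the log_min_messages setting and a
configuration reload. The Python helper names are hypothetical and only
build the SQL text; log_min_messages, pg_reload_conf() and the
pg_subscription_rel catalog are standard PostgreSQL. The state query is
an extra assumption on my side: it may help to confirm which tables
have actually reached the ready state.

```python
# Sketch only: prints SQL to run on the subscriber (for example via psql).
# log_min_messages, pg_reload_conf() and pg_subscription_rel are standard
# PostgreSQL; the Python helper names are hypothetical.

VALID_LEVELS = ("debug5", "debug4", "debug3", "debug2", "debug1", "info",
                "notice", "warning", "error", "log", "fatal", "panic")

# srsubstate codes as documented for pg_subscription_rel (PostgreSQL 14).
SYNC_STATES = {
    "i": "initialize",
    "d": "data is being copied",
    "f": "finished table copy",
    "s": "synchronized",
    "r": "ready (normal replication)",
}

# Lists the per-table sync state and the LSN recorded for it.
SYNC_STATE_QUERY = (
    "SELECT srrelid::regclass AS table_name, srsubstate, srsublsn "
    "FROM pg_subscription_rel ORDER BY 1;"
)


def verbosity_statements(level="debug1"):
    """Return the statements that set server log verbosity and reload."""
    level = level.lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log_min_messages level: {level!r}")
    # ALTER SYSTEM cannot run inside a transaction block, so each statement
    # should be sent on its own (psql does this by default).
    return [
        f"ALTER SYSTEM SET log_min_messages = '{level}';",
        "SELECT pg_reload_conf();",
    ]


def describe_state(code):
    """Map an srsubstate code to its documented meaning."""
    return SYNC_STATES.get(code, f"unknown state {code!r}")


if __name__ == "__main__":
    for stmt in verbosity_statements("debug1"):
        print(stmt)
    print(SYNC_STATE_QUERY)
```

Once the investigation is done, ALTER SYSTEM RESET log_min_messages
followed by another reload should return the server to its previous
verbosity.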

--
With Regards,
Amit Kapila.
