Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication is missing block of rows when sending initial sync?
Date: 2023-10-30 15:55:55
Message-ID: ZT/SC0A3+9tIE8qi@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Oct 30, 2023 at 07:10:35AM +0000, Hayato Kuroda (Fujitsu) wrote:
> * Table definitions on both nodes

Well, table has 50 columns, 28 indexes, 10 fkeys, is referenced back by 26 tables, has two simple triggers.
Showing it all would be rather complicated and time consuming (legal issues).

Used datatypes:

- 14 columns with bigint
- 8 columns with character varying(255)
- 7 columns with timestamp without time zone
- 6 columns with boolean
- 3 columns with text
- 3 columns with integer
- 3 columns with double precision
- 2 columns with character varying(16)
- 1 column with uuid
- 1 column with numeric(6,2)
- 1 column with character varying(5)
- 1 column with character varying

I know it's not full \d, but I don't think sharing it in full is realistic.

Schema on recipient is restored from schema dump made ~ 5 minutes earlier.

> * GUC settings

Which GUCs?

> * Publication/subscription options

=# \dRp focal14
List of publications
Name │ Owner │ All tables │ Inserts │ Updates │ Deletes │ Truncates
─────────┼──────────┼────────────┼─────────┼─────────┼─────────┼───────────
focal14 │ postgres │ f │ t │ t │ t │ t
(1 row)

and

=# \dRs focal14
List of subscriptions
Name │ Owner │ Enabled │ Publication
─────────┼──────────┼─────────┼─────────────
focal14 │ postgres │ t │ {focal14}
(1 row)

> * Server log and output when you met the failure

The failure was just fail of adding fkey. These logs are gone now, I can't
keep failed host when I'm in process of preparing boxes for upgrade.

Error message was:

ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES c33s16831.subs(id);
ERROR: insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL: Key (sub_id)=(19733141) is not present in table "subs".

> * A minimal script for reproducing the failure (if you succeeded)

No. I think it's some kind of race condition.

> * Are there incoming data on publisher during the initial sync?

Yes. Pg12 is working as normal primary server for cluster that is having traffic 24h/day.

> * A method how you checked pg_subscription_rel

#v+
WITH info AS (
SELECT
format(
'%s:%s',
sr.srsubstate,
count(*)
) AS part,
min( o.ord ) AS ord
FROM
pg_subscription AS s
JOIN pg_subscription_rel AS sr ON s.oid = sr.srsubid
JOIN unnest( '{i,d,f,s,r}'::text[] ) WITH ORDINALITY AS o ( s, ord ) ON sr.srsubstate = o.s
WHERE
s.subname = 'focal14'
GROUP BY
sr.srsubstate
)
SELECT
string_agg(
part,
' ; '
order by ord
)
FROM
info;
#v-

This query I used while adding tables. Add 40 tables to publication, wait for all states to be 0, with exception of "r".
Afterwards I used:

#v+
SELECT
n.nspname,
c.relname
FROM
pg_subscription s
join pg_subscription_rel sr on s.oid = sr.srsubid
join pg_class c on sr.srrelid = c.oid
join pg_namespace n on c.relnamespace = n.oid
WHERE
s.subname = 'focal14'
ORDER BY n.nspname, c.relname
#v-

and compared it against

#v+
SELECT
schemaname,
tablename
FROM
pg_publication_tables
WHERE
pubname = 'focal14'
ORDER BY schemaname, tablename
#v-

> Could you share an error message what you met? I thought you saw like
> "Key (XXX)=(YYY) is not present in table "ZZZ".", but please paste here.

#v+
ALTER TABLE ONLY c33s16831.ag_change_records_2023_10 ADD CONSTRAINT fk_rails_0113b4b993 FOREIGN KEY (sub_id) REFERENCES c33s16831.subs(id);
ERROR: insert or update on table "ag_change_records_2023_10" violates foreign key constraint "fk_rails_0113b4b993"
DETAIL: Key (sub_id)=(19733141) is not present in table "subs".
#v-

> If so, are tuples really missing at that time?

Yes. I checked, and the whole block of values were missing - like I showed in previous mail.

> FYI - Below workflow is a brief description for initial synchronization.
> 1. Tablesync worker connects to publisher
> 2. Starts a transaction on both pub/sub
> 3. Creates a replication slot on publisher
> 4. Executes "COPY ... TO STDOUT" on publisher. This means that all tuples are extracted
> 5. Does BeginCopyFrom() and CopyFrom(), they are same infrastacture as normall "COPY FROM" statement.
> 6. Commits the transaction on both pub/sub
> 7. Catches up changes since the replication slot was defined at 4.

I kinda, understand it. That's why I thought it's important to report that
sometimes, something, somewhere breaks.

Since my project is time sensitive, I have to proceed, and if such problem
happens on my boxes, I just redo the whole sync from scratch.

So I don't mind that there is bug, as I have workaround, but I think it would
be cool to figure out what could be the potential source, and perhaps find a
fix.

> > Unfortunately I don't have a way to repeat the problem. Just noticed
> > that on some dbs that we're upgrading the problem is more common, and on
> > some - it generally doesn't happen.
> Just to confirm - You met the error only when the logical replication was set
> up from old clusters to newer, right?

Yes. That's what we use LR for. For upgrades.

It happens randomly, on some db clusters more than on others, but all in all,
at most I had to rebuilt a upgrade box 3 times, and the problem was gone.
Usually (in > 90% of cases) there is no problem.

Best regards,

depesz

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Bruce Momjian 2023-10-30 16:00:38 Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly
Previous Message Tom Lane 2023-10-30 15:32:26 Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly