Re: Handle infinite recursion in logical replication setup

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Handle infinite recursion in logical replication setup
Date: 2022-04-12 06:52:22
Message-ID: CALDaNm3LrDOSykDKDGoK6Dyrqq-nsxyH_LhYHrjEgijbUVXVuw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 12, 2022 at 10:26 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Thu, Apr 7, 2022 at 2:09 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > FYI, here is a test script that is using the current patch (v6) to
> > demonstrate a way to share table data between different numbers of
> > nodes (up to 5 of them here).
> >
> > The script starts off with just 2-way sharing (nodes N1, N2),
> > then expands to 3-way sharing (nodes N1, N2, N3),
> > then 4-way sharing (nodes N1, N2, N3, N4),
> > then 5-way sharing (nodes N1, N2, N3, N4, N5).
> >
> > As an extra complication, for this test, all 5 nodes have different
> > initial table data, which gets replicated to the others whenever each
> > new node joins the existing share group.
> >
> > PSA.
> >
>
>
> Hi Vignesh. I had some problems getting the above test script working.
> It was OK up until I tried to join the 5th node (N5) to the existing 4
> nodes. The ERROR was manifesting itself strangely because it appeared
> that there was an index violation in the pg_subscription_rel catalog
> even though AFAIK the N5 did not have any entries in it.
>
>
> e.g.
> 2022-04-07 09:13:28.361 AEST [24237] ERROR: duplicate key value
> violates unique constraint "pg_subscription_rel_srrelid_srsubid_index"
> 2022-04-07 09:13:28.361 AEST [24237] DETAIL: Key (srrelid,
> srsubid)=(16384, 16393) already exists.
> 2022-04-07 09:13:28.361 AEST [24237] STATEMENT: create subscription
> sub51 connection 'port=7651' publication pub1 with
> (subscribe_local_only=true,copy_data=force);
> 2022-04-07 09:13:28.380 AEST [24237] ERROR: duplicate key value
> violates unique constraint "pg_subscription_rel_srrelid_srsubid_index"
> 2022-04-07 09:13:28.380 AEST [24237] DETAIL: Key (srrelid,
> srsubid)=(16384, 16394) already exists.
> 2022-04-07 09:13:28.380 AEST [24237] STATEMENT: create subscription
> sub52 connection 'port=7652' publication pub2 with
> (subscribe_local_only=true,copy_data=false);
> 2022-04-07 09:13:28.405 AEST [24237] ERROR: duplicate key value
> violates unique constraint "pg_subscription_rel_srrelid_srsubid_index"
> 2022-04-07 09:13:28.405 AEST [24237] DETAIL: Key (srrelid,
> srsubid)=(16384, 16395) already exists.
> 2022-04-07 09:13:28.405 AEST [24237] STATEMENT: create subscription
> sub53 connection 'port=7653' publication pub3 with
> (subscribe_local_only=true,copy_data=false);
> 2022-04-07 09:13:28.425 AEST [24237] ERROR: duplicate key value
> violates unique constraint "pg_subscription_rel_srrelid_srsubid_index"
> 2022-04-07 09:13:28.425 AEST [24237] DETAIL: Key (srrelid,
> srsubid)=(16384, 16396) already exists.
> 2022-04-07 09:13:28.425 AEST [24237] STATEMENT: create subscription
> sub54 connection 'port=7654' publication pub4 with
> (subscribe_local_only=true,copy_data=false);
> 2022-04-07 09:17:52.472 AEST [25852] ERROR: duplicate key value
> violates unique constraint "pg_subscription_rel_srrelid_srsubid_index"
> 2022-04-07 09:17:52.472 AEST [25852] DETAIL: Key (srrelid,
> srsubid)=(16384, 16397) already exists.
> 2022-04-07 09:17:52.472 AEST [25852] STATEMENT: create subscription
> sub51 connection 'port=7651' publication pub1;
>
> ~~~
>
> When I debugged this it seemed like each CREATE SUBSCRIPTION was
> trying to make a double-entry, because fetch_tables (your patch
> v6-0002 modified the SQL of this) was returning the same table 2x.
>
> (gdb) bt
> #0 errfinish (filename=0xbc1057 "nbtinsert.c", lineno=671,
> funcname=0xbc25e0 <__func__.15798> "_bt_check_unique") at elog.c:510
> #1 0x0000000000526d83 in _bt_check_unique (rel=0x7f654219c2a0,
> insertstate=0x7ffd9629ddd0, heapRel=0x7f65421b0e28,
> checkUnique=UNIQUE_CHECK_YES, is_unique=0x7ffd9629de01,
> speculativeToken=0x7ffd9629ddcc) at nbtinsert.c:664
> #2 0x0000000000526157 in _bt_doinsert (rel=0x7f654219c2a0,
> itup=0x19ea8e8, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false,
> heapRel=0x7f65421b0e28) at nbtinsert.c:208
> #3 0x000000000053450e in btinsert (rel=0x7f654219c2a0,
> values=0x7ffd9629df10, isnull=0x7ffd9629def0, ht_ctid=0x19ea894,
> heapRel=0x7f65421b0e28, checkUnique=UNIQUE_CHECK_YES,
> indexUnchanged=false, indexInfo=0x19dea80) at nbtree.c:201
> #4 0x00000000005213b6 in index_insert (indexRelation=0x7f654219c2a0,
> values=0x7ffd9629df10, isnull=0x7ffd9629def0, heap_t_ctid=0x19ea894,
> heapRelation=0x7f65421b0e28, checkUnique=UNIQUE_CHECK_YES,
> indexUnchanged=false, indexInfo=0x19dea80) at indexam.c:193
> #5 0x00000000005c81d5 in CatalogIndexInsert (indstate=0x19de540,
> heapTuple=0x19ea890) at indexing.c:158
> #6 0x00000000005c8325 in CatalogTupleInsert (heapRel=0x7f65421b0e28,
> tup=0x19ea890) at indexing.c:231
> #7 0x00000000005f0170 in AddSubscriptionRelState (subid=16400,
> relid=16384, state=105 'i', sublsn=0) at pg_subscription.c:315
> #8 0x00000000006d6fa5 in CreateSubscription (pstate=0x1942dc0,
> stmt=0x191f6a0, isTopLevel=true) at subscriptioncmds.c:767
>
> ~~
>
> Aside: All this was happening when I did not have enough logical
> replication workers configured. (There were WARNINGs in the logfile
> that I had not noticed.)
> When I fixed the configuration, all these other problems went away!
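
For anyone reproducing a multi-node setup like this, the "not enough
workers" symptom is governed by the settings below in postgresql.conf.
The parameter names are the real GUCs; the values are only an
illustrative sketch sized for a 5-node share group, not a
recommendation:

    wal_level = logical
    max_worker_processes = 32
    max_logical_replication_workers = 16
    max_replication_slots = 16
    max_wal_senders = 16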
>
> ~~
>
> So to summarize, I'm not sure if the fetch_tables still has some
> potential problem lurking or not, but I feel that the SQL in that
> function maybe needs a closer look to ensure it is always impossible
> to return the same table multiple times.

Earlier I used DISTINCT over srsubstate, but when multiple
subscriptions are being created there can be multiple entries for the
same relation because the srsubstate values can differ (e.g. 'i' and
'r'). I have changed the query to take the distinct values of srrelid
instead, so each table is returned only once. The attached v8 patch
has the changes for the same.
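The duplicate-row behaviour is easy to reproduce outside PostgreSQL.
The sketch below uses SQLite as a stand-in for the real catalog (table
and column names borrowed from pg_subscription_rel; the data mimics the
log above) to show why DISTINCT over a row that includes srsubstate can
return the same srrelid twice, while DISTINCT on srrelid alone cannot:

```python
import sqlite3

# Mock of pg_subscription_rel: one relation (srrelid 16384) tracked by
# two subscriptions, in different sync states ('i' = init, 'r' = ready).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE subscription_rel (srrelid INT, srsubid INT, srsubstate TEXT)"
)
con.executemany(
    "INSERT INTO subscription_rel VALUES (?, ?, ?)",
    [(16384, 16393, "i"), (16384, 16394, "r")],
)

# DISTINCT including srsubstate: the same relation comes back twice
# because the states differ -- the shape of the double-entry problem.
dup = con.execute(
    "SELECT DISTINCT srrelid, srsubstate FROM subscription_rel"
).fetchall()
print(dup)   # two rows for the same srrelid

# DISTINCT on srrelid alone: each relation is returned exactly once.
uniq = con.execute("SELECT DISTINCT srrelid FROM subscription_rel").fetchall()
print(uniq)  # [(16384,)]
```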

Regards,
Vignesh

Attachment Content-Type Size
v8-0002-Support-force-option-for-copy_data-check-and-thro.patch text/x-patch 43.6 KB
v8-0001-Skip-replication-of-non-local-data.patch text/x-patch 51.6 KB
