Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Cc: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-22 12:51:35
Message-ID: CAGPVpCQ8m0Jcy2gip=qLis7XPyO2bkb1ZMUBWSXQcyv3tEtz2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, 16 Şub 2023 Per, 14:37 tarihinde şunu
yazdı:

> I see that setting originname in the catalog before actually creating it
> causes issues. My concern with setting originname when setting the state to
> FINISHEDCOPY is that if worker waits until FINISHEDCOPY to update
> slot/origin name but it fails somewhere before reaching FINISHEDCOPY and
> after creating slot/origin, then this new created slot/origin will be left
> behind. It wouldn't be possible to find and drop them since their names are
> not stored in the catalog. Eventually, this might also cause hitting
> the max_replication_slots limit in case of such failures between origin
> creation and FINISHEDCOPY.
>
> One fix I can think is to update the catalog right after creating a new
> origin. But this would also require commiting the current transaction to
> actually persist the originname. I guess this action of commiting the
> transaction in the middle of initial sync could hurt the copy process.
>

Here are more thoughts on this:
I still believe that updating originname when setting state
to FINISHEDCOPY is not a good idea since any failure
before FINISHEDCOPY prevent us to store originname in the catalog. If an
origin or slot is not in the catalog, it's not easily possible to find and
drop origins/slot that are left behind. And we definitely do not want to
keep unnecessary origins/slots since we would hit max_replication_slots
limit.
It's better to be safe and update origin/slot names when setting state
to DATASYNC. At this point, the worker must be sure that it writes correct
origin/slot names into the catalog.
Following part actually cleans up the catalog if a table is left behind in
DATASYNC state and its slot and origin cannot be used for sync.

ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, prev_slotname, true);
>
> StartTransactionCommand();
> /* Replication origin might still exist. Try to drop */
> replorigin_drop_by_name(originname, true, false);
>
> /*
> * Remove replication slot and origin name from the relation's
> * catalog record
> */
> UpdateSubscriptionRel(MyLogicalRepWorker->subid,
> MyLogicalRepWorker->relid,
> MyLogicalRepWorker->relstate,
> MyLogicalRepWorker->relstate_lsn,
> NULL,
> NULL);
>

The patch needs to refresh the origin name before it begins copying the
table. It will try to read from the catalog but won't find any slot/origin
since they are cleaned. Then, it will move on with the correct origin name
which is the one created/will be created for the current sync worker.

I tested refetching originname and seems like it fixes the errors you
reported.

Thanks,
--
Melih Mutlu
Microsoft

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Melih Mutlu 2023-02-22 12:56:00 Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Previous Message wangw.fnst@fujitsu.com 2023-02-22 12:40:07 RE: Fix the description of GUC "max_locks_per_transaction" and "max_pred_locks_per_transaction" in guc_table.c