Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Cc: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-16 11:37:19
Message-ID: CAGPVpCQZZ5SPtVYSGSBTQEgp0eWV9qEtHcLNAywkd-CJ7M-Umg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Shveta and Shi,

Thanks for your investigations.

shveta malik <shveta(dot)malik(at)gmail(dot)com>, 8 Şub 2023 Çar, 16:49 tarihinde şunu
yazdı:

> On Tue, Feb 7, 2023 at 8:18 AM shiy(dot)fnst(at)fujitsu(dot)com
> <shiy(dot)fnst(at)fujitsu(dot)com> wrote:
> > I reproduced the problem I reported before with latest patch (v7-0001,
> > v10-0002), and looked into this problem. It is caused by a similar
> reason. Here
> > is some analysis for the problem I reported [1].#6.
> >
> > First, a tablesync worker (worker-1) started for "tbl1", its originname
> is
> > "pg_16398_1". And it exited because of unique constraint. In
> > LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated
> when
> > updating table state to DATASYNC, and the origin is created when
> updating table
> > state to FINISHEDCOPY. So when it exited with state DATASYNC , the
> origin is not
> > created but the originname has been updated in pg_subscription_rel.
> >
> > Then a tablesync worker (worker-2) started for "tbl2", its originname is
> > "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to
> sync
> > table "tbl1". In LogicalRepSyncTableStart(), it got the originname of
> "tbl1" -
> > "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and tried
> to drop
> > the origin (although it is not actually created before). After that, it
> called
> > replorigin_by_name to get the originid whose name is "pg_16398_1" and
> the result
> > is InvalidOid. Origin won't be created in this case because the sync
> worker has
> > created a replication slot (when it synced tbl2), so the originid was
> still
> > invalid and it caused an assertion failure when calling
> replorigin_advance().
> >
> > It seems we don't need to drop previous origin in worker-2 because the
> previous
> > origin was not created in worker-1. I think one way to fix it is to not
> update
> > originname of pg_subscription_rel when setting state to DATASYNC, and
> only do
> > that when setting state to FINISHEDCOPY. If so, the originname in
> > pg_subscription_rel will be set at the same time the origin is created.
>
> +1. Update of system-catalog needs to be done carefully and only when
> origin is created.
>

I see that setting originname in the catalog before actually creating it
causes issues. My concern with setting originname when setting the state to
FINISHEDCOPY is that if worker waits until FINISHEDCOPY to update
slot/origin name but it fails somewhere before reaching FINISHEDCOPY and
after creating slot/origin, then this new created slot/origin will be left
behind. It wouldn't be possible to find and drop them since their names are
not stored in the catalog. Eventually, this might also cause hitting
the max_replication_slots limit in case of such failures between origin
creation and FINISHEDCOPY.

One fix I can think is to update the catalog right after creating a new
origin. But this would also require commiting the current transaction to
actually persist the originname. I guess this action of commiting the
transaction in the middle of initial sync could hurt the copy process.

What do you think?

Also; working on an updated patch to address your other comments. Thanks
again.

Best,
--
Melih Mutlu
Microsoft

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message gkokolatos 2023-02-16 11:38:00 RE: Add LZ4 compression in pg_dump
Previous Message Daniel Gustafsson 2023-02-16 11:34:23 Re: Missing default value of createrole_self_grant in document