Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Cc: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-08 13:48:51
Message-ID: CAJpy0uDCUw8P+RorkNcsC77fnKXLzkdQUGKJu+s_ewBWTnQ92Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 7, 2023 at 8:18 AM shiy(dot)fnst(at)fujitsu(dot)com
<shiy(dot)fnst(at)fujitsu(dot)com> wrote:
>
>
> On Thu, Feb 2, 2023 11:48 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> >
> > So to fix this, I think either we update origin and slot entries in
> > the system catalog after the creation has passed or we clean-up the
> > system catalog in case of failure. What do you think?
> >
>
> I think the first way seems better.

Yes, I agree.

>
> I reproduced the problem I reported before with latest patch (v7-0001,
> v10-0002), and looked into this problem. It is caused by a similar reason. Here
> is some analysis for the problem I reported [1].#6.
>
> First, a tablesync worker (worker-1) started for "tbl1", its originname is
> "pg_16398_1". And it exited because of unique constraint. In
> LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated when
> updating table state to DATASYNC, and the origin is created when updating table
> state to FINISHEDCOPY. So when it exited with state DATASYNC , the origin is not
> created but the originname has been updated in pg_subscription_rel.
>
> Then a tablesync worker (worker-2) started for "tbl2", its originname is
> "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to sync
> table "tbl1". In LogicalRepSyncTableStart(), it got the originname of "tbl1" -
> "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and tried to drop
> the origin (although it is not actually created before). After that, it called
> replorigin_by_name to get the originid whose name is "pg_16398_1" and the result
> is InvalidOid. Origin won't be created in this case because the sync worker has
> created a replication slot (when it synced tbl2), so the originid was still
> invalid and it caused an assertion failure when calling replorigin_advance().
>
> It seems we don't need to drop previous origin in worker-2 because the previous
> origin was not created in worker-1. I think one way to fix it is to not update
> originname of pg_subscription_rel when setting state to DATASYNC, and only do
> that when setting state to FINISHEDCOPY. If so, the originname in
> pg_subscription_rel will be set at the same time the origin is created.

+1. Update of system-catalog needs to be done carefully and only when
origin is created.

thanks
Shveta

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2023-02-08 13:54:26 Re: Hi i am Intrested to contribute
Previous Message James Coleman 2023-02-08 13:38:27 Re: Parallelize correlated subqueries that execute within each worker