RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
Cc: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-07 02:48:47
Message-ID: OSZPR01MB6310E85DDF68E9431338A323FDDB9@OSZPR01MB6310.jpnprd01.prod.outlook.com
Lists: pgsql-hackers


On Thu, Feb 2, 2023 11:48 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
> wrote:
> >
> > Hi Shveta,
> >
> > shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote on Wed, Feb 1,
> > 2023 at 15:01:
> >>
> >> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
> >> wrote:
> >> 2) I found a crash in the previous patch (v9), but have not tested it
> >> on the latest yet. The crash happens when all the replication slots are
> >> consumed and we are trying to create a new one. I tweaked the settings
> >> as below so that it can be reproduced easily:
> >> max_sync_workers_per_subscription=3
> >> max_replication_slots = 2
> >> and then ran the test case shared by you. I think there is some memory
> >> corruption happening. (I tested in debug mode; have not tried release
> >> mode.) I put in some traces to identify the root cause.
> >> I observed that worker_1 keeps moving from one table to another
> >> correctly, but at some point it gets corrupted, i.e. the origin-name
> >> obtained for it is wrong; it tries to advance that origin, and since it
> >> does not exist, it asserts and then something else crashes.
> >> From the log (new trace lines added by me are prefixed with "shveta";
> >> I also tweaked the code per my comment 1 for clarity on worker-id):
> >>
> >> From the below traces, it is clear that worker_1 was moving from one
> >> relation to another, always getting the correct origin 'pg_16688_1', but
> >> at the end it got 'pg_16688_49', which does not exist. The second part of
> >> the trace shows who updated 'pg_16688_49': it was worker_49, which
> >> did not even get a chance to create this origin because max_rep_slot
> >> was reached.
> >
> >
> > Thanks for investigating this error. I think it's the same error as the
> > one Shi reported earlier. [1]
> > I couldn't reproduce it yet but will apply your tweaks and try again.
> > Looking into this.
> >
> > [1] https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com
> >
>
> Hi Melih,
> I think I am able to identify the root cause. It is not memory
> corruption; the problem is that origin-names are stored in the system
> catalog, mapped to a particular relation-id, before the origins are even
> created.
>
> After adding a few more logs:
>
> [4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 constructed
> originname :pg_16684_49, relid:16540
> [4227] LOG: shveta- LogicalRepSyncTableStart- worker_49
> updated-origin in system catalog:pg_16684_49,
> slot:pg_16684_sync_49_7195149685251088378, relid:16540
> [4227] LOG: shveta- LogicalRepSyncTableStart- worker_49
> get-origin-id2 originid:0, originname:pg_16684_49
> [4227] ERROR: could not create replication slot
> "pg_16684_sync_49_7195149685251088378": ERROR: all replication slots
> are in use
> HINT: Free one or increase max_replication_slots.
>
>
> [4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 constructed
> originname :pg_16684_49, relid:16540
> [4428] LOG: could not drop replication slot
> "pg_16684_sync_49_7195149685251088378" on publisher: ERROR:
> replication slot "pg_16684_sync_49_7195149685251088378" does not
> exist
> [4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 drop-origin
> originname:pg_16684_49
> [4428] LOG: shveta- LogicalRepSyncTableStart- worker_148
> updated-origin:pg_16684_49,
> slot:pg_16684_sync_148_7195149685251088378, relid:16540
>
> So from the above, worker_49 came and picked up relid:16540, constructed
> the origin-name and slot-name, and updated them in the system catalog.
> It then tried to actually create that slot and origin, but since the
> max-slot count was reached, it failed and exited without cleaning up the
> system catalog entries for that relid.
>
> Then worker_148 came and also picked up table 16540, since it was not
> completed by the previous worker. This time it found the origin and slot
> entries present in the system catalog against this relid, so it picked up
> the same names and started processing; but since those objects do not
> actually exist, it asserted while advancing the origin.
>
> The assert is only reproduced when an already running worker (say
> worker_1) that has 'created=true' set gets to sync the relid for which
> a previously failed worker has updated the origin-name without
> creating it. In such a case worker_1 (with created=true) will try to
> reuse the origin, and thus will try to advance it, and will end up
> asserting. That is why you might not always see the assertion: the
> 'created' flag must already be true for that worker, so that it goes on
> to reuse the origin-name updated by the previous worker.
>
> So to fix this, I think either we update the origin and slot entries in
> the system catalog only after their creation has succeeded, or we clean
> up the system catalog in case of failure. What do you think?
>

I think the first way seems better.

I reproduced the problem I reported before with the latest patches (v7-0001,
v10-0002) and looked into it. It is caused by a similar reason. Here is some
analysis for problem #6 that I reported in [1].

First, a tablesync worker (worker-1) started for "tbl1"; its originname is
"pg_16398_1". It exited because of a unique constraint violation. In
LogicalRepSyncTableStart(), the originname in pg_subscription_rel is updated
when updating the table state to DATASYNC, while the origin itself is created
when updating the table state to FINISHEDCOPY. So when the worker exited with
state DATASYNC, the origin had not been created, but the originname had
already been updated in pg_subscription_rel.
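
To make that window concrete, here is a simplified sketch of the ordering
described above (not the literal patch code; UpdateSubscriptionRelStateExt()
is a hypothetical stand-in for however the patch persists the originname and
slotname in pg_subscription_rel, the rest are existing APIs):

    char        originname[NAMEDATALEN];
    XLogRecPtr  origin_startpos = InvalidXLogRecPtr;

    ReplicationOriginNameForLogicalRep(subid, relid, originname,
                                       sizeof(originname));

    /* State -> DATASYNC: the originname is persisted in the catalog here... */
    UpdateSubscriptionRelStateExt(subid, relid, SUBREL_STATE_DATASYNC,
                                  InvalidXLogRecPtr, slotname, originname);

    /*
     * ...but the slot and origin are only created afterwards and can still
     * ERROR out (e.g. "all replication slots are in use"). If the worker
     * exits here, the stale originname stays behind in pg_subscription_rel.
     */
    walrcv_create_slot(LogRepWorkerWalRcvConn, slotname,
                       false /* permanent */, false /* two_phase */,
                       CRS_USE_SNAPSHOT, &origin_startpos);
    replorigin_create(originname);

    /* State -> FINISHEDCOPY: only now does the origin actually exist. */
    UpdateSubscriptionRelState(subid, relid, SUBREL_STATE_FINISHEDCOPY,
                               origin_startpos);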

Then a tablesync worker (worker-2) started for "tbl2"; its originname is
"pg_16398_2". After the tablesync of "tbl2" finished, this worker moved on to
sync table "tbl1". In LogicalRepSyncTableStart(), it got the originname of
"tbl1" - "pg_16398_1" - by calling ReplicationOriginNameForLogicalRep(), and
tried to drop that origin (although it had never actually been created). After
that, it called replorigin_by_name() to get the originid for the name
"pg_16398_1", and the result was InvalidOid. The origin is not created in this
case because the sync worker had already created a replication slot (when it
synced tbl2), so the originid stayed invalid, which caused an assertion
failure when calling replorigin_advance().
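
Put as (simplified) code, worker-2's path looks roughly like this, using the
existing replorigin APIs (a sketch of the sequence described above, not the
literal patch code; remote_lsn stands in for the LSN being applied):

    char        originname[NAMEDATALEN];
    RepOriginId originid;

    /* Reconstructs "pg_16398_1" from the entry left behind by worker-1. */
    ReplicationOriginNameForLogicalRep(subid, relid, originname,
                                       sizeof(originname));

    /* Effectively a no-op: worker-1 never actually created this origin. */
    replorigin_drop_by_name(originname, true /* missing_ok */,
                            false /* nowait */);

    /*
     * Because this worker already created its slot while syncing tbl2, it
     * skips the create path, so the lookup returns InvalidOid...
     */
    originid = replorigin_by_name(originname, true /* missing_ok */);

    /* ...and advancing an invalid origin id trips the assertion. */
    replorigin_advance(originid, remote_lsn, InvalidXLogRecPtr,
                       true /* go_backward */, true /* wal_log */);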

It seems we don't need to drop the previous origin in worker-2, because that
origin was never created by worker-1. I think one way to fix this is to not
update the originname in pg_subscription_rel when setting the state to
DATASYNC, and to only do so when setting the state to FINISHEDCOPY. That way,
the originname in pg_subscription_rel is set at the same time the origin is
created. (The slotname, by contrast, does seem to need updating when setting
the state to DATASYNC, because the previous slot might have been created
successfully, and a later worker needs to be able to get the previous
slotname and drop it.)
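
A rough sketch of that ordering (illustrative only, again using the
hypothetical UpdateSubscriptionRelStateExt() for the patch's extended
catalog update):

    /*
     * State -> DATASYNC: record the slotname only; leave originname NULL so
     * a later worker never sees a name for an origin that was never created.
     */
    UpdateSubscriptionRelStateExt(subid, relid, SUBREL_STATE_DATASYNC,
                                  InvalidXLogRecPtr, slotname, NULL);

    walrcv_create_slot(LogRepWorkerWalRcvConn, slotname,
                       false /* permanent */, false /* two_phase */,
                       CRS_USE_SNAPSHOT, &origin_startpos);
    replorigin_create(originname);

    /*
     * State -> FINISHEDCOPY: the origin now exists, so it is safe to record
     * its name alongside the state.
     */
    UpdateSubscriptionRelStateExt(subid, relid, SUBREL_STATE_FINISHEDCOPY,
                                  origin_startpos, slotname, originname);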

[1] https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Regards,
Shi yu
