Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
Cc: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shiy(dot)fnst(at)fujitsu(dot)com, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-02 03:48:03
Message-ID: CAJpy0uB9M9v0rO-o=SOvRxj+thkF7zjiuU7w-XC8iK0SFDg57w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
>
> Hi Shveta,
>
> shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote on Wed, 1 Feb 2023 at 15:01:
>>
>> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
>> 2) I found a crash in the previous patch (v9), but have not tested it
>> on the latest yet. The crash happens when all the replication slots are
>> consumed and we try to create a new one. I tweaked the settings as
>> below so that it can be reproduced easily:
>> max_sync_workers_per_subscription=3
>> max_replication_slots = 2
>> and then ran the test case you shared. I think there is some memory
>> corruption happening. (I tested in debug mode, have not tried release
>> mode.) I added some traces to identify the root cause. I observed that
>> worker_1 keeps moving from one table to another correctly, but at some
>> point it gets corrupted, i.e. the origin name obtained for it is wrong;
>> it tries to advance that origin, and since the origin does not exist,
>> it asserts and then something else crashes. From the log (new trace
>> lines added by me are prefixed with "shveta"; I also tweaked the code
>> per my comment 1 to have clarity on worker-id):
>>
>> From the traces below, it is clear that worker_1 was moving from one
>> relation to another, always getting the correct origin 'pg_16688_1',
>> but at the end it got 'pg_16688_49', which does not exist. The second
>> part of the trace shows who updated 'pg_16688_49': it was worker_49,
>> which never even got a chance to create this origin because
>> max_replication_slots was reached.
>
>
> Thanks for investigating this error. I think it's the same error as the one Shi reported earlier. [1]
> I haven't been able to reproduce it yet, but I will apply your tweaks and try again.
> Looking into this.
>
> [1] https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com
>

Hi Melih,
I think I was able to identify the root cause. It is not memory
corruption, but the fact that origin names are stored in the system
catalog, mapped to a particular relation id, before those origins are
even created.

After adding a few more logs:

[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 constructed
originname :pg_16684_49, relid:16540
[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49
updated-origin in system catalog:pg_16684_49,
slot:pg_16684_sync_49_7195149685251088378, relid:16540
[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49
get-origin-id2 originid:0, originname:pg_16684_49
[4227] ERROR: could not create replication slot
"pg_16684_sync_49_7195149685251088378": ERROR: all replication slots
are in use
HINT: Free one or increase max_replication_slots.

[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 constructed
originname :pg_16684_49, relid:16540
[4428] LOG: could not drop replication slot
"pg_16684_sync_49_7195149685251088378" on publisher: ERROR:
replication slot "pg_16684_sync_49_7195149685251088378" does not
exist
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 drop-origin
originname:pg_16684_49
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148
updated-origin:pg_16684_49,
slot:pg_16684_sync_148_7195149685251088378, relid:16540

So from the above, worker_49 picked up relid 16540, constructed the
origin name and slot name, and updated them in the system catalog. It
then tried to actually create that slot and origin, but since the max
slot count had been reached, it failed and exited without cleaning up
the system catalog entries for that relid.
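
To make the ordering concrete, here is a rough sketch of the sequence as
I understand it (variable and helper names below, in particular the
catalog-update helper, are made up for illustration and are not the
exact patch code):

    /* Simplified sketch of the current ordering in the sync start path */
    snprintf(originname, sizeof(originname), "pg_%u_%u", subid, worker_slot);
    snprintf(slotname, sizeof(slotname), "pg_%u_sync_%u_" UINT64_FORMAT,
             subid, worker_slot, GetSystemIdentifier());

    /* (1) names are recorded in the catalog for this relid first ... */
    UpdateSubscriptionRelOriginAndSlot(subid, relid, originname, slotname);

    /*
     * (2) ... and only afterwards are the slot and the origin actually
     * created.  If slot creation fails here ("all replication slots are
     * in use"), the worker errors out and exits, but the entry written
     * in (1) stays behind, pointing at an origin and a slot that were
     * never created.
     */
    walrcv_create_slot(LogRepWorkerWalRcvConn, slotname,
                       false /* temporary */ , false /* two_phase */ ,
                       CRS_USE_SNAPSHOT, origin_startpos);
    replorigin_create(originname);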

Then worker_148 also picked up table 16540, since it had not been
completed by the previous worker. This time it found the origin and
slot entries already present in the system catalog for this relid, so
it picked up the same names and started processing; but since those
objects do not actually exist, it asserted while advancing the origin.

The assert is only reproduced when an already running worker (say
worker_1), which has 'created=true' set, gets to sync a relid for which
a previously failed worker had updated the origin name without actually
creating the origin. In such a case worker_1 (with created=true) will
try to reuse the origin and thus will try to advance it, and will end
up asserting. That is why you might not always see the assertion: it
needs a worker whose 'created' flag is already true to pick up the
origin name left behind by the failed worker.
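
Roughly, the reusing worker then does something like this (simplified;
I am assuming the Assert that fires is the InvalidRepOriginId check at
the top of replorigin_advance()):

    originid = replorigin_by_name(originname, true /* missing_ok */ );

    /*
     * The failed worker stored this originname without ever creating the
     * origin, so originid comes back as InvalidRepOriginId here, and the
     * Assert on the origin id inside replorigin_advance() fires on an
     * assert-enabled build.
     */
    replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
                       false /* go backward */ , true /* WAL log */ );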

So to fix this, I think we should either update the origin and slot
entries in the system catalog only after their creation has succeeded,
or clean up the system catalog entries in case of failure. What do you
think?
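
Just to illustrate the two options (the catalog helpers here are again
made-up names, only to show the idea):

    /*
     * Option (a): write the names to the catalog only after both objects
     * have actually been created.
     */
    walrcv_create_slot(LogRepWorkerWalRcvConn, slotname,
                       false, false, CRS_USE_SNAPSHOT, origin_startpos);
    replorigin_create(originname);
    UpdateSubscriptionRelOriginAndSlot(subid, relid, originname, slotname);

    /*
     * Option (b): keep the current ordering but clear the catalog entry
     * on failure, so that the next worker picking up this relation starts
     * from a clean state.
     */
    PG_TRY();
    {
        walrcv_create_slot(LogRepWorkerWalRcvConn, slotname,
                           false, false, CRS_USE_SNAPSHOT, origin_startpos);
        replorigin_create(originname);
    }
    PG_CATCH();
    {
        ClearSubscriptionRelOriginAndSlot(subid, relid);
        PG_RE_THROW();
    }
    PG_END_TRY();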

thanks
Shveta
