Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shiy(dot)fnst(at)fujitsu(dot)com, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-01 12:12:19
Message-ID: CAGPVpCSNExJ3tgK8QgnNUb1QVGvJNprW7LWJ-8fWfGKgtcittw@mail.gmail.com
Lists: pgsql-hackers

Hi Shveta,

On Wed, 1 Feb 2023 at 15:01, shveta malik <shveta(dot)malik(at)gmail(dot)com>
wrote:

> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
> 2) I found a crash in the previous patch (v9), but have not tested it
> on the latest yet. The crash happens when all the replication slots are
> consumed and we try to create a new one. I tweaked the settings as
> below so that it can be reproduced easily:
> max_sync_workers_per_subscription=3
> max_replication_slots = 2
> and then ran the test case you shared. I think there is some memory
> corruption happening. (I tested in debug mode and have not tried
> release mode.) I added some traces to identify the root cause.
> I observed that worker_1 keeps moving from one table to another
> correctly, but at some point it gets corrupted, i.e. the origin name
> obtained for it is wrong. It tries to advance that origin, and since that
> origin does not exist, it asserts and then something else crashes.
> From the log: (new trace lines added by me are prefixed by shveta; I also
> tweaked the code to fix my comment 1, for clarity on the worker id).
>
> From the traces below, it is clear that worker_1 was moving from one
> relation to another, always getting the correct origin 'pg_16688_1', but
> at the end it got 'pg_16688_49', which does not exist. The second part of
> the trace shows who updated 'pg_16688_49': it was done by worker_49, which
> did not even get a chance to create this origin because the
> max_replication_slots limit was reached.
>

Thanks for investigating this error. I think it's the same error as the one
Shi reported earlier. [1]
I haven't been able to reproduce it yet, but I will apply your tweaks and
try again. Looking into this.

[1]
https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Thanks,
--
Melih Mutlu
Microsoft
