From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
Cc: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-02-01 12:00:50
Message-ID: CAJpy0uBj2c6AyXPq9ZDvG4Tx8f_Jwb6WP0P9rVgpjDq+XLw4+g@mail.gmail.com

On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
>
> Hi,
>
> Please see attached patches for the below changes.
>

> Thanks for reviewing,
> --
> Melih Mutlu
> Microsoft

Hello Melih,
Thank you for making the changes.

I have a few more comments:
1)
src/backend/replication/logical/worker.c: (errmsg("logical replication
table synchronization worker for subscription \"%s\", table \"%s\" has
started",
src/backend/replication/logical/worker.c: (errmsg("logical replication
table synchronization worker for subscription \"%s\" has moved to sync
table \"%s\".",
src/backend/replication/logical/tablesync.c: (errmsg("logical
replication table synchronization worker for subscription \"%s\",
table \"%s\" has finished",

In the above, can we have rep_slot_id in the trace messages as well?
Otherwise it is not clear which worker moved to the next relation. We
could have something like (see the sketch at the end of this comment):
logical replication table synchronization worker_%d for subscription
\"%s\" has moved to sync table \"%s\", with rep_slot_id passed for the
worker_%d placeholder.

Overall we need to improve the tracing. I will give my suggestions on
this later (in detail).
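
To illustrate, here is a minimal sketch of the kind of message I have
in mind, assuming the patch keeps the slot id in
MyLogicalRepWorker->rep_slot_id (the exact field name may be different
in your patch):

    ereport(LOG,
            (errmsg("logical replication table synchronization worker_%d for subscription \"%s\" has moved to sync table \"%s\"",
                    MyLogicalRepWorker->rep_slot_id,    /* assumed field added by the patch */
                    MySubscription->name,
                    get_rel_name(MyLogicalRepWorker->relid))));

MySubscription, MyLogicalRepWorker and get_rel_name() are already used
by the existing messages in worker.c/tablesync.c, so no new includes
should be needed.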

2) I found a crash in the previous patch (v9); I have not tested the
latest version yet. The crash happens when all the replication slots
are consumed and we try to create a new one. I tweaked the settings as
below so that it can be reproduced easily:
max_sync_workers_per_subscription = 3
max_replication_slots = 2
and then ran the test case you shared. I think there is some memory
corruption happening. (I tested in debug mode; I have not tried release
mode.) I added some traces to identify the root cause and observed that
worker_1 keeps moving from one table to another correctly, but at some
point it gets corrupted, i.e. the origin name obtained for it is wrong;
it then tries to advance that origin, and since the origin does not
exist, the assert fires and something else crashes afterwards.
From the log (new trace lines added by me are prefixed with "shveta";
I also tweaked the code as per my comment 1 above for clarity on the
worker id):

From the traces below, it is clear that worker_1 was moving from one
relation to another, always getting the correct origin 'pg_16688_1',
but at the end it got 'pg_16688_49', which does not exist. The second
part of the trace shows who updated 'pg_16688_49': it was worker_49,
which did not even get a chance to create this origin because
max_replication_slots was reached.
==============================
..............
2023-02-01 16:01:38.041 IST [9243] LOG: logical replication table
synchronization worker_1 for subscription "mysub", table "table_93"
has finished
2023-02-01 16:01:38.047 IST [9243] LOG: logical replication table
synchronization worker_1 for subscription "mysub" has moved to sync
table "table_98".
2023-02-01 16:01:38.055 IST [9243] LOG: shveta-
LogicalRepSyncTableStart- worker_1 get-origin-id2 originid:2,
originname:pg_16688_1
2023-02-01 16:01:38.055 IST [9243] LOG: shveta-
LogicalRepSyncTableStart- Worker_1 reusing
slot:pg_16688_sync_1_7195132648087016333, originid:2,
originname:pg_16688_1
2023-02-01 16:01:38.094 IST [9243] LOG: shveta-
LogicalRepSyncTableStart- worker_1 updated-origin2
originname:pg_16688_1
2023-02-01 16:01:38.096 IST [9243] LOG: logical replication table
synchronization worker_1 for subscription "mysub", table "table_98"
has finished
2023-02-01 16:01:38.096 IST [9243] LOG: logical replication table
synchronization worker_1 for subscription "mysub" has moved to sync
table "table_60".
2023-02-01 16:01:38.099 IST [9243] LOG: shveta-
LogicalRepSyncTableStart- worker_1 get-origin originid:0,
originname:pg_16688_49
2023-02-01 16:01:38.099 IST [9243] LOG: could not drop replication
slot "pg_16688_sync_49_7195132648087016333" on publisher: ERROR:
replication slot "pg_16688_sync_49_7195132648087016333" does not exist
2023-02-01 16:01:38.103 IST [9243] LOG: shveta-
LogicalRepSyncTableStart- Worker_1 reusing
slot:pg_16688_sync_1_7195132648087016333, originid:0,
originname:pg_16688_49
TRAP: failed Assert("node != InvalidRepOriginId"), File: "origin.c",
Line: 892, PID: 9243
postgres: logical replication worker for subscription 16688 sync 16384
(ExceptionalCondition+0xbb)[0x56019194d3b7]
postgres: logical replication worker for subscription 16688 sync 16384
(replorigin_advance+0x6d)[0x5601916b53d4]
postgres: logical replication worker for subscription 16688 sync 16384
(LogicalRepSyncTableStart+0xbb4)[0x5601916cb648]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x5d25e2)[0x5601916d35e2]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x5d282c)[0x5601916d382c]
postgres: logical replication worker for subscription 16688 sync 16384
(ApplyWorkerMain+0x17b)[0x5601916d4078]
postgres: logical replication worker for subscription 16688 sync 16384
(StartBackgroundWorker+0x248)[0x56019167f943]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x589ad3)[0x56019168aad3]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x589ee3)[0x56019168aee3]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x588d8d)[0x560191689d8d]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x584604)[0x560191685604]
postgres: logical replication worker for subscription 16688 sync 16384
(PostmasterMain+0x14f1)[0x560191684f1e]
postgres: logical replication worker for subscription 16688 sync 16384
(+0x446e05)[0x560191547e05]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f048cc58d90]
==============================

'pg_16688_49' updated by worker_49:
2023-02-01 16:01:37.083 IST [9348] LOG: shveta-
LogicalRepSyncTableStart- worker_49 get-origin originid:0,
originname:pg_16688_49
2023-02-01 16:01:37.083 IST [9348] LOG: shveta-
LogicalRepSyncTableStart- worker_49 updated-origin
originname:pg_16688_49
2023-02-01 16:01:37.083 IST [9348] LOG: shveta-
LogicalRepSyncTableStart- worker_49 get-origin-id2 originid:0,
originname:pg_16688_49
2023-02-01 16:01:37.083 IST [9348] ERROR: could not create
replication slot "pg_16688_sync_49_7195132648087016333": ERROR: all
replication slots are in use
HINT: Free one or increase max_replication_slots.
==============================
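
This is not a fix for the root cause (worker_49's failed attempt
apparently leaves behind an origin name that worker_1 later picks up),
but perhaps we should turn the assert into a clean error until that is
sorted out. A rough sketch of what I mean in LogicalRepSyncTableStart();
originname and origin_startpos are guesses at the patch's local
variables:

    /* Sketch only: refuse to advance an origin that does not exist. */
    originid = replorigin_by_name(originname, true /* missing_ok */);
    if (originid == InvalidRepOriginId)
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_OBJECT),
                 errmsg("replication origin \"%s\" does not exist for the reused tablesync slot",
                        originname)));
    replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
                       true /* go backward */, true /* wal log */);

That would fail the sync attempt cleanly instead of hitting the assert,
and the error would make it obvious in the log which origin name is
bogus.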

The rest of the workers keep exiting and getting recreated since they
could not create a slot. The last_used_id kept increasing on every
restart of the subscriber until we killed it; in my case it reached
2k+.

thanks
Shveta
