Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2022-07-27 10:26:10
Message-ID: CAGPVpCTKqBZxkAj4rTqwzdH034fUfwfUF1fHTTsNrt4+gkJJZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Amit,

I updated the patch in order to prevent the problems that might be caused
by using different replication slots for syncing a table.
As suggested in previous emails, replication slot names are stored in the
catalog. So slot names can be reached later and it is ensured
that same replication slot is used during tablesync step of a table.

With the current version of the patch:
-. "srrelslotname" column is introduced into pg_subscibtion_rel catalog. It
stores the slot name for tablesync

-. Tablesync worker logic is now as follows:
1. Tablesync worker is launched by apply worker for a table.
2. Worker generates a default replication slot name for itself. Slot name
includes subid and worker pid for tracking purposes.
3. If table has a slot name value in the catalog:

i. If the table state is DATASYNC, drop the replication slot from the
catalog and proceed tablesync with a new slot.

ii. If the table state is FINISHEDCOPY, use the replicaton slot from the
catalog, do not create a new slot.

4. Before worker moves to new table, drop any replication slot that are
retrieved from the catalog and used.
5. In case of no table left to sync, drop the replication slot of that sync
worker with worker pid if it exists. (It's possible that a sync worker do
not create a replication slot for itself but uses slots read from the
catalog in each iteration)

I think using worker_pid also has similar risks of mixing slots from
> different workers because after restart same worker_pid could be
> assigned to a totally different worker. Can we think of using a unique
> 64-bit number instead? This will be allocated when each workers
> started for the very first time and after that we can refer catalog to
> find it as suggested in the idea below.
>

I'm not sure how likely to have colliding pid's for different tablesync
workers in the same subscription.
Though ,having pid in slot name makes it easier to track which slot belongs
to which worker. That's why I kept using pid in slot names.
But I think it should be simple to switch to using a unique 64-bit number.
So I can remove pid's from slot names, if you think that it would be
better.

> We should use the same for the origin name as well.
>

I did not really change anything related to origin names. Origin names are
still the same and include relation id. What do you think would be an issue
with origin names in this patch?

> This sounds tricky. Why not first drop slot/origin and then detach it
> from pg_subscription_rel? On restarts, it is possible that we may
> error out after dropping the slot or origin but before updating the
> catalog entry but in such case we can ignore missing slot/origin and
> detach them from pg_subscription_rel. Also, if we use the unique
> number as suggested above, I think even if we don't remove it after
> the relation state is ready, it should be okay.
>

Right, I did not add an additional slot cleanup step. The patch now drops
the slot when we're done with it and then removes it from the catalog.

Thanks,
Melih

Attachment Content-Type Size
v2-0001-Reuse-Logical-Replication-Background-worker.patch application/octet-stream 49.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-07-27 10:51:54 Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.
Previous Message Michael Paquier 2022-07-27 10:25:47 Re: pgsql: Remove the restriction that the relmap must be 512 bytes.