Re: Copy function for logical replication slots

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Copy function for logical replication slots
Date: 2018-07-05 08:24:48
Message-ID: CAD21AoCKK2wh269J40kL2mwmAf2MgLYfPC5cy71NSOGVRu2Xeg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 5, 2018 at 4:47 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Tue, Jul 03, 2018 at 05:27:07PM +0900, Masahiko Sawada wrote:
>> On Tue, Jul 3, 2018 at 1:01 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>> One property which seems important to me is to make sure that the target
>>> slot has the same data as the origin slot once the caller knows that the
>>> copy has completed, and not that the target slot may perhaps have the
>>> same data as the origin while creating the target. Do you see the
>>> difference? Your patch does the latter, because it creates the new slot
>>> after releasing the lock it held on the origin, while I would think that
>>> the former is more important, aka keep the lock for the duration of the
>>> copy.
>>
>> Did you mean "the caller" is clients who call
>> pg_copy_logical_replication_slot function? If so, I think it's very
>> difficult to ensure the former because the origin slot can advance
>> during returning the result to the client. Also if we keep the lock on
>> the origin slot during the copy I think that one problem is that we
>> will own two replication slots at a time. But there are a lot of code
>> that premise a process holds at most one replication slot.
>
> When I mean the caller, I mean the client in charge of the slot
> creation. Anyway, this bit is really worrying me in your patch:
> - ReplicationSlotReserveWal();
> + /* Find start location to read WAL if not specified */
> + if (XLogRecPtrIsInvalid(start_lsn))
> + ReplicationSlotReserveWal();
> + else
> + {
> + SpinLockAcquire(&slot->mutex);
> + slot->data.restart_lsn = start_lsn;
> + SpinLockRelease(&slot->mutex);
> + }
> Your patch simply increases the window mentioned here in slot.c because
> there is no interlock between the checkpointer and the process creating
> the slot. Please see here:
> /*
> * The replication slot mechanism is used to prevent removal of required
> * WAL. As there is no interlock between this routine and checkpoints, WAL
> * segments could concurrently be removed when a now stale return value of
> * ReplicationSlotsComputeRequiredLSN() is used. In the unlikely case that
> * this happens we'll just retry.
> */
> So, slots created without a non-arbitrary start position have at least
> the will to restart if a checkpointer is running in parallel, but your
> patch increases the window used, and does not even retry nor does it
> fail to create the slot if the restart LSN points to a segment not here
> anymore. The trick I think here is that the slot copy should not cause
> checkpoint slowdowns, so more thoughts are needed here, and this is not
> really trivial.

Yes, you're right. To guarantee that restart LSN of copied slot is
available, it seems to me that it's better to copy new slot while
holding the origin slot as you mentioned before. Since the replication
slot creation code assumes that a process creating a new slot doesn't
have any slots we should save origin slot temporary and create new
one, and then restore it. It might be a bit tricky but would work
fine.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Srinivas Karthik V 2018-07-05 08:48:49 Re: Bulk Insert into PostgreSQL
Previous Message Pavel Stehule 2018-07-05 08:18:50 Re: Unexpected behavior of DROP VIEW/TABLE IF EXISTS