Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Vyacheslav Makarov <v(dot)makarov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()
Date: 2020-06-19 18:57:16
Message-ID: 8dec8b6f-d3c0-6323-7653-91da75aabdb9@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/06/19 23:20, Alexey Kondratov wrote:
> On 2020-06-19 03:59, Michael Paquier wrote:
>> On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote:
>>> If the WAL segment for the specified restart_lsn (STOP_LSN of the backup)
>>> exists, then the function will create a physical replication slot and will
>>> keep all the WAL segments required by the replica to catch up with the
>>> primary. Otherwise, it returns error, which means that the required WAL
>>> segments have been already utilised, so we do need to take a new backup.
>>> Without passing this newly added parameter
>>> pg_create_physical_replication_slot() works as before.
>>>
>>> What do you think about this?

Currently pg_create_physical_replication_slot() and CREATE_REPLICATION_SLOT
replication command seem to be "idential". So if we add new option into one,
we should add it also into another?

What happen if future LSN is specified in restart_lsn? With the patch,
in this case, if the segment at that LSN exists (e.g., because it's recycled
one), the slot seems to be successfully created. However if the LSN is
far future and the segment doesn't exist, the creation of slot seems to fail.
This behavior looks fragile and confusing. We should accept future LSN
whether its segment currently exists or not?

+ if (!RecoveryInProgress() && !SlotIsLogical(MyReplicationSlot))

With the patch, the given restart_lsn seems to be ignored during recovery.
Why?

>>
>> I think that this was discussed in the past (perhaps one of the
>> threads related to WAL advancing actually?),
>>
>
> I have searched through the archives a bit and found one thread related to slots advancing [1]. It was dedicated to a problem of advancing slots which do not reserve WAL yet, if I get it correctly. Although it is somehow related to the topic, it was a slightly different issue, IMO.
>
>>
>> and this stuff is full of
>> holes when it comes to think about error handling with checkpoints
>> running in parallel, potentially doing recycling of segments you would
>> expect to be around based on your input value for restart_lsn *while*
>> pg_create_physical_replication_slot() is still running and
>> manipulating the on-disk slot information. I suspect that this also
>> breaks a couple of assumptions behind concurrent calls of the minimum
>> LSN calculated across slots when a caller sees fit to recompute the
>> thresholds (WAL senders mainly here, depending on the replication
>> activity).
>>
>
> These are the right concerns, but all of them should be applicable to the pg_create_physical_replication_slot() + immediately_reserve == true in the same way, doesn't it? I think so, since in that case we are doing a pretty similar thing — trying to reserve some WAL segment that may be concurrently deleted.
>
> And this is exactly the reason why ReplicationSlotReserveWal() does it in several steps in a loop:
>
> 1. Creates a slot with some restart_lsn.
> 2. Does ReplicationSlotsComputeRequiredLSN() to prevent removal of the WAL segment with this restart_lsn.
> 3. Checks that required WAL segment is still there.
> 4. Repeat if this attempt to prevent WAL removal has failed.

What happens if concurrent checkpoint decides to remove the segment
at restart_lsn before #2 and then actually removes it after #3?
The replication slot is successfully created with the given restart_lsn,
but the reserved segment has already been removed?

> I guess that the only difference in the case of proposed scenario is that we do not have a chance for step 4, since we do need some specific restart_lsn, not any recent restart_lsn, i.e. in this case we have to:
>
> 1. Create a slot with restart_lsn specified by user.
> 2. Do ReplicationSlotsComputeRequiredLSN() to prevent WAL removal.
> 3. Check that required WAL segment is still there and report ERROR to the user if it is not.

The similar situation as the above may happen.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-06-19 19:14:45 Re: Missing HashAgg EXPLAIN ANALYZE details for parallel plans
Previous Message Robert Haas 2020-06-19 18:11:25 Re: Creating a function for exposing memory usage of backend process