Re: Physical replication slot advance is not persistent

From: Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Dimitri Fontaine <dim(at)tapoueh(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, simon(at)2ndquadrant(dot)com
Subject: Re: Physical replication slot advance is not persistent
Date: 2020-06-09 18:01:13
Message-ID: 82163ef4a380c5255e5004659d02559e@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020-06-09 20:19, Andres Freund wrote:
> Hi,
>
> On 2020-01-28 17:01:14 +0900, Michael Paquier wrote:
>> So attached is an updated patch which addresses the problem just by
>> marking a physical slot as dirty if any advancing is done. Some
>> documentation is added about the fact that an advance is persistent
>> only at the follow-up checkpoint. And the tests are fixed to not use
>> a fake LSN but instead advance to the latest LSN position produced.
>
>>
>> - /* Update the on disk state when lsn was updated. */
>> - if (XLogRecPtrIsInvalid(endlsn))
>> - {
>> - ReplicationSlotMarkDirty();
>> - ReplicationSlotsComputeRequiredXmin(false);
>> - ReplicationSlotsComputeRequiredLSN();
>> - ReplicationSlotSave();
>> - }
>> -
>
> I am quite confused by the wholesale removal of these lines. That
> wasn't
> in previous versions of the patch. As far as I can tell not calling
> ReplicationSlotsComputeRequiredLSN() for the physical slot leads to the
> global minimum LSN never beeing advanced, and thus WAL reserved by the
> slot not being removable. Only if there's some independent call to
> ReplicationSlotsComputeRequiredLSN() XLogSetReplicationSlotMinimumLSN()
> will be called, allowing for slots to advance.
>
> I realize this stuff has been broken since the introduction in
> 9c7d06d6068 (due to the above being if (XLogRecPtrIsInvalid()) rather
> than if (!XLogRecPtrIsInvalid()) , but this seems to make it even
> worse?
>

Yes, there was a ReplicationSlotsComputeRequiredLSN() call inside
pg_physical_replication_slot_advance() in the v5 of the patch:

@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr
moveto)
MyReplicationSlot->data.restart_lsn = moveto;
SpinLockRelease(&MyReplicationSlot->mutex);
retlsn = moveto;
+
+ ReplicationSlotMarkDirty();
+
+ /* We moved retart_lsn, update the global value. */
+ ReplicationSlotsComputeRequiredLSN();

But later it has been missed and we have not noticed that.

I think that adding it back as per attached will be enough.

>
> I find it really depressing how much obviously untested stuff gets
> added in this area.
>

Prior to this patch pg_replication_slot_advance was not being tested at
all. Unfortunately, added tests appeared to be not enough to cover all
cases. It seems that the whole machinery of WAL holding and trimming is
worth to be tested more thoroughly.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
fix-ph-repl-slot-advance-lsn-recompute.diff text/x-diff 545 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2020-06-09 18:32:51 Re: Disallow cancellation of waiting for synchronous replication
Previous Message Tom Lane 2020-06-09 17:59:44 Re: elog(DEBUG2 in SpinLocked section.