Quick Links

Re: Forbid the use of invalidated physical slots in streaming replication.

From:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Forbid the use of invalidated physical slots in streaming replication.
Date:	2023-12-07 11:42:53
Message-ID:	CAExHW5s1_MG6i9ykTLpFjnpmDsQG5WnriCLUVnCe==6uMsxFrw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Dec 6, 2023 at 6:25 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> Hi,
>
> When testing streaming replication with a physical slot. I found an unexpected
> behavior that the walsender could use an invalidated physical slot for
> streaming.
>
> This occurs when the primary slot is invalidated due to reaching the
> max_slot_wal_keep_size before initializing the streaming replication
> (e.g. before creating the standby). Attach a draft script(test_invalidated_slot.sh)
> which can reproduce this.

Interesting. Thanks for the script. It reproduces the problem for me easily.

>
> Once the slot is invalidated, it can no longer protect the WALs and
> Rows, as these invalidated slots are not considered in functions like
> ReplicationSlotsComputeRequiredXmin().
>
> Besides, the walsender could advance the restart_lsn of an invalidated slot,
> then user won't be able to know that if the slot is actually validated or not,
> because the 'conflicting' of view pg_replication_slot could be set back to
> null.

>
> So, I think it's a bug and one idea to fix is to check the validity of the physical slot in
> StartReplication() after acquiring the slot like what the attachment does,
> what do you think ?

I am not sure whether that's necessarily a bug. Of course, we don't
expect invalid slots to be used but in this case I don't see any harm.
The invalid slot has been revived and has all the properties set just
like a valid slot. So it must be considered in
ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it
myself though. In case the WAL is really lost and is requested by the
standby it will throw an error "requested WAL segment [0-9A-F]+ has
already been removed". So no harm there as well.

I haven't been able to locate the code which makes the slot valid
though. So I can't say whether the behaviour is intended or not.
Looking at StartReplication() comment
/*
* We don't need to verify the slot's restart_lsn here; instead we
* rely on the caller requesting the starting point to use. If the
* WAL segment doesn't exist, we'll fail later.
*/
it looks like the behaviour is not completely unexpected.

--
Best Wishes,
Ashutosh Bapat

In response to

Forbid the use of invalidated physical slots in streaming replication. at 2023-12-06 12:55:29 from Zhijie Hou (Fujitsu)

Responses

RE: Forbid the use of invalidated physical slots in streaming replication. at 2023-12-07 13:00:35 from Zhijie Hou (Fujitsu)

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bharath Rupireddy	2023-12-07 11:59:55	Re: Is WAL_DEBUG related code still relevant today?
Previous Message	Amit Kapila	2023-12-07 11:36:50	Re: Synchronizing slots from primary to standby