Re: Forbid the use of invalidated physical slots in streaming replication.

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Forbid the use of invalidated physical slots in streaming replication.
Date: 2023-12-07 11:42:53
Message-ID: CAExHW5s1_MG6i9ykTLpFjnpmDsQG5WnriCLUVnCe==6uMsxFrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 6, 2023 at 6:25 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> Hi,
>
> When testing streaming replication with a physical slot. I found an unexpected
> behavior that the walsender could use an invalidated physical slot for
> streaming.
>
> This occurs when the primary slot is invalidated due to reaching the
> max_slot_wal_keep_size before initializing the streaming replication
> (e.g. before creating the standby). Attach a draft script(test_invalidated_slot.sh)
> which can reproduce this.

Interesting. Thanks for the script. It reproduces the problem for me easily.

>
> Once the slot is invalidated, it can no longer protect the WALs and
> Rows, as these invalidated slots are not considered in functions like
> ReplicationSlotsComputeRequiredXmin().
>
> Besides, the walsender could advance the restart_lsn of an invalidated slot,
> then user won't be able to know that if the slot is actually validated or not,
> because the 'conflicting' of view pg_replication_slot could be set back to
> null.

In this case, since the basebackup was taken after the slot was
invalidated, it does not require the WAL that was removed. But it
seems that once the streaming starts, the slot sprints to life again
and gets validated again. Here's pg_replication_slot output after the
standby starts
#select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database |
temporary | active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | conflic
ting
-------------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+--------
-----
logicalslot | test_decoding | logical | 5 | postgres | f
| f | | | 739 | |
0/1513B08 | lost | | f | t
physical | | physical | | | f
| t | 341925 | 752 | | 0/404CB78 |
| unreserved | 16462984 | f |
(2 rows)

which looks quite similar to the output when slot was valid after creation
slot_name | plugin | slot_type | datoid | database |
temporary | active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | conflic
ting
-------------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+--------
-----
logicalslot | test_decoding | logical | 5 | postgres | f
| f | | | 739 | 0/1513AD0 |
0/1513B08 | unreserved | -1591888 | f | f
physical | | physical | | | f
| f | | | | 0/14F0DF0 |
| unreserved | -1591888 | f |
(2 rows)

>
> So, I think it's a bug and one idea to fix is to check the validity of the physical slot in
> StartReplication() after acquiring the slot like what the attachment does,
> what do you think ?

I am not sure whether that's necessarily a bug. Of course, we don't
expect invalid slots to be used but in this case I don't see any harm.
The invalid slot has been revived and has all the properties set just
like a valid slot. So it must be considered in
ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it
myself though. In case the WAL is really lost and is requested by the
standby it will throw an error "requested WAL segment [0-9A-F]+ has
already been removed". So no harm there as well.

I haven't been able to locate the code which makes the slot valid
though. So I can't say whether the behaviour is intended or not.
Looking at StartReplication() comment
/*
* We don't need to verify the slot's restart_lsn here; instead we
* rely on the caller requesting the starting point to use. If the
* WAL segment doesn't exist, we'll fail later.
*/
it looks like the behaviour is not completely unexpected.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-12-07 11:59:55 Re: Is WAL_DEBUG related code still relevant today?
Previous Message Amit Kapila 2023-12-07 11:36:50 Re: Synchronizing slots from primary to standby