Quick Links

RE: Fix slot synchronization with two_phase decoding enabled

From:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Subject:	RE: Fix slot synchronization with two_phase decoding enabled
Date:	2025-04-28 08:56:52
Message-ID:	OS0PR01MB57163BC84F641BFB4082C89D94812@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Apr 18, 2025 at 12:29 PM Amit Kapila wrote:
>
> On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> >
> > -----
> > Fix
> > -----
> >
> > I think we should keep the confirmed_flush even if the previous synced
> > restart_lsn/catalog_xmin is newer. Attachments include a patch for the
> same.
> >
>
> This will fix the case we are facing but adds a new rule for slot synchronization.
> Can we think of a simpler way to fix this by avoiding updating other slot fields
> (like two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot
> is ahead of the remote slot?

Since the fix for xmin advancement during fast-forward decoding has been pushed
(commit aaf9e95), I'm attaching the V2 patch to address the assert failure by
avoiding updating other slot fields (like two_phase, two_phase_at) if
restart_lsn or catalog_xmin of the local slot is ahead.

After commit aaf9e95, it's less likely for the synced slot's restart/catalog to
advance ahead of the source slot (The tests in
040_standby_failover_slots_sync.pl would not trigger it anymore). However, this
can still occur in certain scenarios.

Here is an example scenario with the steps involved:

1. Create a failover replication slot named 'logicalslot' on primary
and acquire it in the walsender.

2. Log two standby snapshots on primary:
- `0/3000428` - `running_xacts` (no running transactions)
- `0/3000460` - `running_xacts` (no running transactions)

3. The walsender sets `0/3000428` as the `candidate_restart_lsn`, skipping the
second `running_xacts` because `candidate_restart_lsn` is already valid.

4. After receiving a reply from the apply worker, the walsender assigns
`0/3000428` to `restart_lsn`. At this point, the replication slot
'logicalslot' has `restart_lsn` set to `0/3000428`.

5. On the standby, execute `pg_sync_replication_slots()` to synchronize
'logicalslot'.

6. During synchronization, with the initial `restart_lsn` at `0/3000428`, the
sync slot reaches a consistent point at this position. As a result, it does
not use this consistent point as `candidate_restart_lsn` (refer to
`SnapBuildProcessRunningXacts()`).

7. The sync process identifies the second standby snapshot at `0/3000460` and
uses its LSN as `candidate_restart_lsn`, eventually updating it to
`restart_lsn`.

8. Now, the synced slot holds `restart_lsn` at `0/3000460`, which is ahead of
the remote slot on the primary server.

So, if a user prepares a txn and changes the two_phase setting to true after
the steps mentioned above, the value for two_phase_at might get synced to the
standby while skipping the sync of the latest confirmed_flush_lsn. So I think
it's still necessary to fix the assert failure.

BTW, this fix is only for HEAD as back-branches do not sync
two_phase_at field.

Best Regards,
Hou zj

Attachment	Content-Type	Size
v2-0001-Fix-assertion-failure-when-decoding-synced-two-ph.patch	application/octet-stream	4.0 KB

In response to

Re: Fix slot synchronization with two_phase decoding enabled at 2025-04-18 04:28:57 from Amit Kapila

Responses

Re: Fix slot synchronization with two_phase decoding enabled at 2025-04-28 09:33:05 from shveta malik

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fujii Masao	2025-04-28 09:06:59	Re: Questions about logicalrep_worker_launch()
Previous Message	Álvaro Herrera	2025-04-28 08:40:53	Re: using index to speedup add not null constraints to a table