Re: Improve conflict detection when replication origins are reused

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shvetamalik(at)gmail(dot)com>
Subject: Re: Improve conflict detection when replication origins are reused
Date: 2026-05-15 03:26:42
Message-ID: CAJpy0uAW=7WhkmeZYxB9zoGsk1G5itdJQC6srvpqz9hhKF9xoA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, May 14, 2026 at 8:35 AM Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> While reviewing the issue reported at [1] and the proposed solutions
> at [2], I noticed a related problem: false negative conflict detection
> when a 'ReplOriginId' gets reused.
>
> In logical replication, conflict detection relies on the tuple’s
> replication origin ('roident'). The problem is that if a subscription
> is dropped and a new subscription later reuses the same origin ID, the
> apply worker may incorrectly treat incoming changes as “its own”
> changes and skip conflict detection.
>
> A simple example:
> 1. Create subscription sub1 with 'roident = 1'
> 2. Replicate some rows into table 't1'
> 3. Drop 'sub1'
> 4. Create another subscription 'sub2'
> 5. `sub2` reuses 'roident = 1'
> 6. New updates arrive for rows previously written by 'sub1'
> At this point, conflict detection sees:
> tuple_origin == current_origin
>
> and incorrectly assumes the row was written by the current
> subscription instance, so no 'update_origin_differ' conflict is
> raised.

I agree with the problem sattement. I will prioritize the review soon.

> This may look harmless in this simple setup, but it becomes
> problematic if the new subscription is connected to a different
> publisher, because real conflicts can then be silently missed.
>
> I explored two possible approaches to solve this:
>
> Approach 1. Zero out old origin IDs in commit_ts data when dropping a
> subscription
> ----------------------
> - When a subscription is dropped and its replication origin becomes
> free, scan all 'commit_ts' SLRU entries and replace that old origin ID
> with 'InvalidRepOriginId (0)'.
> - So rows previously written by the old subscription would no longer
> appear to belong to any active replication origin.
> - A new subscription reusing the same 'roident' will always conflict
> with origin '0'.
>
> Pros:
> - Fixes the stale-origin problem completely and may also help solve
> the tablesync-origin issue discussed in [1]
> - No additional checks needed during conflict detection
>
> Cons:
> - Requires scanning the entire 'commit_ts' SLRU during DROP
> SUBSCRIPTION, so it can become very expensive on large systems
> - Not crash-safe currently(patch):
> - if the server crashes midway, some entries may still contain the
> old origin ID
> - after restart, reused origins can again lead to missed conflicts
> - Making this fully crash-safe would likely require WAL logging or
> recovery-time reprocessing.
>
> Approach 2. Store replication origin creation time
> ----------------------
> - Add a creation timestamp for each replication origin
> - During conflict check:
> if tuple_origin != current_origin
> -> existing behavior
> if tuple_origin == current_origin
> -> compare tuple commit timestamp with origin creation time
> if tuple_commit_ts <= origin_creation_time
> -> treat as an origin reuse case and raise conflict
>
> Pros:
> -------
> - No additional processing during DROP SUBSCRIPTION
> - Lightweight runtime check (just one timestamp comparison)
> - Naturally crash-safe since origin creation is WAL-logged already
>
> Cons:
> - Requires a catalog schema change
> - The <= comparison can produce false-positive conflicts for rows
> committed at the exact same microsecond as origin creation
> - May require additional handling for upgraded origins
>
> IMO, the second approach currently looks more practical because it
> avoids the heavy SLRU scan and crash-recovery complexity.
>
> Attached:
> - Patch for approach 1
> - Patch for approach 2
> - A TAP test reproducing the issue
>
> Note: The patches are manually tested for the reported issue, but not
> yet tested for performance or additional edge cases.
>
> Feedback and suggestions are welcome.
>
> [1] https://www.postgresql.org/message-id/CALDaNm3Y6Y4Mub6QC8fZKnNy5jZspELQYCoQF_FL2Zwzweu%3Dog%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAA4eK1LxGXR7jOAKh0B8N362S-Q3b6GhBxxcV_HxUaicEPq5Cg%40mail.gmail.com
>
> --
> Thanks,
> Nisha

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ayush Tiwari 2026-05-15 03:42:11 Re: Add a guard against uninitialized default locale
Previous Message Chao Li 2026-05-15 03:08:54 Re: Should IGNORE NULLS cache nullness for volatile arguments?