Potential data loss due to race condition during logical replication slot creation

From: "Callahan, Drew" <callaan(at)amazon(dot)com>
To: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Cc: "sawada(dot)mshk(at)gmail(dot)com" <sawada(dot)mshk(at)gmail(dot)com>
Subject: Potential data loss due to race condition during logical replication slot creation
Date: 2024-02-01 22:28:44
Message-ID: 29273AF3-9684-4069-9257-D05090B03B99@amazon.com
Lists: pgsql-bugs

Hello,

We discovered a race condition during logical replication slot creation that can cause the changes of transactions running at the time of slot creation to be only partially replicated. The cause is that the slot transitions from an inconsistent or partially consistent state to a fully consistent state when it restores a snapshot that was persisted to disk by a different logical slot. We provide a simple reproduction of this issue below:

Session 1:

SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

CREATE TABLE test (a int);
BEGIN;
INSERT INTO test VALUES (1);

Session 2:

SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');

<query hangs>

Session 3:

CHECKPOINT;

SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);

<should return nothing of interest>

Session 1:

INSERT INTO test VALUES (2);
COMMIT;

<Session 2 query no longer hangs and successfully creates slot2>

Session 2:

SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);

SELECT pg_logical_slot_get_changes('slot2', NULL, NULL);

<expected: no rows of the txn are returned for slot2>
<actual: The 2nd row of the txn is returned for slot2>
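
If the decoded changes should remain available for repeated inspection, the final calls in the reproduction could be swapped for pg_logical_slot_peek_changes, which returns the same rows but does not consume them. A minimal sketch, assuming the slot names used above:

```sql
-- Peek at the decoded changes for slot2 without advancing the slot.
-- 'slot2' is the slot name from the reproduction; NULL, NULL means
-- no upper bound on LSN or on the number of changes returned.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('slot2', NULL, NULL);
```

With the race triggered, the row for the second INSERT would be expected to show up here for slot2, matching the "actual" result noted above.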

Newly created logical replication slots initialize their restart LSN to the current insert position within the WAL and also force a checkpoint to get the current state of the running transactions on the system. The creation process then waits for all of the transactions within that running xact record to complete before it can transition to the next snapbuild state. If, during this time, another running xact record is written and a different logical replication process decodes it, a globally accessible snapshot will be persisted to disk.
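
While the slot creation is hung, the transactions it is presumably waiting on can be listed from another session. This is only a sketch: it assumes the blocking transactions are those that already have a transaction ID assigned, which pg_stat_activity exposes as backend_xid.

```sql
-- List sessions that currently hold an assigned transaction ID.
-- Assumption: these are the candidates a slot creation may be waiting on.
SELECT pid, backend_xid, xact_start, state
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
ORDER BY xact_start;
```

In the reproduction above, Session 1's open transaction would be expected to appear in this list until it commits.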

Once all of the transactions from the initial running xact have finished, the process performing the slot creation will become unblocked and will then consume the new running xact record. The process will see a valid snapshot, restore that snapshot from disk, and then transition immediately to the consistent state. The slot will then set the confirmed flush LSN of the slot to the start of the next record after that running xact.

We now have a logical replication slot whose restart LSN lies after some of the changes of transactions that will commit after its confirmed flush LSN. This happens in any case where the two running xact records are not fully disjoint. Once those transactions commit, we will then stream only part of their changes.
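
The slot positions discussed here can be inspected through the pg_replication_slots view. The query below is a sketch that assumes the slot names from the reproduction; it only reports the LSNs and does not by itself prove that the race occurred.

```sql
-- Compare where each slot will restart decoding with the position
-- up to which changes are considered confirmed.
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name IN ('slot1', 'slot2');
```

Running it right after slot2 is created, and comparing the values against the LSNs returned by pg_logical_slot_get_changes, may help show which of the transaction's changes fall before the slot's restart LSN.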

The attached fix addresses the issue by providing the snapshot builder with enough context to understand whether it is building a snapshot for a slot for the first time or if this is a previously existing slot. This is similar to the “need_full_snapshot” parameter which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.

With this context, the snapshot builder can always skip restoring a persisted snapshot as a way of becoming fully consistent when a slot is being created. Since this issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, we can prevent the issue by disallowing this behavior.

Thanks,
Drew

Attachment Content-Type Size
skip_snap_restore.patch application/octet-stream 3.3 KB
