RE: Potential data loss due to race condition during logical replication slot creation

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>, "Callahan, Drew" <callaan(at)amazon(dot)com>
Cc: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: Potential data loss due to race condition during logical replication slot creation
Date: 2024-03-13 09:34:22
Message-ID: TYCPR01MB1207719C811F580A8774C79B7F52A2@TYCPR01MB12077.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

Dear hackers,

While analyzing another failure [1], I found this thread. I think both failures
have the same cause.

The reported failure occurs when a replication slot is created in the middle
of a transaction and reuses the snapshot from another slot. The reproducer is:

```
Session0

SELECT pg_create_logical_replication_slot('slot0', 'test_decoding');
BEGIN;
INSERT INTO foo ...

Session1

SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

Session2

CHECKPOINT;
SELECT pg_logical_slot_get_changes('slot0', NULL, NULL);

Session0

INSERT INTO var ... -- var is defined with (user_catalog_table = true)
COMMIT;

Session1

SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);
-> Assertion failure.
```
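
For anyone who wants to try the steps above, the two tables could be set up
roughly as follows. This is only a sketch: the column definitions are
assumptions made for illustration, and only the user_catalog_table option on
var comes from the reproducer. It also assumes the server runs with
wal_level = logical and that the test_decoding contrib module is installed.

```
-- Hypothetical setup for the reproducer; column names and types are assumed.
-- Requires wal_level = logical and the test_decoding contrib module.

-- Ordinary table modified in the first half of the transaction.
CREATE TABLE foo (id int);

-- Table flagged as a user catalog table, so that writes to it are
-- treated as catalog changes by logical decoding.
CREATE TABLE var (id int) WITH (user_catalog_table = true);
```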

> Here is the summary of several proposals we've discussed:
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().

> b) Have snapbuild.c being able to handle multiple SnapBuildOnDisk versions.

> c) Add a global variable, say in_create, to snapbuild.c

Regarding the three options raised by Sawada-san, I prefer approach a).
Since the issue can happen on all supported branches, we should choose the
most conservative approach. Also, it would be quite painful to maintain
multiple pieces of code that handle the same issue.

The attached patch implements approach a), since no one had written one yet.
I also added a test that triggers the assertion failure, but I am not sure
whether it should be included.

[1]: https://www.postgresql.org/message-id/TYCPR01MB1207717063D701F597EF98A0CF5272%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachment: master_0001-fix-snapbuild-bug-by-approach-a.patch (application/octet-stream, 13.2 KB)
