From: Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Logical replication: lost updates/deletes and invalid log messages caused by SnapshotDirty + concurrent updates
Date: 2025-09-10 23:56:00
Message-ID: CADzfLwXZVmbo11tFS_G2i+6TfFVwHU4VUUSeoqb+8UQfuoJs8A@mail.gmail.com
Lists: pgsql-hackers
Hello, everyone!
This was initially discussed in "[BUG?]
check_exclusion_or_unique_constraint false negative" [0], but as Amit
recommended [1], I am writing a dedicated email about it, since the
original thread is too deep and contains some unrelated discussion.
The commitfest entry [2] contains reproducers for most of the issues
described below, both injection_point-based and pgbench-based (to
show that such things may happen in real situations).
------
Source of the problem: any scan using SnapshotDirty/SnapshotSelf may
miss a logical tuple entirely (as if it did not exist) in the case of
a parallel update.
It happens like this:
* A SnapshotDirty heap scan starts
* Some page X has already been processed by the scan, but page Y has not
* A logical row is updated by a concurrent transaction: the new
version is placed on page X (already visited), and the old version
on page Y (not yet visited) is marked as deleted
* Our scan finally visits page Y and skips the tuple because its xmax
belongs to a committed transaction
* As a result, both the old and the new version of the tuple are skipped
So, in the MVCC view, the row exists the whole time (just in different
versions), but from the perspective of SnapshotDirty it does not exist
at all at that moment.
The same may happen in a btree scan, even within a single index page.
This is because the btree scan copies all TIDs from an index page
before starting to process them.
If, after that copy is taken, the TID of a new tuple version is
inserted into that page, the scan will not see it, while the "cached"
TID may be marked as deleted by a committed transaction, resulting in
zero tuples being found.
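The same idea in toy form (again invented Python, not PostgreSQL code;
TIDs are just strings here):

```python
# Toy model: the index scan first copies the leaf page's TIDs, then
# fetches heap tuples one by one. A concurrent UPDATE that inserts the
# new version's TID into the same leaf page after the copy is invisible.

def btree_page_scan(leaf_tids, heap, concurrent_update):
    cached = list(leaf_tids)   # snapshot of the leaf page's TID array
    concurrent_update()        # commits between the copy and heap fetches
    return [tid for tid in cached if not heap[tid]["dead"]]

heap = {"(0,1)": {"dead": False}}
leaf = ["(0,1)"]

def update():
    # Committed UPDATE: old version deleted, new version gets a new TID.
    heap["(0,1)"]["dead"] = True
    heap["(0,2)"] = {"dead": False}
    leaf.append("(0,2)")       # too late: the scan already cached `leaf`

print(btree_page_scan(leaf, heap, update))  # [] -- zero tuples found
```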
------
Minor consequence: check_exclusion_or_unique_constraint efficiency
It may fail to find a record in a UNIQUE index during INSERT ... ON
CONFLICT DO UPDATE. However, this is only a minor performance issue:
the retry logic takes care of it anyway.
------
Major consequence: incorrect behavior of logical replication in the
case of parallel updates on the subscriber.
This is caused by the fact that RelationFindReplTupleByIndex (or, less
likely, RelationFindReplTupleSeq) may miss the target row of an
action, resulting in that action being skipped.
---case 1: LOST DELETE---
Scenario: race of DELETE on publisher with UPDATE on subscriber
Setup:
On publisher:
CREATE TABLE t(a int PRIMARY KEY, data text);
INSERT INTO t(a, data) VALUES (1, 'value_from_publisher');
On subscriber:
CREATE TABLE t(a int PRIMARY KEY, data text);
-- index is required to prevent HOT updates
CREATE INDEX data_index ON t(data);
Commands:
On publisher:
DELETE FROM t WHERE a=1;
On subscriber:
UPDATE t SET data = 'value_from_subscriber' WHERE (a=1);
Results:
Expected:
Tuple is deleted on both subscriber and publisher.
Both DELETE-before-UPDATE and UPDATE-before-DELETE result in
the absence of the row in the end.
Actual:
Either as expected, or:
Tuple is deleted on publisher, but 'value_from_subscriber'
remains on subscriber.
---case 2: LOST UPDATE---
Scenario: UPDATE on publisher vs subscriber-columns only UPDATE on subscriber
Setup:
On publisher:
CREATE TABLE t(a int PRIMARY KEY, data text);
INSERT INTO t(a, data) VALUES (1, 'initial_data');
On subscriber:
-- note the additional subscriber-only column
CREATE TABLE t(a int PRIMARY KEY, data text,
  sub_only varchar(40) DEFAULT 'default_data');
CREATE INDEX sub_only_index ON t(sub_only);
Commands:
On publisher:
UPDATE t SET data = 'update_from_publisher' WHERE (a=1);
On subscriber:
UPDATE t SET sub_only = 'update_from_subscriber' WHERE (a=1);
Results:
Expected:
On subscriber: tuple (a=1, data='update_from_publisher',
sub_only='update_from_subscriber').
The update on the subscriber only affects the subscriber-only
column, so I expect the publisher-driven columns to be replicated.
Actual:
Either as expected, or:
Publisher update is lost, leaving (a=1, data='initial_data',
sub_only='update_from_subscriber') on subscriber.
------
Another consequence: invalid conflict log messages
For the same reasons, logical replication may emit invalid log
messages (and statistics) about conflicts:
* correct: update_origin_differs, in logs: update_missing
* correct: update_origin_differs, in logs: update_deleted (reporting
invalid deleted tuple data)
* correct: delete_origin_differs, in logs: delete_missing
This is not a critical issue by itself, but invalid logs are something
no one wants to deal with, especially logs intended for debugging
conflicts. Moreover, for any conflict-resolution mechanism added in
the future, it may become a critical flaw.
------
Proposed fix:
The patch in [2] addresses the issue by rewriting
RelationFindReplTupleByIndex/RelationFindReplTupleSeq to use
GetLatestSnapshot for each attempt to find the target row.
Since the code already calls GetLatestSnapshot before
table_tuple_lock, no performance regression is expected.
As a result:
* an MVCC scan will not miss updated tuples, while a dirty scan may
* an MVCC scan cannot see not-yet-committed new rows, while a dirty
scan can; however, in both variants table_tuple_lock waits for the
updating transaction to commit before the whole scan is retried
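The contrast with the dirty scan can be illustrated by extending the
same toy model (invented Python, not PostgreSQL code; xids and the
visibility check are heavily simplified):

```python
# Toy model: an MVCC snapshot taken at scan start keeps seeing the
# version that was live at that moment, even if a concurrent transaction
# commits a delete of it mid-scan. Visibility is modeled by comparing a
# tuple's deleting xid (xmax) against the snapshot's xid horizon.

def mvcc_scan(pages, snapshot_xid, concurrent_update_after_page):
    seen = []
    for pageno in sorted(pages):
        for tup in pages[pageno]:
            # Visible if never deleted, or deleted by a transaction newer
            # than (hence invisible to) our snapshot.
            if tup["xmax"] is None or tup["xmax"] > snapshot_xid:
                seen.append((pageno, tup["key"]))
        if pageno == concurrent_update_after_page:
            pages[1][0]["xmax"] = 100   # delete commits as xid 100
            pages[0].append({"key": "row-A", "xmax": None})
    return seen

pages = {0: [], 1: [{"key": "row-A", "xmax": None}]}
# Snapshot taken before xid 100 committed -> old version still visible.
print(mvcc_scan(pages, snapshot_xid=99, concurrent_update_after_page=0))
# [(1, 'row-A')] -- the scan finds the old version; table_tuple_lock can
# then detect the concurrent update and the caller retries with a fresh
# snapshot instead of silently skipping the row
```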
In any case, I think we should at least document this behavior.
Best regards,
Mikhail.
[0]: https://www.postgresql.org/message-id/flat/CANtu0oiktqQ2pwExoXqDpByXNCJa-KE5vQRodTRnmFHN_%2BqwHg%40mail.gmail.com
[1]: https://www.postgresql.org/message-id/flat/CAA4eK1LZxzORgAoDhix9MWrOqYOsNZuZLW2sTfGsJFM99yRgrg%40mail.gmail.com#02be86f7e2d24a038878f03ac1b93e95
[2]: https://commitfest.postgresql.org/patch/5151/