| From: | Antonin Houska <ah(at)cybertec(dot)at> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Race conditions in logical decoding |
| Date: | 2026-01-19 16:29:25 |
| Message-ID: | 85833.1768840165@localhost |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
A stress test [1] for the REPACK patch [1] revealed data
corruption. Eventually I found out that the problem is in postgres core. In
particular, it can happen that a COMMIT record is decoded, but before the
commit could be recorded in CLOG, a snapshot that takes the commit into
account is created and even used. Visibility checks then work incorrectly
until the CLOG gets updated.
In logical replication, the consequences are not only wrong data on the
subscriber, but also corrutped table on publisher - this is due to incorrectly
set commit hint bits.
Attached is a spec file that demonstrates the issue. I did not add it to
Makefile because I don't expect the current version to be merged (see the
commit message for details.
I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.
[1] https://www.postgresql.org/message-id/CADzfLwU78as45To9a%3D-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng%40mail.gmail.com
[2] https://commitfest.postgresql.org/patch/5117/
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-Demonstrate-possible-race-conditions-in-logical-decoding.patch | text/x-diff | 11.6 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Antonin Houska | 2026-01-19 16:31:19 | Re: Adding REPACK [concurrently] |
| Previous Message | Tom Lane | 2026-01-19 16:21:42 | Re: tablecmds: clarify recurse vs recusing |