Replication conflicts not processed in ClientWrite

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Replication conflicts not processed in ClientWrite
Date: 2024-03-04 13:12:38
Message-ID: CABUevExBm_va9+iW0kgVuZbrLDUZ8VnL2wo2ig7jqqdGsy8ZKQ@mail.gmail.com
Lists: pgsql-hackers

When a backend is blocked on writing data (such as with a network
error or a very slow client), indicated with wait event ClientWrite,
it appears to not properly notice that it's overrunning
max_standby_streaming_delay, and therefore does not cancel the
transaction on the backend.
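
For illustration only, the blocked-write state itself can be shown outside
PostgreSQL. The following is a minimal Python sketch, not PostgreSQL code: a
socket pair stands in for the backend/client connection, and the helper name
fill_send_buffer is made up for this example. It assumes that a writer whose
peer stops reading stalls once the kernel buffers are full, and says nothing
about how the backend handles conflict interrupts while in that state.

```python
import socket


def fill_send_buffer(writer: socket.socket, chunk: bytes = b"x" * 65536) -> int:
    """Send until the kernel accepts no more data; return the bytes accepted."""
    writer.setblocking(False)
    sent = 0
    while True:
        try:
            sent += writer.send(chunk)
        except BlockingIOError:
            # A blocking send() would sleep at this point. This is the
            # analogue of a backend sitting in the ClientWrite wait event.
            return sent


# Hypothetical stand-ins for the two ends of a client connection.
backend_side, client_side = socket.socketpair()

# The "client" reads nothing, so the writer stalls after filling the buffers.
stalled_after = fill_send_buffer(backend_side)
print(f"writer stalled after {stalled_after} bytes")

# Once the peer drains data again (the SIGCONT case), writing can resume.
client_side.setblocking(False)
drained = 0
try:
    while True:
        drained += len(client_side.recv(65536))
except BlockingIOError:
    pass  # nothing left to read

resumed = fill_send_buffer(backend_side)
print(f"peer drained {drained} bytes, writer accepted {resumed} more")

backend_side.close()
client_side.close()
```

The byte counts depend on the platform's socket buffer sizes, so they will
differ between systems.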

I've reproduced this repeatedly on Ubuntu 20.04 with PostgreSQL 15 from
the Debian packages. Curiously enough, if I install the debug symbols
and restart in order to get a backtrace, it starts processing the
cancellation again and I can no longer reproduce the problem. So it
looks like a timing issue of some kind.

My simple test was, with session 1 on the standby and session 2 on the primary:
Session 1: begin transaction isolation level repeatable read;
Session 1: select count(*) from testtable;
Session 2: alter table testtable rename to testtable2;
Session 1: select * from testtable t1 cross join testtable t2;
kill -STOP <the pid of session 1>

At this point, replication lag starts growing on the standby and it
never terminates the session.

If I then SIGCONT it, it will get terminated by replication conflict.

If I kill the session hard, the replication lag recovers immediately.

AFAICT if the conflict happens at ClientRead, for example, it's
picked up immediately, but there's something in ClientWrite that
prevents it.

My first thought would be OpenSSL, but this is reproducible both over
TLS-over-TCP and over Unix sockets.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
