Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Matheus Alcantara <matheusssilv97(at)gmail(dot)com>
Cc: Daniil Davydov <3danissimo(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Date: 2025-08-21 22:25:34
Message-ID: CAD21AoCD+HXoc2QZCAS9d8ahDeikNqbnU0i6cQzpMFOEurkPPg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 20, 2025 at 2:18 PM Matheus Alcantara
<matheusssilv97(at)gmail(dot)com> wrote:
>
> On Tue Aug 19, 2025 at 2:37 PM -03, Daniil Davydov wrote:
> > Hi,
> >
> > On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
> > <matheusssilv97(at)gmail(dot)com> wrote:
> >>
> >> On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:
> >> > You have started a very long transaction, which holds its xid and
> >> > prevents vacuum from freezing it. But what if the backend is stuck
> >> > while not inside a transaction? Maybe we can just hardcode a huge
> >> > delay (not inside a transaction) or stop process execution via a
> >> > breakpoint in gdb. If we use that instead of a long query, I think
> >> > this error may be reproducible.
> >> >
> >> But how could this happen in real scenarios? I mean, how could the
> >> backend be stuck outside a transaction?
> >>
> >
> > For now, I cannot come up with a situation where this would be
> > possible. Perhaps such a lag may occur during network communication,
> > but I couldn't reproduce it. Maybe other people know how we can
> > achieve this?
> >
> Reading more of the code, I understand that once a NOTIFY command is
> received by a backend (and the transaction is committed) it will
> immediately signal all other listener backends, and if a listener
> backend is idle it will consume the notification and send it back to
> the client as a PqMsg_NotificationResponse. So if there is a network
> delay in sending the notification from the listener backend back to
> the client, I don't think it would be possible to get this error: the
> message was already dispatched by the backend and will eventually
> reach the client, and once the notification is dispatched the backend
> doesn't need to track it anymore (the backend's queue pointers are
> advanced after the dispatch).
>
> Assuming that every SQL command is wrapped in a transaction (if it's
> not already inside one), I think a busy listener backend will always
> prevent vacuum from truncating the clog past its current xid, so any
> notification sent while the backend is busy will not have its
> transaction status removed from the clog files anyway.
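
For reference, the dispatch path described above ends up in
NotifyMyFrontEnd() (src/backend/commands/async.c), reached from
asyncQueueReadAllNotifications() while the listener consumes the queue.
Roughly the following, abridged, so details may differ on your branch:

    void
    NotifyMyFrontEnd(const char *channel, const char *payload, int32 srcPid)
    {
        if (whereToSendOutput == DestRemote)
        {
            StringInfoData buf;

            /* Send one notification as a PqMsg_NotificationResponse. */
            pq_beginmessage(&buf, PqMsg_NotificationResponse);
            pq_sendint32(&buf, srcPid);
            pq_sendstring(&buf, channel);
            pq_sendstring(&buf, payload);
            pq_endmessage(&buf);

            /*
             * No pq_flush() here; a caller flushes later, so the message
             * may travel in the same packet as other output.
             */
        }
        else
            elog(INFO, "NOTIFY for \"%s\" payload \"%s\"", channel, payload);
    }

Once this has run and the listener's queue position has been advanced,
the backend indeed no longer tracks that entry.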

What about backend processes that don't have any xid or xmin (i.e.,
ones that have run only read-only queries and are idle-in-transaction)?

IIUC we process the notification entries at the beginning of the
server loop (see L4608 in postgres.c) and when reading a command (via
ProcessClientReadInterrupt()), but it seems to me that a process in the
idle-in-transaction state doesn't process the entries until its
transaction is committed (see the async.c excerpt after the repro
steps below). I've reproduced the missing clog entry error even when
there is a notification on the queue with a valid listener, with the
following steps:

1. Initialize the database cluster.

2. Execute "ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;".

3. Start one psql session and execute:

-- Session 1
=# listen s;
LISTEN
=# begin;
BEGIN

(keep this session open to leave the process idle-in-transaction.)

4. Open another psql session and execute:

-- Session 2
=# begin;
BEGIN
=# select txid_current();
txid_current
--------------
756
(1 row)
=# notify s;
NOTIFY
=# commit;
COMMIT

The notification on channel 's' should now be available for
session 1's transaction.

5. Consume enough XIDs to truncate clog entries.

-- Session 2
=# create extension xid_wraparound;
CREATE EXTENSION
=# select consume_xids(10_000_000);
NOTICE: consumed 10000000 / 10000000 XIDs, latest 0:10000757
consume_xids
--------------
10000757
(1 row)
=# select txid_current();
txid_current
--------------
10000758
(1 row)

6. Execute vacuum freeze on all databases:

$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"

$ psql -d postgres -c "select datname, datfrozenxid, age(datfrozenxid)
from pg_database"
  datname  | datfrozenxid | age
-----------+--------------+-----
 postgres  |     10000759 |  11
 template0 |     10000759 |  11
 template1 |     10000759 |  11
(3 rows)

7. On the first psql session:

-- Session 1
=# commit;
COMMIT
ERROR: could not access status of transaction 756
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
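
The reason session 1 never consumes the entry while it sits
idle-in-transaction is that ProcessNotifyInterrupt() (reached from the
main server loop and from ProcessClientReadInterrupt()) is a no-op
while we are inside a transaction block. Roughly, from
src/backend/commands/async.c (abridged):

    void
    ProcessNotifyInterrupt(bool flush)
    {
        if (IsTransactionOrTransactionBlock())
            return;             /* not really idle */

        /* Loop in case another signal arrives while sending messages */
        while (notifyInterruptPending)
            ProcessIncomingNotify(flush);
    }

So session 1's queue position stays behind the entry for xid 756 the
whole time, yet nothing prevents vacuum from advancing datfrozenxid and
truncating pg_xact underneath it, because the backend holds neither an
xid nor an xmin.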

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
