From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Matheus Alcantara <matheusssilv97(at)gmail(dot)com> |
Cc: | Daniil Davydov <3danissimo(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
Date: | 2025-08-21 22:25:34 |
Message-ID: | CAD21AoCD+HXoc2QZCAS9d8ahDeikNqbnU0i6cQzpMFOEurkPPg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Aug 20, 2025 at 2:18 PM Matheus Alcantara
<matheusssilv97(at)gmail(dot)com> wrote:
>
> On Tue Aug 19, 2025 at 2:37 PM -03, Daniil Davydov wrote:
> > Hi,
> >
> > On Tue, Aug 19, 2025 at 6:31 PM Matheus Alcantara
> > <matheusssilv97(at)gmail(dot)com> wrote:
> >>
> >> On Tue Aug 19, 2025 at 12:57 AM -03, Daniil Davydov wrote:
> >> > You have started a very long transaction, which holds its xid and prevents
> >> > vacuum from freezing it. But what if the backend is stuck not inside a
> >> > transaction? Maybe we can just hardcode a huge delay (not inside the
> >> > transaction) or stop process execution via breakpoint in gdb. If we will use it
> >> > instead of a long query, I think that this error may be reproducible.
> >> >
> >> But how could this happen in real scenarios? I mean, how the backend
> >> could be stuck outside a transaction?
> >>
> >
> > For now, I cannot come up with a situation where it may be possible.
> > Perhaps, such a lagging may occur during network communication,
> > but I couldn't reproduce it. Maybe other people know how we can achieve
> > this?
> >
> Reading more the code I understand that once the a NOTIFY command is
> received by a backend (and the transaction is committed) it will
> emedialy signal all other listener backends and if the listener backend
> is in idle it will consume the notification and then send it back to the
> client as a PqMsg_NotificationResponse, so if there is a network delay
> to send the notification from the listener backend back to the client I
> don't think that it would be possible to get this error, because the
> message was already dispatched by the backend and it will eventually get
> to the client and once the notification is dispatched the backend
> doesn't need to track it anymore (the queue pointers of the backend are
> advanced after the dispatch).
>
> Assuming that every SQL command is wrapped into a transaction (if it's
> not already inside in) I think a busy listener backend will always
> prevent the vacuum from freezing clog files past from its current xid,
> so any notification that is sent while the backend is busy will not have
> their transaction status removed from clog files anyway.
What about backend processes that don't have any xid or xmin (i.e.,
are read-only query and in idle-in-transaction)?
IIUC we process the notification entries at the beginning of the
server loop (see L4608 in postgres.c) and when reading a command (via
ProcessClientReadInterrupt()), but it seems to me that if a process is
in idle-in-transaction state it doesn't process the entries unless the
transaction is committed. I've reproduced the missing clog entry error
even if we have a notification on the queue with a valid listener,
with the following steps:
1. Initialize the database cluster.
2. Execute "ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;".
3. Start one psql session and execute:
-- Session 1
=# listen s;
LISTEN
=# begin;
BEGIN
(keep this session open to leave the process idle-in-transaction.)
4. Open another psql session and execute:
-- Session 2
=# begin;
BEGIN
=# select txid_current();
txid_current
--------------
756
(1 row)
=# notify s;
NOTIFY
=# commit;
COMMIT
The notification to the channel 's' should be available for the
session-1's transaction.
5. Consume enough XIDs to truncate clog entries.
-- Session 2
=# create extension xid_wraparound;
CREATE EXTENSION
=# select consume_xids(10_000_000);
NOTICE: consumed 10000000 / 10000000 XIDs, latest 0:10000757
consume_xids
--------------
10000757
(1 row)
=# select txid_current();
txid_current
--------------
10000758
(1 row)
6. Execute vacuum freeze on all databases:
$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"
$ psql -d postgres -c "select datname, datfrozenxid, age(datfrozenxid)
from pg_database"
datname | datfrozenxid | age
-----------+--------------+-----
postgres | 10000759 | 11
template0 | 10000759 | 11
template1 | 10000759 | 11
(3 rows)
7. On the first psql session:
-- Session 1
=# commit;
COMMIT
ERROR: could not access status of transaction 756
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2025-08-21 22:27:17 | Re: VM corruption on standby |
Previous Message | Robert Treat | 2025-08-21 22:06:13 | Re: Adding REPACK [concurrently] |