From: | Jacques Combrink <jacques(at)quantsolutions(dot)co(dot)za> |
---|---|
To: | Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Daniil Davydov <3danissimo(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
Date: | 2025-09-01 14:06:58 |
Message-ID: | db706cba-c86f-4999-8696-fe5545d4e630@quantsolutions.co.za |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I came across this discussion after we had this problem at least twice
now at work.
I read through the discussion to see if this is the problem we are
experiencing.
At first, I thought this was not exactly our problem, because of the way
it was showed to be reproduced.
With a long running transaction that is listening, and then doing all
the other steps like like running through transactions until a second
pg_xact file is created, and then vacuuming etc etc...
But, I felt like we don't do this, so it cannot be the way we run into
the problem. Specifically, having a listener as a long running transaction.
So I set out to get it into the broken state, and here is how I can
reproduce it without long running queries:
1. Create another database
postgres=# CREATE DATABASE test;
CREATE DATABASE
2. Create a listener on the postgres database.
postgres=# LISTEN xx;
LISTEN
3. Create notifies for the test database.
Here I struggled to get a stuck notification in the queue (SELECT
pg_notification_queue_usage();).
It can happen with only a single notify from a psql connection, but I
get a higher hit rate with the following:
Create a notify.sql with the only contents being:
```
NOTIFY s, 'yappers';
```
Then run this against the test database with pgbench.
pgbench -n -c 80 -j 20 -t 1000 -f notify.sql test
4. Confirm that there is now something stuck in the queue:
postgres=# SELECT pg_notification_queue_usage();
pg_notification_queue_usage
-----------------------------
9.5367431640625e-07
(1 row)
If this still shows 0, then run step 3 again.
5. Consume xid's. I create a file consume.sql with the only contents being:
```
SELECT txid_current();
```
Then:
pgbench -n -c 80 -j 30 -t 100000 -f consume.sql postgres
6. Verify that a new file is created in the pg_xact folder.
If not, just run the previous step again.
7. Run vacuum freeze. (Remember to allow connections to template0
beforehand with `ALTER DATABASE template0 WITH ALLOW_CONNECTIONS true;`)
$ vacuumdb --all --freeze
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template0"
vacuumdb: vacuuming database "template1"
vacuumdb: vacuuming database "test"
8. Connect to the test database and try execute a listener.
test=# LISTEN anything;
ERROR: could not access status of transaction 81086
DETAIL: Could not open file "pg_xact/0000": No such file or directory.
test=#
Extra weirdness:
In step 2 create a connection to the test database and LISTEN there on
any channel, even the one you are notifying to:
And you don't have to run anything on that connection, there is no
backend_xmin on that connection, you don't start a transaction nothing,
you just run `LISTEN something;`
Then after step 7, as long as you don't close this connection, you will
not get an error when you try to set up a listener on the test database,
even on a new connection to the test database.
What I'm saying is after step 7 you can have two connections. One to
postgres database with active listener, and one to test database with
active listener.
As long as the postgres one is open the queue is stuck, ie `SELECT
pg_notification_queue_usage();` returns non-zero.
As long as the test database connection is open, new listeners do not
experience the error.
If you close the test database connection, but leave the postgres
connection, from now on any listener you try to create on test database
will error.
If you close the postgres connection, the queue clears immediately and
new LISTEN's on the test database will work.
This means it is possible to get rid of the problem without restarting
the postmaster if you can close connections until you close the one that
"caused" the problem.
Don't know if this is actually useful, but to me it seemed like everyone
believed it to be impossible to recover without restarting, so it's
something at least.
TLDR:
active listener on one database causes notify on another database to get
stuck.
At no point could I get a stuck notify if I don't have a listener on at
least one other database than the one I am notifying on. See the Extra
weirdness section.
At no point do you need to have any other queries running, there is
never an idle in transaction query needed for bad timing with the vacuum.
I hope I explained everything well enough so that one of you smart
people can find and fix the problem.
From | Date | Subject | |
---|---|---|---|
Next Message | Bertrand Drouvot | 2025-09-01 14:07:27 | Get rid of pgstat_count_backend_io_op*() functions |
Previous Message | Daniil Davydov | 2025-09-01 14:04:04 | Re: Replace magic numbers with strategy numbers for B-tree indexes |