Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

From: Matheus Alcantara <matheusssilv97(at)gmail(dot)com>
To: Daniil Davydov <3danissimo(at)gmail(dot)com>
Cc: Álvaro Herrera <alvherre(at)kurilemu(dot)de>, Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Date: 2025-08-19 00:14:15
Message-ID: CAFY6G8cnDOMZCfSvc62vER_9+HOs4EM52XromhOZki+=Ytv2sg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed Aug 13, 2025 at 4:29 PM -03, Daniil Davydov wrote:
> Hi,
>
> On Mon, Aug 11, 2025 at 8:41 PM Matheus Alcantara
> <matheusssilv97(at)gmail(dot)com> wrote:
>>
>> On Wed Aug 6, 2025 at 7:44 AM -03, Álvaro Herrera wrote:
>> >> My questions:
>> >>
>> >> 1. Is it acceptable to drop notifications from the async queue if
>> >> there are no active listeners? There might still be notifications that
>> >> haven’t been read by any previous listener.
>> >
>> > I'm somewhat wary of this idea -- could these inactive listeners become
>> > active later and expect to be able to read their notifies?
>> >
>> I'm bit worry about this too.
>
> What exactly do we mean by "active listener"? According to the source code,
> the active listener (as far as I understand) is the one who listens to at least
> one channel. If we have no active listeners in the database, the new listener
> will set its pointer to the tail of the async queue. Thus, messages with old
> xid will not be touched by anybody. I don't see any point in dropping them
> in this case.
>
I think that this definition is correct, but IIUC the tail can still
have notifications with xid's that were already truncated by vacuum
freeze. When the LISTEN is executed, we first loop through the
notification queue to try to advance the queue pointers and we can
eventually iterate over a notification that was added on the queue
without any listener but it has a xid that is already truncated by vacuum
freeze, so in this case it will fail to get the transaction status. On
Alex steps to reproduce the issue it first executes the NOTIFY and
then executes the LISTEN which fails after vacuum freeze.

> If the "inactive" listener is the backend which is stuck somewhere, the
> answer is "no" - this backend should be able to process all notifications.
>
I tried to reproduce the issue by using some kind of "inactive"
listener but so far I didn't manage to trigger the error. This is what I
tried:

1. Create listener:
postgres=# listen c1;

2. Execute a very long query to make the backend busy to process the
notification:
postgres=# select * from generate_series(1,10000000000) g where g > 1;

3. On another session send the notification
postgres=# notify c1;

4. Execute pgbench test:
pgbench -n -c 80 -j 10 -t 15000 -f test.sql postgres

5. Verify that we have multiple files on pg_xact:
➜ ls -lah ~/pg-devel/data/pg_xact
total 608
-rw-------@ 1 matheus staff 256K Aug 18 20:36 0000
-rw-------@ 1 matheus staff 40K Aug 18 20:56 0001

6. Execute VACUUM FREEZE on every database on the server

postgres=# VACUUM FREEZE;
VACUUM

postgres=# \c template1
You are now connected to database "template1" as user "postgres".
template1=# VACUUM FREEZE;
VACUUM

template1=# \c template0
You are now connected to database "template0" as user "postgres".
template0=# VACUUM FREEZE;
VACUUM

After the vacuum freeze I still can see the same files on pg_xact/ and
if I cancel the long query the notification is received correctly, and
then if I execute vacuum freeze again on every database the oldest
pg_xact file is truncated.

So, if my tests are correct I don't think that storing the oldest xid is
necessary anymore since I don't think that we can lose notifications
using the patch from Daniil or I'm missing something here?

Thinking about this, maybe another solution for this would be to change
queue advancing pointers to skip the transaction status check? With this
we would not need to touch any vacuum freeze code and instead change the
LISTEN execution to handle this scenario.

Thoughts?

--
Matheus Alcantara

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2025-08-19 00:14:36 Re: A few patches to clarify snapshot management
Previous Message Andres Freund 2025-08-19 00:07:23 Re: Potential deadlock in pgaio_io_wait()