Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue

From: "Matheus Alcantara" <matheusssilv97(at)gmail(dot)com>
To: "Masahiko Sawada" <sawada(dot)mshk(at)gmail(dot)com>, "Matheus Alcantara" <matheusssilv97(at)gmail(dot)com>
Cc: Álvaro Herrera <alvherre(at)kurilemu(dot)de>, "Joel Jacobson" <joel(at)compiler(dot)org>, "Arseniy Mukhin" <arseniy(dot)mukhin(dot)dev(at)gmail(dot)com>, "Rishu Bagga" <rishu(dot)postgres(at)gmail(dot)com>, "Yura Sokolov" <y(dot)sokolov(at)postgrespro(dot)ru>, "Daniil Davydov" <3danissimo(at)gmail(dot)com>, "Alexandra Wang" <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Date: 2025-10-25 00:32:54
Message-ID: DDQZB2AD34V4.3RH2USCA72AS8@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the review!

On Thu Oct 23, 2025 at 8:42 PM -03, Masahiko Sawada wrote:
> On Wed, Oct 22, 2025 at 10:25 AM Matheus Alcantara
> <matheusssilv97(at)gmail(dot)com> wrote:
>>
>> On Wed Oct 22, 2025 at 12:49 PM -03, Álvaro Herrera wrote:
>> > On 2025-Oct-22, Matheus Alcantara wrote:
>> >
>> >> I'm wondering if the 002_aborted_tx_notifies.pl is still needed with
>> >> this architecture being used. I think that it's not, but perhaps is a
>> >> good test to keep it?
>> >
>> > I'd rather have tests than not, but I'd think it needs to be behind
>> > PG_TEST_EXTRA because of things like
>> >
>> > +$node->safe_psql('postgres', 'select consume_xids(10000000);');
>> >
>> Attached v10 with wrapping into PG_TEST_EXTRA.
>
> Thank you for updating the patches!
>
> I've reviewed the patches and here are the comments.
>
> v10-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch:
>
> + *
> + * Returns InvalidTransactionId if there are no unprocessed notifications.
>
> This comment is not accurate since the function returns
> InvalidTransactionId even if all unprocessed notifications are created
> on other databases.
>
Fixed

> ---
> +TransactionId
> +GetOldestQueuedNotifyXid(void)
> +{
>
> How about renaming it to something like GetOldestNotifyTransactionId()?
>
Yeap, sounds better, fixed.

> ---
> + /* page_buffer must be adequately aligned, so use a union */
> + union
> + {
> + char buf[QUEUE_PAGESIZE];
> + AsyncQueueEntry align;
> + } page_buffer;
>
> asyncQueueReadAllNotifications() uses this union too, so how about
> define this type as AlignedQueueEntryPage or something and use it in
> both functions?
>
Good point, fixed.

> ---
> +
> + LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
> +
>
> I don't think we need an exclusive lock here.
>
Fixed.

> ---
> In GetOldestQueuedNotifyXid() why do we keep holding NotifyQueueLock
> while calculating the oldest XID in the queue?
>
Yeah, I don't think that it's necessary. The
asyncQueueReadAllNotifications() for example only hold the lock when
reading the QUEUE_BACKEND_POS and QUEUE_HEAD. I think that it's a
similar case here, fixed.

> ---
> +
> + /* Process all entries on this page up to head */
> + while (curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
> QUEUE_PAGESIZE &&
> + !QUEUE_POS_EQUAL(pos, head))
> + {
> + qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
> (snip)
> + /* Advance to next entry */
> + if (asyncQueueAdvance(&pos, qe->length))
> + break; /* advanced to next page */
> +
>
> I think the check "curoffset + QUEUEALIGN(AsyncQueueEntryEmptySize) <=
> QUEUE_PAGESIZE" is essentially the same as the one we do in
> asyncQueueAdvance(). I think we can refactor the inner loop in
> GetOldestQueuedNotifyXid() to something like:
>
> for (;;)
> {
> bool reachedEndOfPage;
>
> qe = (AsyncQueueEntry *) (page_buffer.buf + curoffset);
>
> // check qe->xid here...
>
> reachedEndOfPage = asyncQueueAdvance(&pos, qe->length);
>
> if (reachedEndOfPage || QUEUE_POS_EQUAL(pos, head))
> break;
> }
>
Fixed

> ---
> v10-0002-Add-tap-tests-for-listen-notify-vacuum-freeze.patch:
>
> This new test directory needs to have .gitignore.
>
Fixed

> ---
> +# Copyright (c) 2024-2025, PostgreSQL Global Development Group
> +
>
> It's better to have a short description of this test here.
>
Fixed

> ---
> +use File::Path qw(mkpath);
>
> It seems not necessary.
>
Fixed

> ---
> +$node->safe_psql('postgres',
> + 'CREATE TABLE t AS SELECT g AS a, g+2 AS b from
> generate_series(1,100000) g;'
>
> Why does it need to insert many rows?
>
I think that this value was left when I was writing the first versions of
this test, I don't think that it's necessary. I reduced to just 10 rows.

> ---
> +# --- Session 2, multiple notify's, and commit ---
> +for my $i (1 .. 10)
> +{
> + $node->safe_psql(
> + 'postgres', "
> + BEGIN;
> + NOTIFY s, '$i';
> + COMMIT;");
> +}
>
> Why does it need to send a notification 10 times?
>
I just wanted to test with multiple notifications, we can reduce to two
or three, there is not specific reason to send 10 notifications.

> ---
> +# Consume enough XIDs to trigger truncation
> +$node->safe_psql('postgres', 'select consume_xids(10000000);');
> +
>
> I guess it consumes XID too much to trigger truncation. Given the one
> clog segment file is 256kB in size, it's enough to consume 1,050,000
> XIDs to move to the next segment file.
>
Fixed

> ---
> +# Get the new datfrozenxid after vacuum freeze to ensure that is advanced but
> +# we can still get the notification status of the notification
> +my $datafronzenxid_freeze = $node->safe_psql('postgres', "select
> datfrozenxid from pg_database where datname = 'postgres'");
> +ok($datafronzenxid_freeze > $datafronzenxid, 'datfrozenxid is advanced');
>
> I think this test passes in all cases, so it is meaningless. Instead,
> what we need to check in terms of datfrozenxid is that its value
> doesn't get greater than the XID we used for the notification, even
> after vacuum freeze. I think we remember the XID used for NOTIFY and
> check if the old/new datfrozenxid values are older than that value.
>
I think that was the goal of this test. I've swap the values to make it
more clear. Please let me know if I misunderstood your point here.

> ---
> +is($notifications_count, 10, 'received all committed notifications');
> +
> +done_testing();
>
> After consuming the unconsumed notifications, let's do vacuum freeze
> and check if datfrozenxid now can be advanced.
>
Good point, during this I realize that the datafrozenxid was not being
advanced even after the queue being consumed. This was because when a
backend listener is consuming the notifications it only update their own
queue tail pointer and not the global shared tail pointer, so I've
included a call to asyncQueueAdvanceTail() at the beginning of
GetOldestNotifyTransactionId() to do this.

> ---
> I would expect to add 002_aborted_tx_notifies.pl in a separate patch
> since it's not related to this bug fix.
>
On the new attached version I've moved this test to a separated patch.

> ---
> +# Test checks that listeners do not receive notifications from aborted
> +# transaction even if notifications have been added to the listen/notify
> +# queue. To reproduce it we use the fact that serializable conflicts
> +# are checked after tx adds notifications to the queue.
>
> I wonder if we could implement this test using the isolation test
> instead of the tap test. Is there any reason why you used a tap test
> for that?
>
On this new version I just moved the test into a separated patch but I
agree that we can implement this using the isolation test,

I don't know how we should handle the follow-up patches that Joel have
sent on [1], if we should incoportate on 0001 or not. I still need to
review and test them.

[1] https://www.postgresql.org/message-id/b6f6cbb8-1903-4ca0-ae64-01d84d1e12a3%40app.fastmail.com

--
Matheus Alcantara

Attachment Content-Type Size
v11-0001-Prevent-VACUUM-from-truncating-XIDs-still-presen.patch text/plain 15.6 KB
v11-0002-Add-more-tests-for-listen-notify-vacuum-freeze.patch text/plain 4.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Matheus Alcantara 2025-10-25 00:33:33 Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue
Previous Message Sami Imseih 2025-10-25 00:04:59 Re: Bug in pg_stat_statements