From: Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
To: Daniil Davydov <3danissimo(at)gmail(dot)com>
Cc: noah(at)leadboat(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stepan Yankevych <Stepan_Yankevych(at)epam(dot)com>, Stepan Yankevych <stepya(at)ukr(dot)net>, "sergii(dot)zhuravlev(at)smartnet(dot)ua" <sergii(dot)zhuravlev(at)smartnet(dot)ua>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Shruthi Gowda <gowdashru(at)gmail(dot)com>
Subject: Re: BUG #16961: Could not access status of transaction
Date: 2025-08-01 03:41:17
Message-ID: CAK98qZ1Xvu-FketrQGiYWCLosSBf3S8POYyZ_Uh+no1K00zv+Q@mail.gmail.com
Lists: pgsql-bugs
Hello,
A customer encountered this issue, and thanks to Andrei Varashen's
very well-described reproduction steps in the duplicate thread[1], I can
reproduce it on 16.3 and on the master branch.
On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo(at)gmail(dot)com> wrote:
> One of our clients faced this problem, and I found out that long
> transactions may have nothing to do with it.
>
> We have the following logic in the notify queue:
> If there are no listeners within all databases, and we are calling
> LISTEN, then we must iterate from 'tail' to 'head' of the queue and
> check statuses of transactions (see Exec_ListenPreCommit).
> If there is a pruned-away xid in the queue, we will try to access its
> status and get an error.
>
> Because the tail of the queue is not necessarily always advanced
> forward by the listeners, we can get such error without any long lived
> transactions.
>
Thank you, Daniil, you are exactly correct!
Here's the stack trace on the master branch:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 8.1
  * frame #0: 0x0000000102c958c8 postgres`SlruReportIOError(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:1084:4
    frame #1: 0x0000000102c952b0 postgres`SimpleLruReadPage(ctl=0x00000001037aa5e0, pageno=0, write_ok=true, xid=757) at slru.c:603:4
    frame #2: 0x0000000102c95f7c postgres`SimpleLruReadPage_ReadOnly(ctl=0x00000001037aa5e0, pageno=0, xid=757) at slru.c:661:9
    frame #3: 0x0000000102c866bc postgres`TransactionIdGetStatus(xid=757, lsn=0x000000016d27b5e8) at clog.c:745:11
    frame #4: 0x0000000102c9a924 postgres`TransactionLogFetch(transactionId=757) at transam.c:79:14
    frame #5: 0x0000000102c9a74c postgres`TransactionIdDidCommit(transactionId=757) at transam.c:130:14
    frame #6: 0x0000000102de4a2c postgres`asyncQueueProcessPageEntries(current=0x000000016d27b720, stop=QueuePosition @ 0x000000016d27b690, page_buffer="\U0000001c", snapshot=0x00000001310439a0) at async.c:2069:13
    frame #7: 0x0000000102de47ec postgres`asyncQueueReadAllNotifications at async.c:1981:18
    frame #8: 0x0000000102de29f4 postgres`Exec_ListenPreCommit at async.c:1127:3
    frame #9: 0x0000000102de24e4 postgres`PreCommit_Notify at async.c:881:6
    frame #10: 0x0000000102ca877c postgres`CommitTransaction at xact.c:2341:2
    frame #11: 0x0000000102ca3b2c postgres`CommitTransactionCommandInternal at xact.c:3214:4
    frame #12: 0x0000000102ca3a44 postgres`CommitTransactionCommand at xact.c:3175:10
    frame #13: 0x0000000103238d1c postgres`finish_xact_command at postgres.c:2833:3
    frame #14: 0x00000001032368d4 postgres`exec_simple_query(query_string="LISTEN test_chan;") at postgres.c:1298:4
    frame #15: 0x000000010323599c postgres`PostgresMain(dbname="test", username="alex.wang") at postgres.c:4767:7
    frame #16: 0x000000010322cca8 postgres`BackendMain(startup_data=0x000000016d27de48, startup_data_len=24) at backend_startup.c:124:2
    frame #17: 0x0000000103104bcc postgres`postmaster_child_launch(child_type=B_BACKEND, child_slot=78, startup_data=0x000000016d27de48, startup_data_len=24, client_sock=0x000000016d27ded8) at launch_backend.c:290:3
    frame #18: 0x000000010310c5a0 postgres`BackendStartup(client_sock=0x000000016d27ded8) at postmaster.c:3587:8
    frame #19: 0x000000010310a404 postgres`ServerLoop at postmaster.c:1702:6
    frame #20: 0x0000000103108ebc postgres`PostmasterMain(argc=3, argv=0x00006000018854e0) at postmaster.c:1400:11
    frame #21: 0x0000000102f8f6b8 postgres`main(argc=3, argv=0x00006000018854e0) at main.c:231:4
    frame #22: 0x00000001940a2b98 dyld`start + 6076
This stack trace is from a LISTEN command issued by a new listener on
the channel. This new listener needs to process previously committed
notifications. If no other active connections are listening to the
same channel, it starts processing from the tail of the queue.
This problem occurs when an AsyncQueueEntry is still present in the
queue, but the sender's xid it contains is older than the database's
frozenxid. This can happen if a previous listener disconnects, too few
notifications arrive afterwards for the notifier to advance the async
queue's tail, but enough transactions pass for VACUUM to truncate the
pg_xact logs.
On Thu, Jul 31, 2025 at 8:21 PM Daniil Davydov <3danissimo(at)gmail(dot)com> wrote:
> > > > We could prevent the trouble if vac_truncate_clog() had access to the oldest
> > > > xid in the notification queue; it would set frozenXID to that value if
> > > > frozenXID would otherwise be older.
> > >
> > > Perhaps. I'm not sure how hard it is to extract the oldest xid in the
> > > queue (where "oldest" is defined as "numerically smallest"). The entries
> > > are in xid commit order which is a different thing.
> >
> > Yeah, it wouldn't be cheap in the general case. The value could be a field in
> > pg_control, updated by a separate VACUUM NOTIFY, which autovacuum would also
> > run roughly as often as autovacuum visits template0.
>
> As a temporary solution, which almost completely eliminates the
> possibility of such a situation, I suggest adding an
> asyncQueueAdvanceTail call to vacuum (inside vac_update_datfrozenxid).
> I mean 'static asyncQueueAdvanceTail()' that should be made
> 'external'. Thus, we can get rid of the problem that I described
> above.
>
> But there is one more: if the listener client is lagging, the
> AsyncQueueAdvanceTail call will not save us (because it cannot advance
> tail any further than the listener's position in the queue).
> Again, this may be due to a very high load, but not because the client
> keeps the transaction open for a very long time.
>
> The best solution is to teach vacuum to recognize the minimum xid in
> constant time, but I didn't come up with any sane implementations.
> What do you think?
>
I'm not sure what the best solution is here. When ComputeXidHorizons()
determines the cutoff xid, it doesn't check asyncQueueControl for the
sender xids stored in each AsyncQueueEntry. It makes sense that the
async queue itself isn't expected to survive a database restart, but
should the sender xids it stores still be considered?
What are your thoughts?
Best,
Alex