| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | Iwata, Aya/岩田 彩 <iwata(dot)aya(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Kuroda, Hayato/黒田 隼人 <kuroda(dot)hayato(at)fujitsu(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE |
| Date: | 2026-03-31 07:00:00 |
| Message-ID: | f913fba1-da59-404c-9eb3-07c7304be637@gmail.com |
| Lists: | pgsql-hackers |
Hello Michael,
21.03.2026 04:46, Michael Paquier wrote:
> So we are able to send the requests to the workers, and these can take
> a long time before being processed by the postmaster. Querying
> directly "postgres" for the worker_spi_launch() and pg_stat_activity
> queries seems to have reduced the friction, with less requests to
> send. However, I don't think that this is the end of the story, even
> after 79a5911fe65b I have spotted one case of RENAME TO where the
> requests were sent for a bit more than 4s, before the postmaster had
> the idea to catch up. RENAME TO is the only one that can get slow
> (really no idea why), so I guess that we could always tweak things a
> bit more:
> 1) Extra injection point to increase the timeout (30s or 60s?) and
> give the postmaster more room to proceed the requests.
> 2) Remove this portion of the test, but it would be sad.
>
> I'll keep an eye for more failures, even if the situation is looking
> slightly better.
Having reproduced this locally (running 3 tests in parallel with
ALTER DATABASE RENAME repeated 200 times, on a slow riscv64 machine), I
discovered that in the bad case the worker doesn't reach the main loop in
time (and CHECK_FOR_INTERRUPTS() inside it), because it doesn't get out of
initialize_worker_spi() -> CommitTransactionCommand().
With this modification:
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3752,3 +3752,3 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
*/
- int ntries = 50;
+ int ntries = 500;
@@ -3798,3 +3798,6 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
if (!found)
+{
+elog(LOG, "!!!CountOtherDBBackends| found no backends, try %d", tries);
return false; /* no conflicting backends, so done */
+}
I can see the following:
... !!!CountOtherDBBackends| found no backends, try 1
# most of the calls (200 of 201) succeeded with try 1, but there are also:
... !!!CountOtherDBBackends| found no backends, try 7
... !!!CountOtherDBBackends| found no backends, try 51
... !!!CountOtherDBBackends| found no backends, try 74
... !!!CountOtherDBBackends| found no backends, try 84
So the backend is not completely stuck, but CommitTransactionCommand()
may take more than 5 seconds under some circumstances (it may be worth
investigating which ones exactly).
Best regards,
Alexander