Re: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Iwata, Aya/岩田 彩 <iwata(dot)aya(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Kuroda, Hayato/黒田 隼人 <kuroda(dot)hayato(at)fujitsu(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Termination of Background Workers for ALTER/DROP DATABASE
Date: 2026-03-19 00:54:04
Message-ID: abtJLEAsf1HZXWdR@paquier.xyz
Lists: pgsql-hackers

On Wed, Mar 18, 2026 at 03:52:02PM -0400, Tom Lane wrote:
> which makes me wonder whether the problematic session is the second or
> third bgworker. I am not seeing entries indicating that those
> stopped, as there is for the first bgworker.

Looking at the logs produced at [1], the worker launched as number 1
would not be able to interact: it connects to the database postgres,
under PID 1616001, and is reported as exited by the postmaster.

The only interacting sessions would be:
1) The bgworker launched as number 2, connected to database testdb.
2) The session checking pg_stat_activity, launched by
launch_bgworker(). The test connects to the database we want to
rename, so this session could interact as an extra one. This query
could be run while connected to the database postgres instead, to
reduce the friction and rule this session out.

The timestamps in the logs show that it takes 5 seconds for this host
to get out of the ALTER DATABASE .. RENAME TO, which implies that we
are looping inside CountOtherDBBackends() for 5 seconds. So it really
looks like the second bgworker is the one we are waiting for here.
Now, we are sure of the following things when we try to launch the
RENAME TO:
- The worker is seen in pg_stat_activity.
- The worker is already in worker_spi_main(), per its "LOG initialized
with" entry.
- The worker is connected to the database.
- The worker can receive signals.
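
As a rough illustration of why a 5-second delay points at
CountOtherDBBackends(): as far as I recall, that function retries up to
50 times with a 100ms sleep between attempts before giving up, which
gives a 5s budget. The Python sketch below only models that retry
budget under this assumption. The function and parameter names are
made up for illustration and are not PostgreSQL code.

```python
import time


def count_other_db_backends(still_connected, ntries=50, sleep_ms=100,
                            sleep=time.sleep):
    """Model of the retry budget (hypothetical names, not PostgreSQL code).

    still_connected: callable returning True while a conflicting session
    (here, the bgworker) is still attached to the database.
    Returns (blocked, waited_ms).
    """
    waited_ms = 0
    for _ in range(ntries):
        if not still_connected():
            # No conflicting session left, the command can proceed.
            return False, waited_ms
        # The real function would request termination here, then sleep.
        sleep(sleep_ms / 1000.0)
        waited_ms += sleep_ms
    # Budget exhausted: the caller reports that the database is in use.
    return True, waited_ms


def worker_exiting_after(polls):
    """Fake worker that stays connected for the first `polls` checks."""
    state = {"seen": 0}

    def still_connected():
        state["seen"] += 1
        return state["seen"] <= polls

    return still_connected


if __name__ == "__main__":
    no_sleep = lambda _seconds: None  # skip real waiting in the demo
    # Worker that never exits: the whole budget is consumed.
    print(count_other_db_backends(lambda: True, sleep=no_sleep))
    # Worker that exits after three checks: short wait, no failure.
    print(count_other_db_backends(worker_exiting_after(3), sleep=no_sleep))
```

A worker that never goes away consumes the full 50 x 100ms budget, which
would match the 5 seconds seen in the logs, while a worker that reacts
to the termination request lets the command return much earlier.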

How would it be possible for this worker to not receive the requests?
The only thing I could think of is that the postmaster does not have
the time to process the PMSIGNAL_BACKGROUND_WORKER_CHANGE requests.

The next thing would be to gather more data, I guess. The attached
would help in providing more information. If it happens that we are
able to send the requests and that the postmaster does not have the
time to process them, I don't really see what we can do except:
- Drop the portion of the tests for DROP DATABASE, SET TABLESPACE and
RENAME DB, because all these scenarios involve commands that work on
the same database as the one the worker is connected to. This could
also point to a timing issue with the feature itself, of course.
- Revert the feature, stop playing with the buildfarm due to the end
of the release cycle, and rework it for v20.

For now I am planning to use the attached to get more information from
widowbird, which should take a few days at worst. That would make
clear if we have a timing issue with the requests sent to the
postmaster. Launching the queries for worker_spi_launch() and
pg_stat_activity on the database postgres may also improve things, but
I don't really buy it, even if I may be wrong.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=widowbird&dt=2026-03-17%2015%3A35%3A03
--
Michael

Attachment: 0001-Add-more-debugging-information-for-termination-tests.patch (text/plain, 2.7 KB)
