Quick Links

Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
Cc:	pgsql-bugs(at)lists(dot)postgresql(dot)org, Thom Brown <thom(at)linux(dot)com>
Subject:	Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Date:	2025-10-08 19:08:28
Message-ID:	3uiezmcbzueur7icgprzxgm7og3zu4lofxmebaryqzmm3kz2uv@5vssptglj7gf
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

Hi,

On 2025-10-08 19:46:42 +0200, Marco Boeringa wrote:
> > The fact that it runs without a problem in 17 means it's actually rather
> meaningful to look at the query plan. It could have changed. Separately, it
> might help us to narrow down what changes to look at that could potentially
> be causing problems.
>
> I would fully understand if this was an "ordinary" issue case with a simple
> self-contained query and with things going wrong each time in the same way.
> However, as said, besides the major issue of running the query separately
> from my geoprocessing workflow which involves many more steps - which would
> mean that any test outside of it would *not* be very much representative of
> what is going on inside my tool and geoprocessing workflow - there is the
> fact that things going wrong is a random anomaly. I cannot stress this
> enough: about 3-4 in 5 runs are OK, then a random follow up run *with
> exactly the same input data* turns out bad with the stall. Even if there was
> an easy way to run the query, I think the chance is highly likely the
> postgres query planner would come up with a decent plan, as in normal
> circumstances, there is no issue.

Even just knowing whether the "normal query plan" is the same one as we see in
profiles of "stuck" backends is valuable. Even if the query plan is perfectly
normal, it *still* is very important to know in which order the joins are
evaluated etc. But there also might be changes in the query plan between 17
and 18 that trigger the issue...

Without more details about what is expected to be run and what is actually
happening, it's just about impossible for us to debug this without a
reproducer that we can run and debug ourselves.

> > Making vacuum more aggressive won't really help much if you have
> longrunning queries/sessions, since vacuum can't clean up row versions that
> are still visibile to some of the transactions.
>
> My code batches the updates in sets of 2000 records at a time and then
> COMMITs, so the transactions themselves are limited in time and size, which
> should allow vacuum to do its job.

Are the "stuck" backends stuck within one 2000 record batch, or are they
"just" slower processing each batch?

So the query plan we have is a nested loop between at least three tables
(there are two joins, c.f. the two ExecNestLoop calls), where there presumably
are a lot of row [versions] on the inner side of the innermost join.

In [1] you showed a query. Reformated that looks like this:

UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
perim_geo = t2.perim_geo,
compact_geo = CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END,
npoints_geo = t2.npoints_geo,
comp_npoints_geo = CASE WHEN t2.npoints_geo > 0 THEN (CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END / t2.npoints_geo) ELSE 0 END,
convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography,true)) ELSE 1 END
FROM (
SELECT
objectid,
ST_Area(way::geography,true) AS area_geo,
ST_Perimeter(way::geography,true) AS perim_geo,
ST_NPoints(way) AS npoints_geo
FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
WHERE (t2.objectid = t1.objectid)
AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3)

Which certainly fits with two nested loops, although I don't think I can infer
which order it the joins are in.

Is osm.landcover_scrubs_small_scale_2_ply.object_id unique?

Can there be multiple rows for one object_id in
mini_test.osm.osm_tmp_28128_ch5?

Are there indexes on mini_test.osm.osm_tmp_28128_ch5.unique_id and
osm.landcover_scrubs_small_scale_2_ply?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/53b44572-0ceb-4149-b361-07da2da91032%40boeringa.demon.nl

In response to

Re: Potential "AIO / io workers" inter-worker locking issue in PG18? at 2025-10-08 17:46:42 from Marco Boeringa

Responses

Re: Potential "AIO / io workers" inter-worker locking issue in PG18? at 2025-10-08 20:09:04 from Marco Boeringa
Re: Potential "AIO / io workers" inter-worker locking issue in PG18? at 2025-10-09 07:40:03 from Marco Boeringa
Re: Potential "AIO / io workers" inter-worker locking issue in PG18? at 2025-10-10 10:03:02 from Marco Boeringa
Re: Potential "AIO / io workers" inter-worker locking issue in PG18? at 2025-10-12 08:24:51 from Marco Boeringa

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Marco Boeringa	2025-10-08 20:09:04	Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Previous Message	Marco Boeringa	2025-10-08 17:46:42	Re: Potential "AIO / io workers" inter-worker locking issue in PG18?