Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Thom Brown <thom(at)linux(dot)com>
Subject: Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Date: 2025-10-20 11:22:45
Message-ID: 6a0801d5-cd16-436c-950d-1713ee586654@boeringa.demon.nl
Lists: pgsql-bugs

Hi David,

Somewhere in my last post I wrote:

"In all of these cases, the input data is *exactly* the same."

I think I need to correct this. Although the input base table
'landcover_grassy_small_scale_2_ply' and its number of records are the
same for each run, the actual contents of the database views that
represent the multi-threading jobs (in the example below
'osm_tmp_28232_ch3'), and the number of records they refer to, *can*
vary in a non-deterministic way, anywhere from 0 to about 5000 records.
Most jobs have several thousand records, close to the 5000 limit, but
some can have considerably fewer. In fact, most of the jobs getting the
bad plan appear to be the ones with fewer records (dozens to a few
hundred, instead of a few thousand); at least that is what I saw with
the last run, but I would need to do further testing to confirm.

That said, the jobs are implemented as ordinary non-materialized
database views, so they don't have statistics of their own (the
underlying table the views refer to is a secondary table called
'landcover_grassy_small_scale_2_ply_pg', derived from the
'landcover_grassy_small_scale_2_ply' table that was visible in the
'auto_explain' output). So this shouldn't actually be that relevant,
should it?
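
As a quick, hypothetical check (plain views never show up in
'pg_stats'; only analyzed columns of the underlying tables do), a query
like the one below, using the names from this thread, should return
rows only for the table, not for the view:

SELECT schemaname, tablename, attname, n_distinct
FROM pg_stats
WHERE tablename IN (
    'osm_tmp_28232_ch3',                    -- the view: no rows
    'landcover_grassy_small_scale_2_ply_pg');  -- the table: its stats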

Marco

See query below for reference:

UPDATE
    osm.landcover_grassy_small_scale_2_ply AS t1 SET
    area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE
        WHEN t2.area_geo > 0
            THEN ((power(t2.perim_geo, 2) / t2.area_geo) / (4 * pi()))
        ELSE 0
    END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE
        WHEN t2.npoints_geo > 0 THEN (CASE
            WHEN t2.area_geo > 0
                THEN ((power(t2.perim_geo, 2) / t2.area_geo) / (4 * pi()))
            ELSE 0
        END / t2.npoints_geo)
        ELSE 0
    END,
    convex_ratio_geo = CASE
        WHEN ST_Area(ST_ConvexHull(way)::geography, TRUE) > 0
            THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography, TRUE))
        ELSE 1
    END
FROM
    (
    SELECT
        objectid,
        ST_Area(way::geography, TRUE) AS area_geo,
        ST_Perimeter(way::geography, TRUE) AS perim_geo,
        ST_NPoints(way) AS npoints_geo
    FROM
        osm.landcover_grassy_small_scale_2_ply) AS t2
WHERE
    (t2.objectid = t1.objectid)
    AND t1.objectid IN (
    SELECT
        t3.objectid
    FROM
        mini_test.osm.osm_tmp_28232_ch3 AS t3)
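
As an aside on the recalculation concern quoted below: as far as I
understand it, a CTE marked MATERIALIZED cannot be inlined by the
planner, so the expensive PostGIS calls would be guaranteed to be
evaluated only once per row. A rough, untested sketch along those lines
(not the production query):

WITH vals AS MATERIALIZED (
    -- expensive geography calls evaluated exactly once per row
    SELECT
        objectid,
        ST_Area(way::geography, TRUE) AS area_geo,
        ST_Perimeter(way::geography, TRUE) AS perim_geo,
        ST_NPoints(way) AS npoints_geo
    FROM
        osm.landcover_grassy_small_scale_2_ply
    WHERE objectid IN (
        SELECT objectid FROM mini_test.osm.osm_tmp_28232_ch3)
)
UPDATE osm.landcover_grassy_small_scale_2_ply AS t1
SET
    area_geo = v.area_geo,
    perim_geo = v.perim_geo,
    npoints_geo = v.npoints_geo
FROM vals AS v
WHERE v.objectid = t1.objectid;

The derived columns (compact_geo and so on) could then be computed from
v.area_geo and v.perim_geo in the SET list without re-invoking the
PostGIS functions.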

On 20-10-2025 11:34, Marco Boeringa wrote:
> Hi David,
>
> To be honest I am not a SQL wizard like some of you here on the list,
> but part of the reason I set up the query as it currently is, is that
> PostGIS function calls like ST_Area and ST_Perimeter can be very
> expensive depending on the complexity and size of the geometry, and I
> thus want to avoid at all costs having to recalculate them
> unnecessarily multiple times in the same query. Maybe I am
> misunderstanding how PostgreSQL processes such queries, but I need
> the values multiple times to calculate some other parameters. So
> unless PostgreSQL is smart enough to cache the result and not execute
> ST_Area multiple times if it is used multiple times in the same
> query, I thought it wise to separate out the calculation and use the
> SELECT's results as input for the calculation of the other
> parameters. Maybe that isn't actually needed, but I think I remember
> seeing performance gains from the current setup when I initially
> wrote it this way. I am not entirely sure at this point in time,
> though; it was a while ago. I know far too little about the internals
> of PostgreSQL and its exact query processing to say anything really
> sensible about this.
>
> A couple of things that still strike me:
>
> - As I wrote in a previous post, just before entering the Python
> multi-threaded processing that generates the jobs, I am adding a
> primary key column with unique objectids (GENERATED BY DEFAULT AS
> IDENTITY), which is subsequently used in the join. I realize this is
> actually an enhancement request, but I am wondering why, if
> PostgreSQL already needs to dig through the entire table to add a new
> column with unique objectid values, it doesn't automatically update
> the statistics of the newly added column and write out the proper
> number of records of the table, which must be known by the end of the
> addition of the column, to the 'pg_class.reltuples' column you refer
> to. It seems to me this would remove the need for the ANALYZE here,
> and at least ensure that some useful statistics are available on the
> vital primary key column and on the relation size.
>
> - As I wrote in a previous post, the multi-threaded Python code
> creates multiple jobs (dozens). What I am seeing is that only some of
> the jobs fail, and with some runs, none. E.g. I can run my tool 3
> times without issues, then on the fourth run some of the jobs get the
> bad plan (I only see this behavior in PG18, I never saw it in <=
> PG17). In all of these cases, the input data is *exactly* the same.
> The planning behavior therefore appears non-deterministic. As I
> understood it, PostgreSQL may indeed behave non-deterministically if
> it switches to the genetic algorithm on complex queries with many
> joins, but I have the feeling that my query doesn't quite reach that
> level of complexity? Or am I wrong here, and do you consider it
> likely it went through the genetic algorithm (see also the quick
> check at the end of this mail)? It would actually be desirable if
> EXPLAIN (especially 'auto_explain') output always showed whether the
> genetic algorithm was activated, so one could judge whether
> non-deterministic planner behavior is to be expected.
>
> - Lastly, did you notice the likely "good" plan I posted below the
> "bad" one? I generated that one by simply copying the visible query
> to pgAdmin and hitting EXPLAIN, so it's not the real thing as from
> 'auto_explain', but it does show some marked differences between the
> plans. Do you have any comments to add as to the differences?
>
> Marco
>
> On 20-10-2025 03:33, David Rowley wrote:
>> On Mon, 20 Oct 2025 at 09:37, Marco Boeringa
>> <marco(at)boeringa(dot)demon(dot)nl> wrote:
>>> I am not sure why in this nested loop, two index scans on essentially
>>> the same key and table are executed. You can compare this bad plan with
>> The query contains a self-join, so that's why you're seeing the index
>> scanned twice in the query plan. If that's not needed, then you should
>> remove it from the query. If objectid is unique for this table then I
>> don't see why you need to join the table again to access the very same
>> row that you're updating. Just put those function calls in the
>> UPDATE's SET clause.
>>
>> (We do have self join elimination in v18, but I see that it's a bit
>> overly strict in what it removes around looking for duplicate
>> relations when one of them is the query's result relation. Likely that
>> can be made better so it still looks for duplicate relations including
>> the result relation, but just never considers removing that one, only
>> the other duplicate(s).)
>>
>>> *** ACTUAL BAD PLAN AS CAPTURED BY auto_explain ***:
>>>                     ->  Index Scan using
>>> landcover_grassy_small_scale_2_ply_pkey on
>>> osm.landcover_grassy_small_scale_2_ply t1  (cost=0.38..2.39 rows=1
>>> width=310) (actual time=5.176..462.613 rows=222396.00 loops=1)
>>>                           Output: t1.way, t1.ctid, t1.objectid
>>>                           Index Searches: 1
>>>                           Buffers: shared hit=66411
>> This table must have been VACUUMed or ANALYZEd either when it was
>> empty or when it contained 1 row. There's no predicate here, so that
>> estimate, aside from clamping to 1, comes directly from
>> pg_class.reltuples. A new table or truncated table would never
>> estimate 1 row as the planner always plays it safe when there are no
>> statistics generated yet and assumes 10 pages worth of rows. I can't
>> think of any specific reason why v18 behaves differently from v17 on
>> this... Maybe you've gotten unlucky with an autovacuum timing thing
>> and it's running at a slightly different time than in v17, perhaps
>> because it completed the autovacuum of another table slightly quicker
>> than v17 did. v18 can perform asynchronous reads for vacuum, maybe
>> that could mean more vacuum_cost_page_hits and fewer
>> vacuum_cost_page_misses when calculating vacuum_cost_limit.
>>
>> Or, perhaps you're doing something like performing a manual VACUUM
>> after the tables have had all of their rows deleted?
>>
>> David
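
P.S. on the genetic algorithm question above: as far as I know, EXPLAIN
does not report whether GEQO was used, but it can be ruled out per
session, and the planner's current idea of the table size can be
inspected directly. Hypothetical quick checks, for reference:

SHOW geqo;            -- on by default
SHOW geqo_threshold;  -- default 12; GEQO is only considered at this
                      -- many FROM items
SET geqo TO off;      -- rules out the genetic optimizer for the session

-- what the planner currently believes about the table's size
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname = 'landcover_grassy_small_scale_2_ply';

Given the handful of relations in the query above, GEQO seems unlikely
to have been involved.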
