Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Thom Brown <thom(at)linux(dot)com>
Subject: Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Date: 2025-10-20 09:34:19
Message-ID: eeb6b15c-1b06-4f4e-bab8-0b3fe1b5c80f@boeringa.demon.nl
Lists: pgsql-bugs

Hi David,

To be honest I am not a SQL wizard like some of you here on the list,
but part of the reason I set up the query the way it currently is, is
that PostGIS function calls like ST_Area and ST_Perimeter can be very
expensive depending on the complexity and size of the geometry, and I
want to avoid at all costs recalculating them unnecessarily multiple
times in the same query. Maybe I am misunderstanding how PostgreSQL
processes such queries, but I need the values multiple times to
calculate some other parameters. So unless PostgreSQL is smart enough
to cache the result and not execute ST_Area repeatedly when it is
referenced multiple times in the same query, I thought it wise to
separate out the calculation and use the SELECT's results as input for
the calculation of the other parameters. Maybe that isn't actually
needed, but I think I remember seeing performance gains from this setup
when I initially wrote it this way; I am not entirely sure at this
point, it was a while ago. I know far too little about the internals of
PostgreSQL and its exact query processing to say anything really
sensible about this.
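
For illustration only, a rough sketch of the pattern I mean - placeholder
table and column names, not my actual query - where the expensive functions
are computed once in a derived table and the results are reused:

    UPDATE osm.some_ply AS t
    SET    area        = s.area,
           perimeter   = s.perim,
           compactness = 4 * pi() * s.area / (s.perim * s.perim)
    FROM  (SELECT objectid,
                  ST_Area(way)      AS area,
                  ST_Perimeter(way) AS perim
           FROM   osm.some_ply) AS s
    WHERE  t.objectid = s.objectid;

    -- What I was trying to avoid: repeating ST_Area(t.way) and
    -- ST_Perimeter(t.way) inside several SET expressions of the same UPDATE.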

A couple of things that still strike me:

- As I wrote in a previous post, just before entering the Python
multi-threaded processing that generates the jobs, I am adding a primary
key column with unique objectids (GENERATED BY DEFAULT AS IDENTITY),
which is subsequently used in the join. I realize this is actually an
enhancement request, but I am wondering: if PostgreSQL already needs to
go through the entire table to add a new column with unique objectid
values, why doesn't it automatically update the statistics of the newly
added column and write out the proper number of records of the table -
which must be known by the end of the column addition - to the
'pg_class.reltuples' field you refer to? It seems to me that would
remove the need for the ANALYZE here, and at least ensure that some
useful statistics are available on the vital primary key column and on
the relation size (see the sketch right after this item).
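
A minimal sketch of what I mean - placeholder table name, roughly what the
Python code issues just before the jobs start:

    ALTER TABLE osm.some_ply
        ADD COLUMN objectid bigint GENERATED BY DEFAULT AS IDENTITY;
    ALTER TABLE osm.some_ply
        ADD PRIMARY KEY (objectid);

    -- Currently still needed explicitly; without it, pg_class.reltuples and
    -- the statistics for objectid can remain those of an empty table:
    ANALYZE osm.some_ply;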

- As I wrote in a previous post, the multi-threaded Python code creates
multiple jobs (dozens). What I am seeing is that only some of the jobs
fail, and in some runs none at all. E.g. I can run my tool 3 times
without issues, then on the fourth run some of the jobs get the bad plan
(I only see this behavior in PG18, I never saw it in <= PG17). In all of
these cases, the input data is *exactly* the same. The planning behavior
therefore appears non-deterministic. As I understood it, PostgreSQL may
indeed behave non-deterministically if it switches to the genetic
algorithm on complex queries with many joins, but I have the feeling
that my query doesn't reach that level of complexity? Or am I wrong
here, and do you consider it likely it went through the genetic
algorithm? It would actually be desirable if EXPLAIN (and especially
'auto_explain') output always showed whether the genetic algorithm was
activated, so one could judge whether non-deterministic planner behavior
is to be expected (see the sketch right after this item).
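
If it helps, I can try to rule the genetic algorithm in or out on my side
with the standard GEQO settings (nothing specific to my setup):

    SHOW geqo;            -- on by default
    SHOW geqo_threshold;  -- default 12; GEQO is only used for queries with
                          -- at least this many FROM items
    SET geqo = off;       -- for a test run, force the exhaustive planner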

- Lastly, did you notice the likely "good" plan I posted below the "bad"
one? I generated that one by simply copying the visible query to pgAdmin
and hitting EXPLAIN, so it's not the real thing as captured by
'auto_explain', but it does show some marked differences between the two
plans. Do you have any comments on the differences? (A sketch of how I
could capture a more comparable plan next time follows below.)
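
If useful, I can also capture a fully instrumented plan next time instead of
the plain pgAdmin EXPLAIN, so both plans are directly comparable (sketch; the
UPDATE would be the exact statement from the auto_explain log):

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    UPDATE osm.landcover_grassy_small_scale_2_ply ...;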

Marco

On 20-10-2025 at 03:33, David Rowley wrote:
> On Mon, 20 Oct 2025 at 09:37, Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl> wrote:
>> I am not sure why in this nested loop, two index scans on essentially
>> the same key and table are executed. You can compare this bad plan with
> The query contains a self-join, so that's why you're seeing the index
> scanned twice in the query plan. If that's not needed, then you should
> remove it from the query. If objectid is unique for this table then I
> don't see why you need to join the table again to access the very same
> row that you're updating. Just put those function calls in the
> UPDATE's SET clause.
>
> (We do have self join elimination in v18, but I see that it's a bit
> overly strict in what it removes around looking for duplicate
> relations when one of them is the query's result relation. Likely that
> can be made better so it still looks for duplicate relations including
> the result relation, but just never considers removing that one, only
> the other duplicate(s).)
>
>> *** ACTUAL BAD PLAN AS CAPTURED BY auto_explain ***:
>> -> Index Scan using
>> landcover_grassy_small_scale_2_ply_pkey on
>> osm.landcover_grassy_small_scale_2_ply t1 (cost=0.38..2.39 rows=1
>> width=310) (actual time=5.176..462.613 rows=222396.00 loops=1)
>> Output: t1.way, t1.ctid, t1.objectid
>> Index Searches: 1
>> Buffers: shared hit=66411
> This table must have been VACUUMed or ANALYZEd either when it was
> empty or when it contained 1 row. There's no predicate here, so that
> estimate, aside from clamping to 1, comes directly from
> pg_class.reltuples. A new table or truncated table would never
> estimate 1 row as the planner always plays it safe when there are no
> statistics generated yet and assumes 10 pages worth of rows. I can't
> think of any specific reason why v18 behaves differently from v17 on
> this... Maybe you've gotten unlucky with an autovacuum timing thing
> and it's running at a slightly different time than in v17, perhaps
> because it completed the autovacuum of another table slightly quicker
> than v17 did. v18 can perform asynchronous reads for vacuum, maybe
> that could mean more vacuum_cost_page_hits and less
> vacuum_cost_page_misses when calculating vacuum_cost_limit.
>
> Or, perhaps you're doing something like performing a manual VACUUM
> after the tables have had all of their rows deleted?
>
> David
