From: Andres Freund <andres(at)anarazel(dot)de>
To: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Thom Brown <thom(at)linux(dot)com>
Subject: Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Date: 2025-10-08 13:04:03
Message-ID: m7xvuuuf2arrefyhbwwkh665rarg6k2fqozbjnqsfmh23wgxij@h3qufelpfynu
Lists: pgsql-bugs
Hi,
On 2025-10-08 09:49:04 +0200, Marco Boeringa wrote:
> Yes, you could be right this is not related to AIO at all, but another issue
> introduced in PG18.
Right.
> For sure, I did not see this issue in <= PG17, so some change in PG18 is
> causing it. Additionally, there is a small chance it might be related to
> PostGIS, as that was upgraded as well (3.5.2 --> 3.6.0) during the PG
> upgrade, as PG18 requires PostGIS 3.6.0 minimum. And the query does use
> PostGIS functions, but none that AFAIK rely on e.g. a spatial
> index. Functions like ST_Area just process an individual geometry, not the
> spatial relationship between multiple geometries.
Can you test this with postgis 3.6 on 17?
> As I wrote before, this is a multi-threaded Python application (actually
> developed in a GIS), that uses Python's 'concurrent.futures' threading
> framework to create jobs of records to process for each thread,
> significantly speeding up the processing. The queries are in fact
> dynamically built by the code, and part of a much larger geoprocessing
> workflow, so it is hard to run them separately and provide a query plan
> (although in this case I could by rewriting part of the query below).
>
> However, I want to stress that any query plan is unlikely to yield
> anything. In normal circumstances and in PG17 and below, this code runs
> fine! And it is only 1 in maybe 4 runs in PG18 that goes berserk and makes a
> processing step that should take < 10 seconds, all of a sudden take > 2
> hours.
The fact that it runs without a problem in 17 means it's actually rather
meaningful to look at the query plan: it could have changed. Separately, it
might help us narrow down which changes could potentially be causing the
problem.
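If running the statements by hand is impractical, auto_explain can log the plans of the dynamically built queries as the application executes them; a minimal sketch (the thresholds are just examples, tune to taste):

```ini
# in postgresql.conf (or via ALTER SYSTEM), then reload; applies to new sessions
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '5s'   # only log plans of statements slower than this
auto_explain.log_analyze = on          # include actual row counts and timing
auto_explain.log_buffers = on          # include buffer usage in the logged plan
```

That way you'd get the plan of exactly the query instance that went slow, without rewriting part of the workflow.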
> > So somehow >60% of the CPU time is spent fetching tuples corresponding to
> index entries. That seems ... a lot. Is it possible that you have a lot of
> dead rows in the involved tables?
>
> Yes, that is perfectly possible. However, the particular table is only just
> over 100k records.
> It is true that my code is designed to process literally *every* record in a
> table. However, I specifically set adjusted table storage parameters with
> much more aggressive vacuum settings (essentially forcing always vacuum
> after something like 10k dead tuples irrespective of the size of the
> table). This has worked really well, and I have successfully UPDATEd all of
> Facebook Daylight size > 1B records tables with the same code, without ever
> running into this particular issue, nor transaction ID wraparound issues.
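(For reference, the kind of per-table setting described above might look roughly like this; the table name is hypothetical:)

```sql
-- Hypothetical table; vacuum after ~10k dead tuples regardless of table size
ALTER TABLE planet_osm_polygon SET (
    autovacuum_vacuum_threshold = 10000,
    autovacuum_vacuum_scale_factor = 0
);
```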
Making vacuum more aggressive won't really help much if you have long-running
queries/sessions, since vacuum can't clean up row versions that are still
visible to some of the transactions.
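A quick way to check both is to look for old snapshots and for accumulated dead tuples; an untested sketch:

```sql
-- Sessions whose snapshot may be holding back cleanup
SELECT pid, state, now() - xact_start AS xact_age, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_age DESC;

-- Dead tuples per table, as tracked by the statistics system
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```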
> *** sudo perf -p <PID of one stuck postgres backend> -g -d 10 ***
> *** sudo perf report --no-children ***
>
> Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000
> Overhead Command Shared Object Symbol
> + 12,81% postgres postgres [.] LWLockAttemptLock
> + 10,70% postgres postgres [.] heap_hot_search_buffer
> + 9,27% postgres postgres [.] tts_heap_getsomeattrs.lto_priv.0
> + 6,77% postgres postgres [.] LWLockReleaseInternal
> + 5,89% postgres postgres [.] hash_search_with_hash_value
> + 5,79% postgres postgres [.] ExecInterpExpr
> + 4,14% postgres postgres [.] PinBuffer
Could you "unfold" the callstacks for the top entries? And/or attach a
perf report --no-children > somefile
(when redirecting to a file perf will include much more detail than when using
it interactively)
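Something along these lines should produce a suitably detailed report (the PID is a placeholder; --call-graph dwarf usually yields more complete stacks if debug info is available):

```shell
# record ~10s of call stacks from one stuck backend
sudo perf record -p <PID> -g --call-graph dwarf -- sleep 10
# write the full report, with call stacks expanded, to a file
sudo perf report --no-children --stdio > somefile
```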
Greetings,
Andres Freund