Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Thom Brown <thom(at)linux(dot)com>
Subject: Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Date: 2025-10-08 07:49:04
Message-ID: 53b44572-0ceb-4149-b361-07da2da91032@boeringa.demon.nl
Lists: pgsql-bugs

Hi Andres,

> Could you show perf report --no-children? That would show us which individual
> functions, rather than call-stacks, take the longest.

See the full output below!

> I don't immediately see how this could be related to AIO.

Yes, you could be right that this is not related to AIO at all, but rather to another issue introduced in PG18. The only reason I initially thought of AIO was, of course, that it is one of the main new features of PG18, and I could imagine the "workers" getting into some sort of inter-worker locking issue, just like threads can.

For sure, I did not see this issue in <= PG17, so some change in PG18 is causing it. Additionally, there is a small chance it might be related to PostGIS, as that was upgraded as well (3.5.2 --> 3.6.0) during the PG upgrade, since PG18 requires at least PostGIS 3.6.0. The query does use PostGIS functions, but none that, AFAIK, rely on e.g. a spatial index: functions like ST_Area just process an individual geometry, not the spatial relationship between multiple geometries.

As I wrote before, this is a multi-threaded Python application (actually developed in a GIS) that uses Python's 'concurrent.futures' threading framework to create jobs of records to process for each thread, significantly speeding up the processing. The queries are in fact dynamically built by the code, and part of a much larger geoprocessing workflow, so it is hard to run them separately and provide a query plan (although in this particular case I could, by rewriting part of the query below).

However, I want to stress that any query plan is unlikely to yield anything. Under normal circumstances, and in PG17 and below, this code runs fine! It is only about 1 in 4 runs in PG18 that goes berserk and makes a processing step that should take < 10 seconds suddenly take > 2 hours.

UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE WHEN t2.area_geo > 0
        THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi()))
        ELSE 0 END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE WHEN t2.npoints_geo > 0
        THEN (CASE WHEN t2.area_geo > 0
                  THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi()))
                  ELSE 0 END / t2.npoints_geo)
        ELSE 0 END,
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0
        THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography,true))
        ELSE 1 END
FROM (SELECT objectid,
             ST_Area(way::geography,true) AS area_geo,
             ST_Perimeter(way::geography,true) AS perim_geo,
             ST_NPoints(way) AS npoints_geo
      FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
WHERE (t2.objectid = t1.objectid)
  AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3)

> So somehow >60% of the CPU time is spent fetching tuples corresponding to
> index entries. That seems ... a lot. Is it possible that you have a lot of
> dead rows in the involved tables?

Yes, that is perfectly possible. However, the particular table is only
just over 100k records. It is true that my code is designed to process
literally *every* record in a table, but I specifically set adjusted
table storage parameters with much more aggressive autovacuum settings
(essentially forcing a vacuum after something like 10k dead tuples,
irrespective of the size of the table). This has worked really well: I
have successfully UPDATEd Facebook Daylight-sized tables of > 1B records
with the same code, without ever running into this particular issue, nor
into transaction ID wraparound issues.
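To illustrate the kind of per-table override I mean, it is along these
lines (the exact values here are illustrative, not necessarily the
settings on this particular table):

ALTER TABLE osm.landcover_scrubs_small_scale_2_ply SET (
    -- vacuum after a fixed number of dead tuples, regardless of table size
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 10000
);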

One other thing to note: due to careful design of the jobs to take page
locality into account, a deliberately low table fill factor, and plenty
of RAM, quite a few (but not all) of the steps in the geoprocessing
workflow manage to run almost completely as PostgreSQL 'HOT' updates,
even when updating all records in the table, and even for very large
tables.
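For reference, the dead-tuple and HOT-update statistics for this table
can be checked with a query against the standard pg_stat_user_tables
view, something like:

SELECT relname, n_live_tup, n_dead_tup,
       n_tup_upd, n_tup_hot_upd,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'osm'
  AND relname = 'landcover_scrubs_small_scale_2_ply';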

*** sudo perf -p <PID of one stuck postgres backend> -g -d 10 ***
*** sudo perf report --no-children ***

Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000
  Overhead  Command   Shared Object      Symbol
+   12,81%  postgres  postgres           [.] LWLockAttemptLock
+   10,70%  postgres  postgres           [.] heap_hot_search_buffer
+    9,27%  postgres  postgres           [.] tts_heap_getsomeattrs.lto_priv.0
+    6,77%  postgres  postgres           [.] LWLockReleaseInternal
+    5,89%  postgres  postgres           [.] hash_search_with_hash_value
+    5,79%  postgres  postgres           [.] ExecInterpExpr
+    4,14%  postgres  postgres           [.] PinBuffer
+    3,61%  postgres  postgres           [.] heapam_index_fetch_tuple.lto_priv.0
+    3,00%  postgres  postgres           [.] ReleaseAndReadBuffer
+    2,43%  postgres  postgres           [.] LWLockAcquire
+    2,38%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
+    2,36%  postgres  postgres           [.] index_getnext_tid
+    1,96%  postgres  postgres           [.] index_fetch_heap
+    1,90%  postgres  postgres           [.] _bt_readpage
+    1,85%  postgres  postgres           [.] LWLockRelease
+    1,74%  postgres  postgres           [.] heap_page_prune_opt
+    1,49%  postgres  postgres           [.] IndexNext
+    1,47%  postgres  postgres           [.] _bt_checkkeys
+    1,47%  postgres  postgres           [.] ExecScan
+    1,47%  postgres  postgres           [.] XidInMVCCSnapshot
+    1,19%  postgres  postgres           [.] StartReadBuffer
+    1,18%  postgres  postgres           [.] GetPrivateRefCountEntry.lto_priv.0
+    1,12%  postgres  postgres           [.] _bt_next
+    1,10%  postgres  postgres           [.] UnpinBufferNoOwner.lto_priv.0
+    1,09%  postgres  postgres           [.] MemoryContextReset
+    1,08%  postgres  postgres           [.] btgettuple
+    0,91%  postgres  postgres           [.] _bt_check_compare.lto_priv.0
+    0,82%  postgres  postgres           [.] ExecEvalSysVar
+    0,82%  postgres  postgres           [.] ExecStoreBufferHeapTuple
+    0,71%  postgres  postgres           [.] hash_bytes
+    0,65%  postgres  postgres           [.] tts_virtual_clear.lto_priv.0
+    0,63%  postgres  postgres           [.] GlobalVisTestFor
+    0,63%  postgres  postgres           [.] ExecNestLoop
+    0,55%  postgres  postgres           [.] HeapTupleIsSurelyDead
+    0,52%  postgres  postgres           [.] ResourceOwnerForget
     0,48%  postgres  postgres           [.] slot_getsomeattrs_int
     0,42%  postgres  postgres           [.] PredicateLockTID
     0,40%  postgres  postgres           [.] ReadBufferExtended
     0,37%  postgres  postgres           [.] _bt_saveitem
     0,31%  postgres  postgres           [.] ExecIndexScan
     0,30%  postgres  libc.so.6          [.] __memcmp_sse2
     0,28%  postgres  postgres           [.] _bt_setuppostingitems
     0,26%  postgres  postgres           [.] ReleaseBuffer
     0,23%  postgres  postgres           [.] ResourceOwnerEnlarge
     0,23%  postgres  postgres           [.] HeapCheckForSerializableConflictOut
     0,22%  postgres  postgres           [.] IncrBufferRefCount
     0,18%  postgres  postgres           [.] pgstat_count_io_op
