BUG #18909: Query creates millions of temporary files and stalls

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: skoposov(at)ed(dot)ac(dot)uk
Subject: BUG #18909: Query creates millions of temporary files and stalls
Date: 2025-05-02 18:45:48
Message-ID: 18909-e5e1b702c9441b8a@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18909
Logged by: Sergey Koposov
Email address: skoposov(at)ed(dot)ac(dot)uk
PostgreSQL version: 15.12
Operating system: Linux, Debian 10
Description:

Hi,
I have a query:

select phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
    g.source_id, pmra, pmdec, parallax, parallax_error,
    d.ra, d.dec, d.ebv, d.gmeanpsfmag, rmeanpsfmag, imeanpsfmag, gkronmag, rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
    on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g
    on (g.source_id = gaps1.source_id)

that I ran from a Python client. After it hadn't finished in roughly an hour,
I started to investigate and found that the pgsql_tmp folder contained 15
million temporary files from that query.
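(For anyone reproducing this: besides listing pgsql_tmp directly, the same
churn is visible from SQL via the cumulative counters in pg_stat_database.
This is a generic monitoring query, not specific to this bug:

-- temp_files/temp_bytes are cumulative per database, so take a
-- snapshot before and after running the query and compare.
select datname, temp_files, temp_bytes
from pg_stat_database
where datname = current_database();
)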
If I EXPLAIN the query in psql, it shows this plan:
------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=14443102.62..111487693.90 rows=65470868 width=100)
   Workers Planned: 4
   ->  Nested Loop Left Join  (cost=14442102.62..104939607.10 rows=16367717 width=100)
         ->  Parallel Hash Left Join  (cost=14442102.04..22124798.60 rows=16367717 width=60)
               Hash Cond: (d.objid = gaps1.original_ext_source_id)
               ->  Parallel Seq Scan on disk_sample1 d  (cost=0.00..1218371.17 rows=16367717 width=60)
               ->  Parallel Hash  (cost=10307380.24..10307380.24 rows=237862624 width=16)
                     ->  Parallel Seq Scan on panstarrs1bestneighbour gaps1  (cost=0.00..10307380.24 rows=237862624 width=16)
         ->  Index Scan using gaia_sourcex_source_id_idx on gaia_source g  (cost=0.58..5.05 rows=1 width=48)
               Index Cond: (source_id = gaps1.source_id)
 JIT:
   Functions: 14
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)
And if I instead run CREATE TABLE ... AS with the same query:

create table disk_sample1_gaia as
select phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
    g.source_id, pmra, pmdec, parallax, parallax_error,
    d.ra, d.dec, d.ebv, d.gmeanpsfmag, rmeanpsfmag, imeanpsfmag, gkronmag, rkronmag
from disk_sample1 as d
left join gaia_edr3_aux.panstarrs1bestneighbour as gaps1
    on (gaps1.original_ext_source_id = d.objid)
left join gaia_edr3.gaia_source as g
    on (g.source_id = gaps1.source_id);

that command finishes fine without creating many temporary files. The
resulting table has 65 million rows.
However, when I ran the query from Python again, I very quickly saw PostgreSQL
create 300k+ temporary files, at which point I interrupted the query.
The query is run from Python using https://github.com/segasai/sqlutilpy,
which basically does this (using psycopg; the equivalent SQL is sketched
below):
1) create a named cursor
2) SET cursor_tuple_fraction TO 1
3) FETCH the result in 100,000-row chunks
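In server-side terms that corresponds roughly to this sequence (the cursor
name is whatever sqlutilpy picks; "sqlutilcursor" is what appears in the
backtrace below):

set cursor_tuple_fraction TO 1;
begin;
declare sqlutilcursor cursor for
    select ... ;   -- the full query from above

-- repeated until no rows come back:
fetch forward 100000 from sqlutilcursor;

close sqlutilcursor;
commit;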
Here's the gdb backtrace of the backend from the situation where it is
creating millions of tmp files:
**********
#0  0x00007fcd4ce6cc8b in __libc_open64 (file=file@entry=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", oflag=oflag@entry=578) at ../sysdeps/unix/sysv/linux/open64.c:48
#1  0x000056430b609eac in BasicOpenFilePerm (fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", fileFlags=578, fileMode=384) at fd.c:1125
#2  0x000056430b60a2f3 in PathNameOpenFilePerm (fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669", fileFlags=578, fileMode=384) at fd.c:1603
#3  0x000056430b60a4fa in PathNameOpenFile (fileFlags=578, fileName=0x7ffcce70e3e0 "base/pgsql_tmp/pgsql_tmp78857.224669") at fd.c:1806
#4  OpenTemporaryFileInTablespace (tblspcOid=<optimized out>, rejectError=<optimized out>) at fd.c:1806
#5  0x000056430b60a6d9 in OpenTemporaryFile (interXact=interXact@entry=false) at fd.c:1743
#6  0x000056430b6082c4 in BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:204
#7  0x000056430b4ccaeb in ExecHashJoinSaveTuple (tuple=tuple@entry=0x56430c243cd8, hashvalue=<optimized out>, hashvalue@entry=1412196535, fileptr=0x7fc500720d98) at nodeHashjoin.c:1248
#8  0x000056430b4c9b3b in ExecHashTableInsert (hashtable=hashtable@entry=0x56430c2434a8, slot=slot@entry=0x56430c208178, hashvalue=hashvalue@entry=1412196535) at nodeHash.c:1714
#9  0x000056430b4ca2fc in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:188
#10 MultiExecHash (node=node@entry=0x56430c10e098) at nodeHash.c:115
#11 0x000056430b4b213e in MultiExecProcNode (node=node@entry=0x56430c10e098) at execProcnode.c:520
#12 0x000056430b4cb5dc in ExecHashJoinImpl (parallel=false, pstate=0x56430c10ca68) at nodeHashjoin.c:297
#13 ExecHashJoin (pstate=0x56430c10ca68) at nodeHashjoin.c:621
#14 0x000056430b4aa92a in ExecProcNode (node=0x56430c10ca68) at ../../../src/include/executor/executor.h:262
#15 ExecutePlan (queryDesc=queryDesc@entry=0x56430c121928, operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=100000, direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x56430bfb70b8) at execMain.c:1640
#16 0x000056430b4ab03b in standard_ExecutorRun (queryDesc=0x56430c121928, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at execMain.c:362
#17 0x00007fcd4d2556d5 in explain_ExecutorRun (queryDesc=0x56430c121928, direction=ForwardScanDirection, count=100000, execute_once=<optimized out>) at auto_explain.c:322
#18 0x000056430b641989 in PortalRunSelect (portal=0x56430c062378, forward=<optimized out>, count=100000, dest=<optimized out>) at pquery.c:922
#19 0x000056430b643356 in PortalRunFetch (portal=0x56430c062378, fdirection=FETCH_FORWARD, count=100000, dest=dest@entry=0x56430bfb70b8) at pquery.c:1425
#20 0x000056430b45ad19 in PerformPortalFetch (stmt=stmt@entry=0x56430bf5e710, dest=dest@entry=0x56430bfb70b8, qc=qc@entry=0x7ffcce70ee00) at portalcmds.c:198
#21 0x000056430b644462 in standard_ProcessUtility (pstmt=0x56430bf5ea30, queryString=0x56430bf5dd08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56430bfb70b8, qc=0x7ffcce70ee00) at utility.c:710
#22 0x000056430b641e90 in PortalRunUtility (portal=portal@entry=0x56430c062268, pstmt=0x56430bf5ea30, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x56430bfb70b8, qc=qc@entry=0x7ffcce70ee00) at pquery.c:1156
#23 0x000056430b642517 in FillPortalStore (portal=portal@entry=0x56430c062268, isTopLevel=isTopLevel@entry=true) at ../../../src/include/nodes/pg_list.h:260
#24 0x000056430b64318d in PortalRun (portal=portal@entry=0x56430c062268, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x56430bf5eb10, altdest=altdest@entry=0x56430bf5eb10, qc=0x7ffcce70eff0) at pquery.c:761
#25 0x000056430b63ee91 in exec_simple_query (query_string=0x56430bf5dd08 "FETCH FORWARD 100000 FROM \"sqlutilcursor\"") at postgres.c:1254
#26 0x000056430b63fbd5 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4691
#27 0x000056430b5ac569 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4515
#28 BackendStartup (port=<optimized out>) at postmaster.c:4243
#29 ServerLoop () at postmaster.c:1811
#30 0x000056430b5ad627 in PostmasterMain (argc=3, argv=0x56430bf574f0) at postmaster.c:1483
#31 0x000056430b2f2d84 in main (argc=3, argv=0x56430bf574f0) at main.c:204
*************
Interestingly, when I ran the same query with JIT disabled, i.e. with
set jit_above_cost = -1 issued before the query, it finished fine without
creating millions of tmp files (the exact settings I used are sketched below).
Also, I tried running the query directly in psql. I didn't let it finish,
since it returns 60+ million rows, but it appears to work fine; I don't see an
explosion in the number of tmp files. So I think the fact that a named cursor
was used is important.
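The workaround that avoided the temp-file explosion for me, for reference:

-- Disable cost-based JIT for the session before running the query;
-- -1 means JIT is never triggered, regardless of the plan's estimated cost.
set jit_above_cost = -1;
-- (set jit = off; should have the same effect, though I only tested the above)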
I'm happy to provide more info if needed.
Thanks,
Sergey
