Re: shared tempfile was not removed on statement_timeout (unreproducible)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Subject: Re: shared tempfile was not removed on statement_timeout (unreproducible)
Date: 2019-12-13 02:03:47
Message-ID: CA+hUKGJStr-3B6qNnFEOpES8HHc3Wwe3wSrYYQJcQhHuTB9SdQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 13, 2019 at 7:05 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> I have a nagios check on ancient tempfiles, intended to catch debris left by
> crashed processes. But triggered on this file:
>
> $ sudo find /var/lib/pgsql/12/data/base/pgsql_tmp -ls
> 142977 4 drwxr-x--- 3 postgres postgres 4096 Dec 12 11:32 /var/lib/pgsql/12/data/base/pgsql_tmp
> 169868 4 drwxr-x--- 2 postgres postgres 4096 Dec 7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset
> 169347 5492 -rw-r----- 1 postgres postgres 5619712 Dec 7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/0.0
> 169346 5380 -rw-r----- 1 postgres postgres 5505024 Dec 7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/1.0
>
> I found:
> 2019-12-07 01:35:56 | 11025 | postgres | canceling statement due to statement timeout | CLUSTER pg_stat_database_snap USI
> 2019-12-07 01:35:56 | 11025 | postgres | temporary file: path "base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | CLUSTER pg_stat_database_snap USI

Hmm. I played around with this and couldn't reproduce it, but I
thought of something. What if the statement timeout is reached while
we're in here:

#0  PathNameDeleteTemporaryDir (dirname=0x7fffffffd010 "base/pgsql_tmp/pgsql_tmp28884.31.sharedfileset") at fd.c:1471
#1  0x0000000000a32c77 in SharedFileSetDeleteAll (fileset=0x80182e2cc) at sharedfileset.c:177
#2  0x0000000000a327e1 in SharedFileSetOnDetach (segment=0x80a6e62d8, datum=34385093324) at sharedfileset.c:206
#3  0x0000000000a365ca in dsm_detach (seg=0x80a6e62d8) at dsm.c:684
#4  0x000000000061621b in DestroyParallelContext (pcxt=0x80a708f20) at parallel.c:904
#5  0x00000000005d97b3 in _bt_end_parallel (btleader=0x80fe9b4b0) at nbtsort.c:1473
#6  0x00000000005d92f0 in btbuild (heap=0x80a7bc4c8, index=0x80a850a50, indexInfo=0x80fec1ab0) at nbtsort.c:340
#7  0x000000000067445b in index_build (heapRelation=0x80a7bc4c8, indexRelation=0x80a850a50, indexInfo=0x80fec1ab0, isreindex=true, parallel=true) at index.c:2963
#8  0x0000000000677bd3 in reindex_index (indexId=16532, skip_constraint_checks=true, persistence=112 'p', options=0) at index.c:3591
#9  0x0000000000678402 in reindex_relation (relid=16508, flags=18, options=0) at index.c:3807
#10 0x000000000073928f in finish_heap_swap (OIDOldHeap=16508, OIDNewHeap=16573, is_system_catalog=false, swap_toast_by_content=false, check_constraints=false, is_internal=true, frozenXid=604, cutoffMulti=1, newrelpersistence=112 'p') at cluster.c:1409
#11 0x00000000007389ab in rebuild_relation (OldHeap=0x80a7bc4c8, indexOid=16532, verbose=false) at cluster.c:622
#12 0x000000000073849e in cluster_rel (tableOid=16508, indexOid=16532, options=0) at cluster.c:428
#13 0x0000000000737f22 in cluster (stmt=0x800cfcbf0, isTopLevel=true) at cluster.c:185
#14 0x0000000000a7cc5c in standard_ProcessUtility (pstmt=0x800cfcf40, queryString=0x800cfc120 "cluster t USING t_i_idx ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x800cfd030, completionTag=0x7fffffffe0b0 "") at utility.c:654

The CHECK_FOR_INTERRUPTS() inside the walkdir() loop could ereport()
out of there after deleting some but not all of your files, and the
code in dsm_detach() has already popped the callback (which it does
"to avoid infinite error recursion"), so it won't be run again during
error cleanup.  Hmm.  But then, for that theory to hold, maybe the two
log lines you quoted should appear in the other order.
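To make that failure mode concrete, here's a self-contained sketch
(plain C, not PostgreSQL source; dsm_detach_sim(), delete_fileset_sim()
and the use of longjmp() in place of ereport()/CHECK_FOR_INTERRUPTS()
are all stand-ins I made up for illustration) of the pop-before-call
pattern and how an interrupt inside the callback leaves part of the
fileset on disk:

/*
 * Self-contained sketch (not PostgreSQL source) of the hazard described
 * above: dsm_detach() pops each on-detach callback off its list before
 * invoking it, so an error raised inside the callback doesn't cause it
 * to be run again during error cleanup.  ereport()/CHECK_FOR_INTERRUPTS()
 * is simulated with longjmp(); the fileset is just a counter of files
 * left to unlink.
 */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf error_context;
static int	files_remaining = 3;	/* stand-ins for 0.0, 1.0, 2.0 */
static int	interrupt_pending = 1;	/* statement_timeout has fired */

/* Stand-in for the walkdir() loop in PathNameDeleteTemporaryDir(). */
static void
delete_fileset_sim(void)
{
	while (files_remaining > 0)
	{
		files_remaining--;			/* "unlink" one temp file */
		if (interrupt_pending)		/* CHECK_FOR_INTERRUPTS() */
			longjmp(error_context, 1);	/* ereport(ERROR, ...) */
	}
}

/* Stand-in for dsm_detach(): pop the callback first, then call it. */
static void (*on_detach_callback) (void) = delete_fileset_sim;

static void
dsm_detach_sim(void)
{
	void		(*cb) (void) = on_detach_callback;

	on_detach_callback = NULL;	/* popped "to avoid infinite error recursion" */
	cb();
}

int
main(void)
{
	if (setjmp(error_context) == 0)
		dsm_detach_sim();
	else if (files_remaining > 0)	/* error cleanup: callback already gone */
		printf("%d temp file(s) left behind in the fileset directory\n",
			   files_remaining);
	return 0;
}

Running it reports that 2 of the 3 simulated files survive, which looks
like the shape of what you found: 2.0 apparently removed (and logged),
0.0 and 1.0 and the .sharedfileset directory itself left behind.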

> Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks
> non-regular files and thus shared filesets. Maybe that's worth discussion on a
> new thread ?
>
> src/backend/utils/adt/genfile.c
>         /* Ignore anything but regular files */
>         if (!S_ISREG(attrib.st_mode))
>                 continue;

+1, that's worth fixing.
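
For the sake of discussion, here's a standalone sketch of what the
listing could look like if it descended into fileset directories
instead of skipping every non-regular entry (plain POSIX C, not an
actual genfile.c patch; list_tmpdir() and the one-level recursion are
assumptions of mine for illustration):

/*
 * Standalone sketch, not an actual genfile.c patch: list regular files
 * in the temp directory as pg_ls_tmpdir() does today, but also descend
 * one level into directories (the pgsql_tmpNNN.N.sharedfileset ones) so
 * their member files are reported instead of silently skipped.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static void
list_tmpdir(const char *dir, int depth)
{
	DIR		   *d = opendir(dir);
	struct dirent *de;

	if (d == NULL)
		return;
	while ((de = readdir(d)) != NULL)
	{
		char		path[4096];
		struct stat st;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (stat(path, &st) != 0)
			continue;
		if (S_ISREG(st.st_mode))
			printf("%s\t%lld bytes\n", path, (long long) st.st_size);
		else if (S_ISDIR(st.st_mode) && depth == 0)
			list_tmpdir(path, depth + 1);	/* shared fileset directory */
		/* anything else is still ignored */
	}
	closedir(d);
}

int
main(void)
{
	list_tmpdir("base/pgsql_tmp", 0);
	return 0;
}

Whether pg_ls_tmpdir() should flatten the fileset contents like this,
or instead report the directories themselves, seems like exactly the
thing to settle on that new thread.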
