Re: checkpointer: PANIC: could not fsync file: No such file or directory

From: Justin Pryzby <pryzby@telsasoft.com>
To: Thomas Munro <thomas.munro@gmail.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date: 2019-11-27 06:53:13
Message-ID: 20191127065313.GU16647@telsasoft.com
Lists: pgsql-hackers

This same crash occurred on a second server.
Also qemu/KVM, but this time the missing relfilenode is (supposed to be) on a secondary ZFS tablespace.
Linux database7 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

This is postgresql12-12.1-1PGDG.rhel7.x86_64 (same as first crash), running since:
|30515 Tue Nov 19 10:04:33 2019 S ? 00:09:54 /usr/pgsql-12/bin/postmaster -D /var/lib/pgsql/12/data/

Before that, this server had been running v12.0 since Oct 30 (without crashing).

In this case, the pg_dump --snap finished and released its snapshot at 21:50,
and there were no ALTERed tables. I see a temp file written since the previous
checkpoint, but not by a parallel worker as in the previous server's crash.

The crash happened while reindexing, though. The "DROP INDEX CONCURRENTLY" is
from pg_repack -i; it completed successfully, but is followed immediately by
the abort log. The following "REINDEX toast..." failed. In this case, I
*guess* that the missing filenode is due to a dropped index (570627937 or
otherwise). I don't see any other CLUSTER, VACUUM FULL, DROP, TRUNCATE or
ALTER within that checkpoint interval (note, we have 1 minute checkpoints).
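
To spell out the sequence I have in mind (this is only a sketch with made-up object and
tablespace names, and I haven't yet confirmed it reproduces the crash):

-- sketch only: names are invented and this is the suspected sequence, not a verified reproducer
CREATE TABLESPACE zfs_spc LOCATION '/zfs/pgdata';    -- stand-in for our secondary ZFS tablespace
CREATE TABLE t (i int) TABLESPACE zfs_spc;
CREATE INDEX t_i_idx ON t (i) TABLESPACE zfs_spc;
INSERT INTO t SELECT generate_series(1, 100000);     -- dirty index pages, so fsync requests get queued
DROP INDEX CONCURRENTLY t_i_idx;                     -- the index's on-disk files go away
CHECKPOINT;                                          -- checkpointer may still try to fsync the dropped file

With our 1-minute checkpoints there's always a checkpoint within a minute of the drop,
so if there is a race here we'd be fairly likely to hit it.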

Note, I double-checked the first server which crashed: it definitely wasn't
running pg_repack or the reindex script, since I had removed pg_repack12 from our
servers until 12.1 was installed, to avoid the "concurrently" progress-reporting
crash fixed at 1cd5bc3c. So I think ALTER TABLE ... TYPE and REINDEX can
both trigger this crash, at least on v12.1.
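
The common thread seems to be commands that swap in a new relfilenode and unlink the
old one, possibly while the checkpointer still holds a queued sync request for it.
Roughly, with invented names, just to illustrate the filenode change:

-- illustration only; object names are invented
CREATE TABLE some_table (x int);
CREATE INDEX some_table_x_idx ON some_table (x);
SELECT pg_relation_filenode('some_table'), pg_relation_filenode('some_table_x_idx');
ALTER TABLE some_table ALTER COLUMN x TYPE bigint;   -- rewrites the table (and its indexes): new relfilenodes
REINDEX INDEX some_table_x_idx;                      -- gives the index yet another relfilenode
SELECT pg_relation_filenode('some_table'), pg_relation_filenode('some_table_x_idx');   -- both changed; old files are gone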

Note I actually have *full logs*, which I've now saved. But here's an excerpt:

postgres=# SELECT log_time, message FROM ckpt_crash WHERE log_time BETWEEN '2019-11-26 23:40:20' AND '2019-11-26 23:48:58' AND user_name IS NULL ORDER BY 1;
2019-11-26 23:40:20.139-05 | checkpoint starting: time
2019-11-26 23:40:50.069-05 | checkpoint complete: wrote 11093 buffers (5.6%); 0 WAL file(s) added, 0 removed, 12 recycled; write=29.885 s, sync=0.008 s, total=29.930 s; sync files=71, longest=0.001 s, average=0.000 s; distance=193388 kB, estimate=550813 kB
2019-11-26 23:41:16.234-05 | automatic analyze of table "postgres.public.postgres_log_2019_11_26_2300" system usage: CPU: user: 3.00 s, system: 0.19 s, elapsed: 10.92 s
2019-11-26 23:41:20.101-05 | checkpoint starting: time
2019-11-26 23:41:50.009-05 | could not fsync file "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or directory
2019-11-26 23:42:04.397-05 | checkpointer process (PID 30560) was terminated by signal 6: Aborted
2019-11-26 23:42:04.397-05 | terminating any other active server processes
2019-11-26 23:42:04.397-05 | terminating connection because of crash of another server process
2019-11-26 23:42:04.42-05 | terminating connection because of crash of another server process
2019-11-26 23:42:04.493-05 | all server processes terminated; reinitializing
2019-11-26 23:42:05.651-05 | database system was interrupted; last known up at 2019-11-27 00:40:50 -04
2019-11-26 23:47:30.404-05 | database system was not properly shut down; automatic recovery in progress
2019-11-26 23:47:30.435-05 | redo starts at 3450/1B202938
2019-11-26 23:47:54.501-05 | redo done at 3450/205CE960
2019-11-26 23:47:54.501-05 | invalid record length at 3450/205CEA18: wanted 24, got 0
2019-11-26 23:47:54.567-05 | checkpoint starting: end-of-recovery immediate
2019-11-26 23:47:57.365-05 | checkpoint complete: wrote 3287 buffers (1.7%); 0 WAL file(s) added, 0 removed, 5 recycled; write=2.606 s, sync=0.183 s, total=2.798 s; sync files=145, longest=0.150 s, average=0.001 s; distance=85808 kB, estimate=85808 kB
2019-11-26 23:47:57.769-05 | database system is ready to accept connections
2019-11-26 23:48:57.774-05 | checkpoint starting: time

< 2019-11-27 00:42:04.342 -04 postgres >LOG: duration: 13.028 ms statement: DROP INDEX CONCURRENTLY "child"."index_570627937"
< 2019-11-27 00:42:04.397 -04 >LOG: checkpointer process (PID 30560) was terminated by signal 6: Aborted
< 2019-11-27 00:42:04.397 -04 >LOG: terminating any other active server processes
< 2019-11-27 00:42:04.397 -04 >WARNING: terminating connection because of crash of another server process
< 2019-11-27 00:42:04.397 -04 >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 2019-11-27 00:42:04.397 -04 >HINT: In a moment you should be able to reconnect to the database and repeat your command.
...
< 2019-11-27 00:42:04.421 -04 postgres >STATEMENT: begin; LOCK TABLE child.ericsson_sgsn_ss7_remote_sp_201911 IN SHARE MODE;REINDEX INDEX pg_toast.pg_toast_570627929_index;commit

Here are all the nondefault settings which seem plausibly relevant or interesting:
autovacuum_analyze_scale_factor | 0.005 | | configuration file
autovacuum_analyze_threshold | 2 | | configuration file
checkpoint_timeout | 60 | s | configuration file
max_files_per_process | 1000 | | configuration file
max_stack_depth | 2048 | kB | environment variable
max_wal_size | 4096 | MB | configuration file
min_wal_size | 4096 | MB | configuration file
shared_buffers | 196608 | 8kB | configuration file
shared_preload_libraries | pg_stat_statements | | configuration file
wal_buffers | 2048 | 8kB | override
wal_compression | on | | configuration file
wal_segment_size | 16777216 | B | override
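
(For anyone wanting the full list: those columns are name/setting/unit/source, and a query
along these lines against pg_settings gives roughly the same thing:)

-- roughly: list settings not at their built-in defaults
SELECT name, setting, unit, source
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;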

(gdb) bt
#0 0x00007f07c0070207 in raise () from /lib64/libc.so.6
#1 0x00007f07c00718f8 in abort () from /lib64/libc.so.6
#2 0x000000000087752a in errfinish (dummy=<optimized out>) at elog.c:552
#3 0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
#4 0x0000000000734dd9 in CheckPointBuffers (flags=flags@entry=256) at bufmgr.c:2588
#5 0x00000000005095e1 in CheckPointGuts (checkPointRedo=57518713529016, flags=flags@entry=256) at xlog.c:9006
#6 0x000000000050ff86 in CreateCheckPoint (flags=flags@entry=256) at xlog.c:8795
#7 0x00000000006e4092 in CheckpointerMain () at checkpointer.c:481
#8 0x000000000051fcd5 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd82122400) at bootstrap.c:461
#9 0x00000000006ee680 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5392
#10 0x00000000006ef9ca in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:2973
#11 <signal handler called>
#12 0x00007f07c012ef53 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#14 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19d4280) at postmaster.c:1377
#15 0x0000000000484cd3 in main (argc=3, argv=0x19d4280) at main.c:228

#3 0x000000000075c8ec in ProcessSyncRequests () at sync.c:398
path = "pg_tblspc/16401/PG_12_201909212/16460/973123799.10", '\000' <repeats 14 times>, ...
failures = 1
sync_in_progress = true
hstat = {hashp = 0x19fd2f0, curBucket = 1443, curEntry = 0x0}
entry = 0x1a61260
absorb_counter = <optimized out>
processed = 23
sync_start = {tv_sec = 21582125, tv_nsec = 303557162}
sync_end = {tv_sec = 21582125, tv_nsec = 303536006}
sync_diff = <optimized out>
elapsed = <optimized out>
longest = 1674
total_elapsed = 7074
__func__ = "ProcessSyncRequests"
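
In case it's useful to whoever digs into this: the filenode in that path can be checked
against the catalog after recovery with pg_filenode_relation (the OIDs below are from the
failed path; a NULL result just means nothing currently owns that filenode):

-- 16401 = tablespace OID, 973123799 = relfilenode, both from the path in the error
SELECT pg_filenode_relation(16401, 973123799);   -- NULL if no relation currently has this filenode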
