Re: checkpointer: PANIC: could not fsync file: No such file or directory

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date: 2019-11-20 01:22:26
Message-ID: 20191120012226.GW30362@telsasoft.com
Lists: pgsql-hackers

On Tue, Nov 19, 2019 at 04:49:10PM -0600, Justin Pryzby wrote:
> On Wed, Nov 20, 2019 at 09:26:53AM +1300, Thomas Munro wrote:
> > Perhaps we should not panic if we failed to open (not fsync) the file,
> > but it's not the root problem here which is that somehow we thought we
> > should be fsyncing a file that had apparently been removed already
> > (due to CLUSTER, VACUUM FULL, DROP, rewriting ALTER etc).

Note, the ALTER was (I think) building an index in a parallel process:

2019-11-15 22:16:08.752-05 | 5976 | terminating connection because of crash of another server process
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:16:08.751-05 | 27384 | terminating any other active server processes
2019-11-15 22:16:07.098-05 | 27388 | could not fsync file "base/16491/1731839470.2": No such file or directory
2019-11-15 22:15:59.592-05 | 19860 | duration: 220283.831 ms statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolulscell TYPE integer USING pmradiothpvolulscell::integer
2019-11-15 22:15:59.459-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/1.0", size 5144576
2019-11-15 22:15:59.458-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/2.0", size 6463488
2019-11-15 22:15:59.456-05 | 19860 | temporary file: path "base/pgsql_tmp/pgsql_tmp19860.82.sharedfileset/0.0", size 4612096
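
Those .sharedfileset temp files look like the leftovers of a parallel btree build
(shared filesets are also used by parallel hash joins), hence the guess above.  If it
matters, here is a sketch of how one could retry a similar rewrite with the parallel
worker taken out of the picture (using the scratch table "t" from the repro script
further down, not the production table; max_parallel_maintenance_workers is PG 11+):

  SET max_parallel_maintenance_workers = 0;  -- disable parallel index builds
  ALTER TABLE t ALTER i TYPE bigint;         -- rewriting ALTER, rebuilds t's unique index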

FYI, that table is *currently* (5 days later):
ts=# \dti+ child.eric_enodeb_cell_201811*
Schema | Name | Type | Owner | Table | Size | Description
child | eric_enodeb_cell_201811 | table | telsasoft | | 2595 MB |
child | eric_enodeb_cell_201811_idx | index | telsasoft | eric_enodeb_cell_201811 | 120 kB |
child | eric_enodeb_cell_201811_site_idx | index | telsasoft | eric_enodeb_cell_201811 | 16 MB |

I don't know whether that table is likely to be the one with relfilenode 1731839470
(though it certainly wasn't its index), or whether that filenode belonged to a table
(or index) from an earlier ALTER.  I tentatively think we wouldn't have had any other
tables being dropped, partitions pruned, or maintenance commands running at the time.
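
To check whether anything still maps to that filenode, something like this should
work (16491 being the database OID from the path in the log; pg_filenode_relation()
exists since 9.4):

  -- Run in the database with OID 16491 (from "base/16491/1731839470.2").
  -- After a rewriting ALTER/CLUSTER/VACUUM FULL the old filenode is gone, so an
  -- empty result is expected; a hit would identify the relation.
  SELECT pg_filenode_relation(0, 1731839470);  -- 0 = the database's default tablespace

  -- The same thing via pg_class:
  SELECT oid::regclass AS rel, relkind FROM pg_class WHERE relfilenode = 1731839470;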

Checkpoint logs for good measure:
2019-11-15 22:18:26.168-05 | 10388 | checkpoint complete: wrote 2915 buffers (3.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=30.022 s, sync=0.472 s, total=32.140 s; sync files=107, longest=0.364 s, average=0.004 s; distance=297471 kB, estimate=297471 kB
2019-11-15 22:17:54.028-05 | 10388 | checkpoint starting: time
2019-11-15 22:16:53.753-05 | 10104 | checkpoint complete: wrote 98275 buffers (100.0%); 0 WAL file(s) added, 0 removed, 43 recycled; write=11.040 s, sync=0.675 s, total=11.833 s; sync files=84, longest=0.335 s, average=0.008 s; distance=698932 kB, estimate=698932 kB
2019-11-15 22:16:41.921-05 | 10104 | checkpoint starting: end-of-recovery immediate
2019-11-15 22:16:08.751-05 | 27384 | checkpointer process (PID 27388) was terminated by signal 6: Aborted
2019-11-15 22:15:33.03-05 | 27388 | checkpoint starting: time
2019-11-15 22:15:03.62-05 | 27388 | checkpoint complete: wrote 5436 buffers (5.5%); 0 WAL file(s) added, 0 removed, 45 recycled; write=28.938 s, sync=0.355 s, total=29.711 s; sync files=22, longest=0.174 s, average=0.016 s; distance=740237 kB, estimate=740237 kB
2019-11-15 22:14:33.908-05 | 27388 | checkpoint starting: time

I was trying to reproduce what was happening:
set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`; PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap" |head -44;
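
For readability, roughly what that one-liner does, step by step (the session split and
the snapshot name are illustrative; the real command abbreviates --snapshot to --snap):

  -- Session 1: export a snapshot and hold the transaction open
  -- (the one-liner keeps it open with pg_sleep(9) in a background psql).
  BEGIN;
  SELECT pg_export_snapshot();  -- returns a name like '00000003-0000001B-1'

  -- Session 2: rewrite the table (and its unique index) while session 1 is still open.
  ALTER TABLE t ALTER i TYPE bigint;
  CHECKPOINT;

  -- Then, from the shell, dump table t with the exported snapshot:
  --   pg_dump -d postgres -t t --snapshot='00000003-0000001B-1' | head -44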

Under v12, with or without the CHECKPOINT command, it fails:
|pg_dump: error: query failed: ERROR: cache lookup failed for index 0
But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead fails like:
|pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for index 16391
With the CHECKPOINT command, 9.5.2 works, but I don't see why it should be
needed, why it would behave differently, or whether it's related to this crash.
