Re: checkpointer: PANIC: could not fsync file: No such file or directory

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date: 2019-11-22 05:17:02
Message-ID: CAMsr+YG+WNf17xRwTZhSKgFP9p-PAxb9s1DqGZGqQ_NiVZTSPA@mail.gmail.com
Lists: pgsql-hackers

On Thu, 21 Nov 2019 at 09:07, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:

> On Tue, Nov 19, 2019 at 07:22:26PM -0600, Justin Pryzby wrote:
> > I was trying to reproduce what was happening:
> > set -x; psql postgres -txc "DROP TABLE IF EXISTS t" -c "CREATE TABLE t(i
> int unique); INSERT INTO t SELECT generate_series(1,999999)"; echo
> "begin;SELECT pg_export_snapshot(); SELECT pg_sleep(9)" |psql postgres -At
> >/tmp/snapshot& sleep 3; snap=`sed "1{/BEGIN/d}; q" /tmp/snapshot`;
> PGOPTIONS='-cclient_min_messages=debug' psql postgres -txc "ALTER TABLE t
> ALTER i TYPE bigint" -c CHECKPOINT; pg_dump -d postgres -t t --snap="$snap"
> |head -44;
> >
> > Under v12, with or without the CHECKPOINT command, it fails:
> > |pg_dump: error: query failed: ERROR: cache lookup failed for index 0
> > But under v9.5.2 (which I found quickly), without CHECKPOINT, it instead
> fails like:
> > |pg_dump: [archiver (db)] query failed: ERROR: cache lookup failed for
> index 16391
> > With the CHECKPOINT command, 9.5.2 works, but I don't see why it should
> be
> > needed, or why it would behave differently (or if it's related to this
> crash).
>
> Actually, I think that's at least related to documented behavior:
>
> https://www.postgresql.org/docs/12/mvcc-caveats.html
> |Some DDL commands, currently only TRUNCATE and the table-rewriting forms
> of ALTER TABLE, are not MVCC-safe. This means that after the truncation or
> rewrite commits, the table will appear empty to concurrent transactions, if
> they are using a snapshot taken before the DDL command committed.
>
> I don't know why CHECKPOINT allows it to work under 9.5, or if it's even
> related to the PANIC ..

The PANIC is a defense against potential corruption that can be caused by
some kinds of disk errors. It's likely that we used to just ERROR and
retry, and the retry would then succeed without complaint.

fsync_fname() is supposed to ignore errors for files that cannot be opened.
But that same message may be emitted by a number of other parts of the
code, and it looks like you didn't have log_error_verbosity = verbose, so we
don't have file/line info to tell which one emitted it.

The only other place I see that emits that error where a relation path
could be a valid argument is logical_end_heap_rewrite() in rewriteheap.c.
That calls the vfd layer's FileSync() and assumes that any failure is an
fsync() syscall failure. But FileSync() can also return failure if we fail
to reopen the underlying file managed by the vfd, per FileAccess().
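
To make the distinction concrete, here is a minimal standalone sketch in
plain POSIX C. It is not PostgreSQL code and does not use the vfd API;
sync_path() and the SyncResult enum are hypothetical names invented for
illustration. The point is only that "could not reopen the file" and
"fsync() failed" are separate outcomes that a caller may want to treat
differently:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes; nothing like this exists in PostgreSQL. */
typedef enum
{
    SYNC_OK,            /* file opened and fsync() succeeded */
    SYNC_FILE_MISSING,  /* open() failed with ENOENT: the file is gone */
    SYNC_OPEN_FAILED,   /* open() failed for some other reason */
    SYNC_FSYNC_FAILED   /* the fsync() syscall itself reported an error */
} SyncResult;

/*
 * Open a file by path and fsync it, reporting which step failed.
 * On any failure, errno is left as set by the failing call.
 */
static SyncResult
sync_path(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return (errno == ENOENT) ? SYNC_FILE_MISSING : SYNC_OPEN_FAILED;

    if (fsync(fd) != 0)
    {
        int save_errno = errno;     /* close() may overwrite errno */

        close(fd);
        errno = save_errno;
        return SYNC_FSYNC_FAILED;
    }

    close(fd);
    return SYNC_OK;
}
```

With something shaped like this, a caller could reserve the PANIC for
SYNC_FSYNC_FAILED and decide separately what SYNC_FILE_MISSING should mean.
Whether a missing file is actually harmless in the logical rewrite case is
exactly the open question.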

Would there be a legitimate case where a logical rewrite mapping file could
vanish without that being a problem? If so, we should probably be more
tolerant there.

--
Craig Ringer http://www.2ndQuadrant.com/
2ndQuadrant - PostgreSQL Solutions for the Enterprise
