Re: checkpointer: PANIC: could not fsync file: No such file or directory

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <tmunro(at)postgresql(dot)org>
Subject: Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date: 2019-11-19 20:26:53
Message-ID: CA+hUKG+cTv=SujoL6wW7YH_vVkVOYJ14+9w+AxYBnRLdVSf-YQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 20, 2019 at 12:58 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory
> < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> /dev/vdb on /var/lib/pgsql type ext4 (rw,relatime,seclabel,data=ordered)
> Centos 7.7 qemu/KVM
> Linux database 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> There's no added tablespaces.
>
> Copying Thomas since I wonder if this is related:
> 3eb77eba Refactor the fsync queue for wider use.

It could be, since that commit changed some details of how that queue
works, and another relevant change is:

9ccdd7f6 PANIC on fsync() failure.

Perhaps we should not panic when we fail to open (as opposed to
fsync) the file, but that's not the root problem here. The root
problem is that we somehow thought we should be fsyncing a file that
had apparently already been removed (by CLUSTER, VACUUM FULL, DROP, a
rewriting ALTER, etc).

Usually, if a file is in the fsync queue and is later removed, we
handle that by searching the queue for a cancellation message (one
should always be sent before the file is unlinked), and I think the
"failures = 1" in your core file tells us that no cancellation
message was found. So this seems to indicate a problem somewhere in
that protocol: either a defect in 3eb77eba, or a pre-existing problem
that 9ccdd7f6 turned into a bigger one.
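
To make the protocol concrete, here is a much-simplified sketch of
the idea (made-up names, not the real sync.c code): whoever unlinks a
file must queue a cancel first, and the checkpointer treats a missing
file with no cancel seen as a reason to PANIC rather than risk losing
a write.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_REQUESTS 1024

typedef struct
{
    char    path[64];
    bool    canceled;       /* a cancel message was absorbed for this file */
} FsyncRequest;

static FsyncRequest queue[MAX_REQUESTS];
static int queue_len = 0;

/* Stand-in for open()+fsync(); returns false if the file is "missing". */
static bool
pretend_fsync(const char *path)
{
    return strcmp(path, "base/16491/1731839470.2") != 0;
}

/* A backend that dirtied a file asks the checkpointer to fsync it later. */
static void
remember_fsync_request(const char *path)
{
    snprintf(queue[queue_len].path, sizeof(queue[queue_len].path), "%s", path);
    queue[queue_len].canceled = false;
    queue_len++;
}

/*
 * Anyone about to unlink a file (DROP, CLUSTER, VACUUM FULL, rewriting
 * ALTER, ...) must queue a cancel first, so pending requests are forgotten.
 */
static void
forget_fsync_request(const char *path)
{
    for (int i = 0; i < queue_len; i++)
        if (strcmp(queue[i].path, path) == 0)
            queue[i].canceled = true;
}

/* Checkpoint time: fsync everything that wasn't canceled. */
static void
process_fsync_requests(void)
{
    for (int i = 0; i < queue_len; i++)
    {
        if (queue[i].canceled)
            continue;           /* file was legitimately removed */
        if (!pretend_fsync(queue[i].path))
            fprintf(stderr, "PANIC: could not fsync file \"%s\"\n",
                    queue[i].path);     /* no cancel seen: protocol broken */
    }
    queue_len = 0;
}

int
main(void)
{
    /* Normal case: the cancel arrives before the file is removed. */
    remember_fsync_request("base/16491/1234567");
    forget_fsync_request("base/16491/1234567");

    /* Broken case: the cancel was somehow never sent. */
    remember_fsync_request("base/16491/1731839470.2");

    process_fsync_requests();
    return 0;
}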

Looking into it.

> I can't find any relation with filenode nor OID matching 1731839470, nor a file
> named like that (which is maybe no surprise, since it's exactly the issue
> checkpointer had last week).
>
> A backup job would've started at 22:00 and probably would've run until 22:27,
> except that its backend was interrupted "because of crash of another server
> process". That uses pg_dump --snapshot.
>
> This shows a gap of OIDs between 1721850297 and 1746136569; the tablenames
> indicate that would've been between 2019-11-15 01:30:03,192 and 04:31:19,348.
> |SELECT oid, relname FROM pg_class ORDER BY 1 DESC;

By the way, it's relfilenode, not oid, that is used in these names
(though they start out the same). In a rewrite, the relfilenode
changes but the oid stays the same.
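
For example, the path in the error message is built roughly like this
(a sketch of the naming convention, not the real relpath() code,
assuming the default tablespace and 1GB segments): database OID, then
the relation's current relfilenode, then ".N" for each segment after
the first.

#include <stdio.h>

/* Sketch of how a path like "base/16491/1731839470.2" is formed. */
static void
print_segment_path(unsigned int db_oid, unsigned int relfilenode, int segno)
{
    if (segno == 0)
        printf("base/%u/%u\n", db_oid, relfilenode);
    else
        printf("base/%u/%u.%d\n", db_oid, relfilenode, segno);
}

int
main(void)
{
    /* The file from the PANIC: third 1GB segment of that relfilenode. */
    print_segment_path(16491, 1731839470, 2);
    return 0;
}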

> Ah, I found a maybe relevant log:
> |2019-11-15 22:15:59.592-05 | duration: 220283.831 ms statement: ALTER TABLE child.eric_enodeb_cell_201811 ALTER pmradiothpvolul
>
> So we altered that table (and 100+ others) with a type-promoting alter,
> starting at 2019-11-15 21:20:51,942. That involves DETACHing all but the most
> recent partitions, altering the parent, and then iterating over historic
> children to ALTER and reATTACHing them. (We do this to avoid locking the table
> for long periods, and to avoid worst-case disk usage).

Hmm.
