Re: checkpointer: PANIC: could not fsync file: No such file or directory

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer: PANIC: could not fsync file: No such file or directory
Date: 2019-11-26 04:55:55
Message-ID: CA+hUKGLbC=+5DS8VOXqZ6peX_H6Zd_mxbV8vHAqS=ajM-x9wSQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 26, 2019 at 5:21 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> I looked and found a new "hint".
>
> On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote:
> > < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file "base/16491/1731839470.2": No such file or directory
> > < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was terminated by signal 6: Aborted
>
> An earlier segment of that relation had been opened successfully and was
> *still* opened:
>
> $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> For context:
>
> $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3
> 61:/var/lib/pgsql/12/data/base/16491/1757077748
> 62:/var/lib/pgsql/12/data/base/16491/1756223121.2
> 63:/var/lib/pgsql/12/data/base/16491/1731839470
>
> So this may be an issue only with relations>segment (but, that interpretation
> could also be very naive).

FTR I have been trying to reproduce this but failing so far. I'm
planning to dig some more in the next couple of days. Yeah, it's a .2
file, which means that it's one that would normally be unlinked after
you commit your transaction (unlike a no-suffix file, which would
normally be dropped at the next checkpoint after the commit, as our
strategy to prevent the relfilenode from being reused before the next
checkpoint cycle), but should normally have had a SYNC_FORGET_REQUEST
enqueued for it first. So the question is, how did it come to pass
that a .2 file was ENOENT but there was no forget request? Difficult,
given the definition of mdunlinkfork(). I wondered if something was
going wrong in queue compaction or something like that, but I don't
see it. I need to dig into the exact flow with the ALTER case to
see if there is something I'm missing there, and perhaps try
reproducing it with a tiny segment size to exercise some more
multisegment-related code paths.
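
To make the invariant in question concrete, here is a toy model in
Python. It is only a sketch and not PostgreSQL's actual code: the class
ToyCheckpointer, its methods and the send_forget flag are all made up
for illustration. It assumes nothing more than what is described above:
a segment with a pending fsync must have that request cancelled by a
forget request before it is unlinked, otherwise the next checkpoint
trips over ENOENT.

```python
import errno


class ToyCheckpointer:
    """Toy model of a pending-fsync table; hypothetical, not PostgreSQL code."""

    def __init__(self, existing_files):
        self.files = set(existing_files)  # stands in for the filesystem
        self.pending = set()              # segment files awaiting fsync

    def sync_request(self, path):
        # A backend wrote to this segment; remember to fsync it later.
        self.pending.add(path)

    def forget_request(self, path):
        # Cancel any pending fsync for this segment.
        self.pending.discard(path)

    def unlink(self, path, send_forget=True):
        # Expected order: cancel the pending fsync, then remove the file.
        # send_forget=False models the (so far unexplained) bad case.
        if send_forget:
            self.forget_request(path)
        self.files.discard(path)

    def checkpoint(self):
        # "fsync" every pending segment; a missing file is fatal here.
        for path in sorted(self.pending):
            if path not in self.files:
                raise FileNotFoundError(
                    errno.ENOENT, "could not fsync file", path)
        self.pending.clear()
```

With the forget request, sync_request("1731839470.2") followed by
unlink("1731839470.2") and checkpoint() completes quietly. With
unlink("1731839470.2", send_forget=False), checkpoint() raises
FileNotFoundError, which is the toy equivalent of the PANIC in the
report.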
