Checkpoint not retrying failed fsync?

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Checkpoint not retrying failed fsync?
Date: 2018-04-05 22:16:20
Lists: pgsql-hackers

This is only a preliminary report, as I'm still trying to analyze what's
going on, but:

In doing testing on FreeBSD with a filesystem set up to induce errors
controllably (using gconcat+gnop), I can get this to happen (on 11devel):

(note that "mytable" is on a tablespace on the erroring filesystem,
while "x" is on a clean filesystem)

postgres=# insert into mytable values (-1);
postgres=# checkpoint;
ERROR: checkpoint request failed
HINT: Consult recent messages in the server log for details.
postgres=# insert into x values (3);
postgres=# checkpoint;

(the message in the server log is the expected one about fsync failing)

Checking the WAL shows that there is indeed a checkpoint record for the
second checkpoint and pg_control points to it, so a crash restart at
this point would not try to replay the "mytable" write.

Furthermore, the trace output from the checkpointer process shows that
it is not even attempting an fsync of the failing file. This isn't like
the Linux fsync issue: I've confirmed that fsync will repeatedly fail on
the file until the underlying errors stop.

As far as I can tell from reading the code, if a checkpoint fails the
checkpointer is supposed to keep all the outstanding fsync requests for
next time. Am I wrong, or is there some failure in the logic to do this?
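
To make the expectation concrete, here is a minimal sketch of the
retry behavior described above. It is a toy model in Python, not
PostgreSQL's actual code: the class PendingSyncs, its methods, and
fake_fsync are all hypothetical names, and the real checkpointer's
data structures are not shown here.

```python
class PendingSyncs:
    """Toy model of a pending-fsync table (hypothetical, not PostgreSQL code)."""

    def __init__(self, fsync):
        self._fsync = fsync    # callable(path); raises OSError on failure
        self._pending = set()  # files with writes not yet known to be durable

    def register(self, path):
        """Remember that `path` has been written and still needs an fsync."""
        self._pending.add(path)

    def pending(self):
        return sorted(self._pending)

    def checkpoint(self):
        """Return True only if every pending file was fsynced successfully."""
        for path in sorted(self._pending):
            try:
                self._fsync(path)
            except OSError:
                # Keep this request (and any not yet processed) queued,
                # so the next checkpoint retries instead of forgetting it.
                return False
            self._pending.discard(path)
        return True


# Simulated filesystem: fsync on "mytable" fails until the error clears.
failing = {"mytable"}

def fake_fsync(path):
    if path in failing:
        raise OSError(5, "Input/output error")

syncs = PendingSyncs(fake_fsync)
syncs.register("mytable")
first = syncs.checkpoint()    # fails: "mytable" cannot be fsynced
syncs.register("x")
second = syncs.checkpoint()   # must fail again: "mytable" is still pending
failing.clear()
third = syncs.checkpoint()    # succeeds once the underlying errors stop
```

In this model the second checkpoint cannot complete while "mytable" is
still pending, which is the opposite of what the session above shows.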

Andrew (irc:RhodiumToad)

