Re: Missing data_sync_elevel() for some calls of pg_fsync()?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: Missing data_sync_elevel() for some calls of pg_fsync()?
Date: 2019-12-02 05:43:23
Message-ID: CA+hUKGJ=voCaHai6MnZvWNsJosAh1gXDZOuHiaEqYgxYrX=M7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 2, 2019 at 5:58 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> I was just looking at some callers of pg_fsync(), to notice that some
> code paths don't use data_sync_elevel(). For some code paths, that's
> actually better to never PANIC (say backup_label file, logical
> decoding snapshot, lock file where FATAL/LOG are used now, etc.).
> However I have spotted three code paths where this is not done and I
> think that's not fine:
> - 2PC file generated at checkpoint time.
> - WAL segment initialization.
> - Temporary state file for a replication slot save, which may cause
> ERRORs at checkpoint time.

One of the distinctions I had in mind when reviewing/working on the
PANIC stuff was this:

1. Some code opens a file, writes some stuff to it, closes, and then
fsyncs it, and if that fails and and it ever retries it'll redo all of
those steps again. We know that some kernels might have thrown away
the data, but we don't care about the copy in the kernel's cache
because we'll write it out again next time around.

2. Some code, primarily buffer pool write-back code, writes data out
to the file, then throws away the only copy we have of it other than
the WAL by using the buffer for some other block, and then later
(usually in the checkpointer) fsyncs it. One way to look at it is
that if the fsync fails, the only place left to get that data (which
may represent committed transactions) is the WAL, and the only way to
get it is crash recovery. Another way to look at it is that if we
didn't PANIC, the checkpointer would try again, but it's easily
demonstrable that if it tries again, certain kernels will do nothing
and then tell you that it succeeded, so we need to prevent that or our
checkpoint would be a data-eating lie.

I didn't look closely at your patch, but I think those are category 1,
no? Admittedly there is an argument that we should panic in those
cases too, because otherwise we're second guessing how the kernel
works, and I did make a similar argument for why sync_file_range()
failure is panic-worthy.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2019-12-02 05:45:53 Re: Missing data_sync_elevel() for some calls of pg_fsync()?
Previous Message Amit Langote 2019-12-02 05:30:47 Re: pgbench -i progress output on terminal