Re: should crash recovery ignore checkpoint_flush_after ?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Subject: Re: should crash recovery ignore checkpoint_flush_after ?
Date: 2020-01-18 18:48:22
Message-ID: 20200118184822.fqyfqzx6hukhia6j@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> One of our PG12 instances was in crash recovery for an embarrassingly long time
> after hitting ENOSPC. (Note: I first started writing this mail 10 months ago
> while running PG11, after having the same experience after OOM.) Running Linux.
>
> As I understand it, the first thing that happens is syncing every file in the
> data dir, like in initdb --sync. These instances were both 5+TB on zfs, with
> compression, so that's slow, but tolerable, and at least understandable, and
> with visible progress in ps.
>
> The 2nd stage replays WAL. strace shows it's occasionally running
> sync_file_range, and I think recovery might've been several times faster if
> we'd just dumped the data to the OS ASAP and fsynced once per file. In fact,
> I've just kill -9'd the recovery process and edited the config to disable this
> lest it spend all night in recovery.

I'm not quite sure what you mean here by "fsync once per file".
sync_file_range doesn't actually issue an fsync, even if it sounds like
it. In the case of checkpointing, what it basically does is ask the
kernel to start writing data back immediately, instead of waiting until
the fsyncs at the very end of the checkpoint. IOW, the data is going to
be written back *anyway* in short order.
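To make that concrete, here's a minimal standalone sketch (not PostgreSQL
source; the file name and sizes are invented for illustration) of how a
sync_file_range hint on Linux differs from an fsync:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/demo.dat", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        char buf[8192] = {0};
        if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); return 1; }

        /*
         * Ask the kernel to start writing these dirty pages back now.  This
         * call does not wait for the I/O to complete and gives no durability
         * guarantee by itself.
         */
        if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) < 0)
            perror("sync_file_range");

        /* Durability still comes from a later fsync. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }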

It's possible that ZFS's compression just does broken things here; I
don't know.

> That GUC is intended to reduce latency spikes caused by checkpoint fsync. But
> I think the default of 256kB between syncs is too limiting during recovery,
> and at that point it's better to optimize for throughput anyway, since no
> other backends are running (in that instance) and none can run until
> recovery finishes.

I don't think that'd be good by default - in my experience the stalls
caused by the kernel writing back massive amounts of data at once are
also problematic during recovery (and can lead to much higher %sys
too). You get the pattern of the fsync at the end taking forever while
the IO is idle beforehand. And you'd get the latency spikes once
recovery is over too.
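Here's a rough sketch of that pattern (again, not actual PostgreSQL code; the
256kB interval just mirrors the checkpoint_flush_after default, and the file
name and total size are invented). Hinting writeback every 256kB while writing
spreads the I/O out so the final fsync has little left to do; drop the hints
and all the dirty data piles up for that one fsync:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define FLUSH_AFTER (256 * 1024)    /* mirrors checkpoint_flush_after */
    #define BLOCK_SIZE  8192

    int main(void)
    {
        int fd = open("/tmp/flush_demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        char block[BLOCK_SIZE];
        memset(block, 'x', sizeof(block));

        off_t written = 0, flushed_upto = 0;
        for (int i = 0; i < 4096; i++)      /* 32MB total */
        {
            if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
            { perror("write"); return 1; }
            written += sizeof(block);

            /* Every FLUSH_AFTER bytes, nudge the kernel to start writeback. */
            if (written - flushed_upto >= FLUSH_AFTER)
            {
                sync_file_range(fd, flushed_upto, written - flushed_upto,
                                SYNC_FILE_RANGE_WRITE);
                flushed_upto = written;
            }
        }

        /*
         * Without the hints above, all 32MB could still be dirty here and
         * this fsync would stall; with them, most pages are already on their
         * way to disk.
         */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }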

> At least, if this setting is going to apply during
> recovery, the documentation should mention it (it's a "recovery checkpoint")

That makes sense.

> See also
> 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).

FWIW, I still think this is the wrong default, and that it causes our
users harm. It only makes sense because the reverse was previously the
default. But it's easy to see quite massive stalls even on fast NVMe
SSDs (as in tens of seconds with no transactions committing, in an OLTP
workload). Nor do I think it is really comparable with the checkpointing
setting, because there we *know* that we're about to fsync the file,
whereas in the backend case we might just use the fs page cache as an
extension of shared buffers.

Greetings,

Andres Freund
