Re: should crash recovery ignore checkpoint_flush_after ?

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Subject: Re: should crash recovery ignore checkpoint_flush_after ?
Date: 2020-01-18 20:11:12
Message-ID: 20200118201111.GP26045@telsasoft.com
Lists: pgsql-hackers

On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> Hi,
>
> On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > One of our PG12 instances was in crash recovery for an embarrassingly long time
> > after hitting ENOSPC. (Note, I first started writing this mail 10 months ago
> > while running PG11, after having the same experience after OOM.) Running linux.
> >
> > As I understand it, the first thing that happens is syncing every file in the
> > data dir, like in initdb --sync. These instances were both 5+TB on zfs, with
> > compression, so that's slow, but tolerable, and at least understandable, and
> > with visible progress in ps.
> >
> > The 2nd stage replays WAL. strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster if
> > we'd just dumped the data at the OS ASAP and fsynced once per file. In fact, I've
> > just kill -9'd the recovery process and edited the config to disable this, lest it
> > spend all night in recovery.
>
> I'm not quite sure what you mean here with "fsync once per file". The
> sync_file_range doesn't actually issue an fsync, even if it sounds like it.

I mean: if we didn't call sync_file_range(), we'd let the kernel handle the
writes and then fsync() each file at the end of the checkpoint, which happens
in any case.
I think I'll increase or maybe disable this GUC on our servers and, if needed,
adjust /proc/sys/vm/dirty_*ratio.

> It's possible that ZFS's compression just does broken things here, I
> don't know.

Or our settings aren't ideal, or recovery is just going to perform poorly in
that configuration. Which I'm ok with, since it should be rare anyway, and
recovery time is unlikely to be a big deal for us.

> > At least, if this setting is going to apply during
> > recovery, the documentation should mention it (it's a "recovery checkpoint")
>
> That makes sense.

Find attached.
I also modified a second sentence, since "that" was ambiguous and could be read
to refer to "stalls".

@@ -2994,17 +2994,19 @@ include_dir 'conf.d'
Whenever more than this amount of data has been
written while performing a checkpoint, attempt to force the
OS to issue these writes to the underlying storage. Doing so will
limit the amount of dirty data in the kernel's page cache, reducing
the likelihood of stalls when an <function>fsync</function> is issued at the end of the
checkpoint, or when the OS writes data back in larger batches in the
- background. Often that will result in greatly reduced transaction
+ background. This feature will often result in greatly reduced transaction
latency, but there also are some cases, especially with workloads
that are bigger than <xref linkend="guc-shared-buffers"/>, but smaller
than the OS's page cache, where performance might degrade. This
setting may have no effect on some platforms.
+ This setting also applies to the checkpoint written at the end of crash
+ recovery.
If this value is specified without units, it is taken as blocks,
that is <symbol>BLCKSZ</symbol> bytes, typically 8kB.
The valid range is
between <literal>0</literal>, which disables forced writeback,
and <literal>2MB</literal>. The default is <literal>256kB</literal> on
Linux, <literal>0</literal> elsewhere. (If <symbol>BLCKSZ</symbol> is not

What about also updating the ps display after the last xlog record is replayed?
Otherwise it shows "recovering <file>" for the duration of the recovery
checkpoint.

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7628,3 +7628,6 @@ StartupXLOG(void)
 	else
+	{
+		set_ps_display("recovery checkpoint", false);
 		CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
+	}

> > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
>
> FWIW, I still think this is the wrong default, and that it causes our
> users harm.

I have no opinion about the default, but the maximum seems low for a maximum.
Why not INT_MAX, like wal_writer_flush_after?

src/include/pg_config_manual.h:#define WRITEBACK_MAX_PENDING_FLUSHES 256
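
(Quick arithmetic, assuming the typical BLCKSZ of 8kB: that constant is exactly
where the GUC's documented 2MB ceiling comes from:)

```python
BLCKSZ = 8192                        # typical block size, in bytes
WRITEBACK_MAX_PENDING_FLUSHES = 256  # src/include/pg_config_manual.h

max_flush = WRITEBACK_MAX_PENDING_FLUSHES * BLCKSZ
print(max_flush // (1024 * 1024), "MB")  # prints: 2 MB
```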

Thanks,
Justin

Attachment Content-Type Size
v1-0001-Document-that-checkpoint_flush_after-applies-to-e.patch text/x-diff 1.4 KB
v1-0002-Avoid-ambiguous-that.patch text/x-diff 1.5 KB
